Closed: RemiLacroix-IDRIS closed this issue 3 years ago.
No idea why but you are correct. You're using 64 workers here?
On Fri, Aug 28, 2020, 09:39 Rémi Lacroix notifications@github.com wrote:
Hi,
We have noticed that the CUDA version is consistently and noticeably slower than the OpenCL when using the Solis-Wets method.
Here is an example using https://github.com/diogomart/AD-GPU_set_of_42 (nruns=10):
- CUDA:
  - without overlapping: 268.200 sec
  - with overlapping (10 threads): 132.079 sec
- OpenCL:
  - without overlapping: 230.475 sec
  - with overlapping (10 threads): 97.917 sec
Any idea why? @scottlegrand https://github.com/scottlegrand maybe?
Best regards, Rémi
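As a quick sanity check on these numbers, the relative slowdown of the Cuda build can be computed directly from the timings above; the sketch below is just that arithmetic, not a new measurement:

```python
def slowdown_pct(slower: float, faster: float) -> float:
    """Percent by which `slower` exceeds `faster`."""
    return round((slower / faster - 1.0) * 100.0, 1)

# Timings in seconds from the AD-GPU_set_of_42 run above (nruns=10)
cuda_plain, ocl_plain = 268.200, 230.475   # without overlapping
cuda_olap, ocl_olap = 132.079, 97.917      # with overlapping (10 threads)

print(slowdown_pct(cuda_plain, ocl_plain))  # Cuda vs OpenCL, no overlap -> 16.4
print(slowdown_pct(cuda_olap, ocl_olap))    # Cuda vs OpenCL, overlap    -> 34.9
```

So the gap is about 16% without overlapping and widens to about 35% with it, consistent with Rémi's observation.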
@RemiLacroix-IDRIS and I were working on a Tesla V100. I performed another test on a P100 and noticed a 37.5% speedup with OpenCL + overlap (7 cores) compared to CUDA without overlap.
Could you analyze whether they're doing the same number of energy evaluations and iterations?
@RemiLacroix-IDRIS @lpmcsn Thanks, we are aware of this and yet there doesn't seem to be anything in the Cuda SW implementation that should make it slower.
Diogo, Scott, and I looked over that part of the code quite thoroughly. We also added Scott's optimizations to the Cuda SW code in PR #98.
This is hard to analyze because there isn't a practical Nvidia OpenCL profiler anymore; otherwise we would have figured this out already.
@scottlegrand Cuda and OpenCL SW should do about the same number of generations in our code. I just tested with `make DEVICE={GPU,OCLGPU} NUMWI=128 test` - both used about 2.5 M evals and 107 generations on our Titan V node. We also made sure they show the same convergence before merging the Cuda code (and at every successive PR as well) ...
You really ought to run at 64 workers; you're going to get the best occupancy. You're mostly wasting half the GPU at 128.
I don't know if it's still necessary, but the docking parameters are the same for CUDA and OpenCL + overlap:
```
Number of runs: 20
Number of energy evaluations: 2500000
Number of generations: 27000
Size of population: 150
Rate of crossover: 80.000000%
Tournament selection probability limit: 60.000000%
Rate of mutation: 2.000000%
Maximal allowed delta movement: +/- 6.000000 A
Maximal allowed delta angle: +/- 90.000000
Rate of local search: 80.000000%
Maximal number of local search iterations: 300
Rho lower bound: 0.010000
Spread of local search delta movement: 2.000000 A
Spread of local search delta angle: 74.999999
Limit of consecutive successes/failures: 4
```
@scottlegrand we did some fairly extensive tests a while back and the 64 vs 128 workers situation is not very straightforward. Using OpenCL, we observed better performance with wi=128 for:

- GTX980, solis-wets
- Vega-56, solis-wets
- Vega-56, adadelta

And better performance with wi=64 for:

- GTX980, adadelta

I never saw 128 workers beat 64 workers under Cuda, especially not in that benchmark, and there are solid reasons for that being the case. But what you do about that is up to you. If I had the time to work on it, I wouldn't have this be a parameter selected at compile time but rather something selected at runtime based on the problem size.
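Scott's idea of selecting the work-group size at runtime rather than at compile time could be sketched as below. This is purely hypothetical illustration code, not AutoDock-GPU's API; the `num_atoms` input and the threshold are made up, and a real decision would presumably also factor in the GPU generation and back end:

```python
def pick_numwi(num_atoms: int, cuda_backend: bool) -> int:
    """Hypothetical runtime heuristic for the threads-per-block (NUMWI) setting.

    In this thread, 64 workers consistently won under Cuda, while OpenCL
    sometimes preferred 128 depending on the device and scoring method.
    """
    if cuda_backend:
        return 64  # 128 never beat 64 under Cuda in these tests
    # Larger ligands give bigger work groups more useful work per step.
    return 128 if num_atoms > 64 else 64

print(pick_numwi(100, cuda_backend=True))   # 64
print(pick_numwi(100, cuda_backend=False))  # 128
```

The point of the sketch is only that the choice can be made per problem at launch time instead of being baked in by `make NUMWI=...`.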
Also, that's really old data - like 2015 old. You need to do this on at least a per-GPU-generation basis.
Those tests I posted are OpenCL.

> Also that's really old data like 2015 old data.

Did you just make this up? These tests are from June 2020.
The data may be from June 2020, but the HW is from late 2014....
https://en.wikipedia.org/wiki/GeForce_900_series
Since then, Pascal, Volta, and now almost even Turing have come and gone... And we are in the age of Ampere...
Each GPU generation is as different from the previous one as the different actors portraying the Doctor. Metaphorically, they all travel through time in a blue box and save the world again and again, but each has its own quirks, strengths, and weaknesses...
Check out https://ambermd.org/GPUPerformance.php
A $1000 GPU today is ~4x faster than a GTX 980 Ti from 2015 (which is itself faster than a GTX 980).
But more importantly, the HW cache and threadblock sizes are completely different as is the number of threads in flight needed to achieve peak perf. It's a pain, but IMO this sort of stuff should be managed under the hood by GPU software. All IMO of course...
And absolutely 100% I suspect a similar tale could be told for AMD GPUs.
@scottlegrand I would urge you to keep it civil, please, and stop this nonsensical bullying. What you say may be true for Cuda, but it's neither relevant to Rémi's issue nor is it true for OpenCL or for some of the cards we've run on.
We see things very differently here clearly. But if you see my inquiry into understanding the root cause of what's actually going on here as "nonsensical bullying" then the best thing for me to do is exit the conversation. Please remove me from further discussion here.
This isn't merely a question of seeing things "differently". We engaged with you in good faith. You did not.
Instead of taking what we wrote at face value, you decided first to lecture me on why I hadn't used `NUMWI=64` - which is irrelevant to the number of generations, which was the question at hand - and then to top it off you decided to deride Diogo when what he wrote didn't fit your narrative. To me, this is bullying, plain and simple.
@RemiLacroix-IDRIS Maybe this is useful for you. This data is for the Titan V, which is Volta architecture. Using our branch `9a0d852`, OpenCL, and ADADELTA, wi=128 is marginally faster (8%), but within the standard deviation (11%).
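The "marginally faster, but within the standard deviation" comparison is easy to script for anyone repeating these runs; here is a minimal sketch with made-up runtime samples (the numbers below are illustrative, not measurements from this thread):

```python
from statistics import mean, stdev

def within_noise(times_a, times_b):
    """Crude check: is the difference in mean runtime smaller than the
    larger of the two samples' standard deviations?"""
    gap = abs(mean(times_a) - mean(times_b))
    return gap < max(stdev(times_a), stdev(times_b))

# Illustrative per-run wall times in seconds for wi=64 vs wi=128
wi64 = [108.0, 101.0, 117.0, 95.0, 110.0]
wi128 = [100.0, 93.0, 107.0, 88.0, 101.0]
print(within_noise(wi64, wi128))  # True: ~8% mean gap, but inside the spread
```

A proper treatment would use a t-test, but for deciding whether a NUMWI difference is worth acting on, this level of check is usually enough.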
Forgot to mention: the number of runs (`-nrun`) is 100, which may play a role. The number of systems is 36 and the numbers of evaluations range from 32k to 2048k.
I tried 2 different datasets and in all cases CUDA was noticeably slower, but I have been doing all my tests with `NUMWI=64` and `nrun=10`. I can try `NUMWI=128` and also a higher `nrun` next week.
@lpmcsn: you need to make sure to compare CUDA/OpenCL with the same overlapping settings (i.e. the same `OMP_NUM_THREADS`).
@RemiLacroix-IDRIS @lpmcsn In the output dlg files there are two timings given: "Run time" and "Idle time". The run time field is the compute time, which is y in "Job #x took y sec after waiting z sec for setup" in the runtime output; z is the idle time field.
When you add the run times together, with and without the file list they typically end up being very close, if not the same. The reason is that currently the calculations for each ligand are still sequential. This may change eventually.
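Adding up those fields can be scripted; this is a small sketch (not part of AutoDock-GPU) that parses the "Job #x took y sec after waiting z sec for setup" lines from captured runtime output, using a made-up two-job log as the example:

```python
import re

JOB_RE = re.compile(r"Job #\d+ took ([\d.]+) sec after waiting ([\d.]+) sec for setup")

def sum_times(log_text: str):
    """Return (total run time, total idle time) in seconds."""
    run = idle = 0.0
    for took, waited in JOB_RE.findall(log_text):
        run += float(took)
        idle += float(waited)
    return run, idle

log = """Job #1 took 12.5 sec after waiting 0.25 sec for setup
Job #2 took 10.25 sec after waiting 0.5 sec for setup"""
print(sum_times(log))  # (22.75, 0.75)
```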
> I tried 2 different datasets and in all cases CUDA was noticeably slower, but I have been doing all my tests with `NUMWI=64` and `nrun=10`. I can try `NUMWI=128` and also a higher `nrun` next week.

I did more tests using the Solis-Wets method and here are my findings:

- the OpenCL version is always faster than the CUDA version for a given combination of `NUMWI` (64 or 128) and `nrun` (10 or 100)
- the OpenCL version is slightly faster with `NUMWI=128` than with `NUMWI=64`
- the CUDA version is slower with `NUMWI=128` than with `NUMWI=64`.

How could we explain these results?
@RemiLacroix-IDRIS @lpmcsn I've been rerunning SW with `nruns=10`, `lsrat=100.0`, and otherwise standard settings, and get these runtimes on a Titan V:
Cuda
OpenCL
(Cuda's minimum `NUMWI` is 32, as this is needed for the reduction in the kernels.) This is on the exact same hardware with the exact same options, with Cuda 10.0 and Nvidia's OpenCL driver. It looks like OpenCL and Cuda behave rather differently with respect to the number of work units/threads per block.
@diogomart @RemiLacroix-IDRIS @lpmcsn To complete the picture, here are the ADADELTA runtimes I get with `nruns=10`, `lsrat=100.0`, and otherwise standard settings on the same Titan V:
Cuda
OpenCL
@RemiLacroix-IDRIS @lpmcsn To summarize, it looks like the best you can do is to run a small subset of your calculations with varying `NUMWI` settings (and maybe even OpenCL vs Cuda, as both should work on your machine) and choose the best ones for your workload. I think Scott's suggestion to make `NUMWI` a runtime option is not a bad one, and it is likely I'll implement this at some point.
It feels like it should be possible to optimize the CUDA implementation of Solis-Wets to have the same performance as the OpenCL version.
But in the meantime, we will advise our users to check both versions and maybe to prefer the OpenCL version if they use Solis-Wets.
Once again removing myself from this thread. Please do not add me again.