Closed: RemiLacroix-IDRIS closed this issue 3 years ago.
No idea why but you are correct. You're using 64 workers here?
On Fri, Aug 28, 2020, 09:39 Rémi Lacroix notifications@github.com wrote:
Hi,
We have noticed that the CUDA version is consistently and noticeably slower than the OpenCL when using the Solis-Wets method.
Here is an example using https://github.com/diogomart/AD-GPU_set_of_42 (nruns=10):
- CUDA:
  - without overlapping: 268.200 sec
  - with overlapping (10 threads): 132.079 sec
- OpenCL:
  - without overlapping: 230.475 sec
  - with overlapping (10 threads): 97.917 sec
Any idea why? @scottlegrand https://github.com/scottlegrand maybe?
Best regards, Rémi
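As a quick sanity check on these numbers, the relative slowdown of the Cuda build can be computed directly from the timings above; the sketch below is just that arithmetic, not a new measurement:

```python
def slowdown_pct(slower: float, faster: float) -> float:
    """Percent by which `slower` exceeds `faster`."""
    return round((slower / faster - 1.0) * 100.0, 1)

# Timings in seconds from the AD-GPU_set_of_42 run above (nruns=10)
cuda_plain, ocl_plain = 268.200, 230.475   # without overlapping
cuda_olap, ocl_olap = 132.079, 97.917      # with overlapping (10 threads)

print(slowdown_pct(cuda_plain, ocl_plain))  # Cuda vs OpenCL, no overlap -> 16.4
print(slowdown_pct(cuda_olap, ocl_olap))    # Cuda vs OpenCL, overlap    -> 34.9
```

So the gap is about 16% without overlapping and widens to about 35% with it, consistent with Rémi's observation.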
@RemiLacroix-IDRIS and I were working on a Tesla V100. I performed another test on a P100 and noticed a 37.5% speedup with OpenCL + overlap (7 cores) compared to CUDA without overlap.
Could you analyze whether they're doing the same number of energy evaluations and iterations?
@RemiLacroix-IDRIS @lpmcsn Thanks, we are aware of this and yet there doesn't seem to be anything in the Cuda SW implementation that should make it slower.
Diogo, Scott, and I looked over that part of the code quite thoroughly. We also added Scott's optimizations to the Cuda SW code in PR #98.
This is hard to analyze because there isn't a practical Nvidia OpenCL profiler anymore; otherwise we would have figured this out already.
@scottlegrand Cuda and OpenCL SW should do about the same number of generations in our code. I just tested with `make DEVICE={GPU,OCLGPU} NUMWI=128 test` - both used about 2.5 M evals and 107 generations on our Titan V node. We also made sure they show the same convergence before merging the Cuda code (and at every successive PR as well) ...
You really ought to run at 64 workers; you're going to get the best occupancy. You're mostly wasting half the GPU at 128.
I don't know if it's still necessary, but the docking parameters are the same for CUDA and OpenCL + overlap:
```
Number of runs: 20
Number of energy evaluations: 2500000
Number of generations: 27000
Size of population: 150
Rate of crossover: 80.000000%
Tournament selection probability limit: 60.000000%
Rate of mutation: 2.000000%
Maximal allowed delta movement: +/- 6.000000 A
Maximal allowed delta angle: +/- 90.000000
Rate of local search: 80.000000%
Maximal number of local search iterations: 300
Rho lower bound: 0.010000
Spread of local search delta movement: 2.000000 A
Spread of local search delta angle: 74.999999
Limit of consecutive successes/failures: 4
```
@scottlegrand we did some fairly extensive tests a while back and the 64 vs 128 workers situation is not very straightforward. Using OpenCL, we observed better performance with wi=128 for:

- GTX980, solis-wets
- Vega-56, solis-wets
- Vega-56, adadelta

And better performance with wi=64 for:

- GTX980, adadelta

I never saw 128 workers beat 64 workers under Cuda, especially not in that benchmark, and there are solid reasons for that being the case. But what you do about that is up to you. If I had the time to work on it, I wouldn't have this be a parameter selected at compile time but rather something selected at runtime based on the problem size.
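Scott's idea of selecting the work-group size at runtime rather than at compile time could be sketched as below. This is purely hypothetical illustration code, not AutoDock-GPU's API; the `num_atoms` input and the threshold are made up, and a real decision would presumably also factor in the GPU generation and back end:

```python
def pick_numwi(num_atoms: int, cuda_backend: bool) -> int:
    """Hypothetical runtime heuristic for the threads-per-block (NUMWI) setting.

    In this thread, 64 workers consistently won under Cuda, while OpenCL
    sometimes preferred 128 depending on the device and scoring method.
    """
    if cuda_backend:
        return 64  # 128 never beat 64 under Cuda in these tests
    # Larger ligands give bigger work groups more useful work per step.
    return 128 if num_atoms > 64 else 64

print(pick_numwi(100, cuda_backend=True))   # 64
print(pick_numwi(100, cuda_backend=False))  # 128
```

The point of the sketch is only that the choice can be made per problem at launch time instead of being baked in by `make NUMWI=...`.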
Also, that's really old data - like 2015 old. You need to do this on at least a per-GPU-generation basis.
Those tests I posted are OpenCL.

> Also that's really old data like 2015 old data.

Did you just make this up? These tests are from June 2020.
The data may be from June 2020, but the HW is from late 2014....
https://en.wikipedia.org/wiki/GeForce_900_series
Since then, Pascal, Volta, and now almost even Turing have come and gone... And we are in the age of Ampere...
Each GPU generation is as different from the previous one as the different actors portraying the Doctor. Metaphorically, they all travel through time in a blue box and save the world again and again, but each has its own quirks, strengths, and weaknesses...
Check out https://ambermd.org/GPUPerformance.php
A $1000 GPU today is ~4x faster than a GTX 980 Ti from 2015 (which is itself faster than a GTX 980).
But more importantly, the HW cache and threadblock sizes are completely different as is the number of threads in flight needed to achieve peak perf. It's a pain, but IMO this sort of stuff should be managed under the hood by GPU software. All IMO of course...
And absolutely 100% I suspect a similar tale could be told for AMD GPUs.
@scottlegrand I would urge you to keep it civil, please, and stop this nonsensical bullying. What you say may be true for Cuda, but it's neither relevant to Rémi's issue nor is it true for OpenCL or for some of the cards we've run on.
We see things very differently here clearly. But if you see my inquiry into understanding the root cause of what's actually going on here as "nonsensical bullying" then the best thing for me to do is exit the conversation. Please remove me from further discussion here.
This isn't merely a question of seeing things "differently". We engaged with you in good faith. You did not.
Instead of taking what we wrote at face value, you decided first to lecture me on why I hadn't used `NUMWI=64` - which is irrelevant to the number of generations, which was the question at hand - and then to top it off you decided to deride Diogo when what he wrote didn't fit your narrative. To me, this is bullying, plain and simple.
@RemiLacroix-IDRIS Maybe this is useful for you. This data is for the Titan V, which is Volta architecture. Using our branch `9a0d852`, OpenCL, and ADADELTA, wi=128 is marginally faster (8%), but within the standard deviation (11%).
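The "marginally faster, but within the standard deviation" comparison is easy to script for anyone repeating these runs; here is a minimal sketch with made-up runtime samples (the numbers below are illustrative, not measurements from this thread):

```python
from statistics import mean, stdev

def within_noise(times_a, times_b):
    """Crude check: is the difference in mean runtime smaller than the
    larger of the two samples' standard deviations?"""
    gap = abs(mean(times_a) - mean(times_b))
    return gap < max(stdev(times_a), stdev(times_b))

# Illustrative per-run wall times in seconds for wi=64 vs wi=128
wi64 = [108.0, 101.0, 117.0, 95.0, 110.0]
wi128 = [100.0, 93.0, 107.0, 88.0, 101.0]
print(within_noise(wi64, wi128))  # True: ~8% mean gap, but inside the spread
```

A proper treatment would use a t-test, but for deciding whether a NUMWI difference is worth acting on, this level of check is usually enough.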
Forgot to mention: the number of runs (`-nrun`) is 100, which may play a role. The number of systems is 36 and the numbers of evaluations range from 32k to 2048k.
I tried 2 different datasets and in all cases CUDA was noticeably slower, but I have been doing all my tests with `NUMWI=64` and `nrun=10`. I can try `NUMWI=128` and also a higher `nrun` next week.
@lpmcsn: you need to make sure to compare CUDA/OpenCL with the same overlapping settings (i.e. the same `OMP_NUM_THREADS`).
@RemiLacroix-IDRIS @lpmcsn In the output dlg files there are two timings given: "Run time" and "Idle time". The run time field is the compute time, which is y in "Job #x took y sec after waiting z sec for setup" in the runtime output; z is the idle time field.
When you add the run times together, with and without the file list they typically end up being very close, if not the same. The reason is that currently the calculations for each ligand are still sequential. This may change eventually.
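Adding up those fields can be scripted; this is a small sketch (not part of AutoDock-GPU) that parses the "Job #x took y sec after waiting z sec for setup" lines from captured runtime output, using a made-up two-job log as the example:

```python
import re

JOB_RE = re.compile(r"Job #\d+ took ([\d.]+) sec after waiting ([\d.]+) sec for setup")

def sum_times(log_text: str):
    """Return (total run time, total idle time) in seconds."""
    run = idle = 0.0
    for took, waited in JOB_RE.findall(log_text):
        run += float(took)
        idle += float(waited)
    return run, idle

log = """Job #1 took 12.5 sec after waiting 0.25 sec for setup
Job #2 took 10.25 sec after waiting 0.5 sec for setup"""
print(sum_times(log))  # (22.75, 0.75)
```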
> I tried 2 different datasets and in all cases CUDA was noticeably slower, but I have been doing all my tests with `NUMWI=64` and `nrun=10`. I can try `NUMWI=128` and also a higher `nrun` next week.

I did more tests using the Solis-Wets method and here are my findings:

- the OpenCL version is always faster than the CUDA version for a given combination of `NUMWI` (64 or 128) and `nrun` (10 or 100)
- the OpenCL version is slightly faster with `NUMWI=128` than with `NUMWI=64`
- the CUDA version is slower with `NUMWI=128` than with `NUMWI=64`.

How could we explain these results?
@RemiLacroix-IDRIS @lpmcsn I've been rerunning SW with `nruns=10`, `lsrat=100.0`, and otherwise standard settings, and get these runtimes on a Titan V:
Cuda
OpenCL
(Cuda's minimum `NUMWI` is 32, as this is needed for the reduction in the kernels.) This is on the exact same hardware with the exact same options, with Cuda 10.0 and Nvidia's OpenCL driver. It looks like OpenCL and Cuda behave rather differently with respect to the number of work units/threads per block.
@diogomart @RemiLacroix-IDRIS @lpmcsn To complete the picture, here are the ADADELTA runtimes I get with `nruns=10`, `lsrat=100.0`, and otherwise standard settings on the same Titan V:
Cuda
OpenCL
@RemiLacroix-IDRIS @lpmcsn To summarize, it looks like the best you can do is to run a small subset of your calculations with varying `NUMWI` settings (and maybe even OpenCL vs Cuda, as both should work on your machine) and choose the best ones for your workload. I think Scott's suggestion to make `NUMWI` a runtime option is not a bad one, and it is likely I'll implement this at some point.
It feels like it should be possible to optimize the CUDA implementation of Solis-Wets to have the same performance as the OpenCL version.
But in the meantime, we will advise our users to check both versions and maybe to prefer the OpenCL version if they use Solis-Wets.
Once again removing myself from this thread. Please do not add me again.