ccsb-scripps / AutoDock-GPU

AutoDock for GPUs and other accelerators
https://ccsb.scripps.edu/autodock
GNU General Public License v2.0
366 stars 101 forks

Seems OpenCL is faster than CUDA #239

Open daylight-00 opened 10 months ago

daylight-00 commented 10 months ago

I compared the per-ligand time and the total job time of AutoDock-GPU built with CUDA and with OpenCL, and CUDA consistently took about 30-40% longer than OpenCL. However, the description in the repository says CUDA is faster than OpenCL, contrary to my results. I got similar results under different conditions on the same system, but I have not tried other systems. I would appreciate it if others could check whether CUDA is really faster than OpenCL.

System:

[screenshot: system configuration]

Docking:

[screenshot: docking runtime output]
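For reference, a comparison like the one above can be made with a small wrapper that times each build under identical conditions on the same node. The wrapper below is a sketch; the binary names and input files in the usage comment are placeholders, not the reporter's actual setup.

```shell
# Hypothetical wrapper: report wall-clock seconds for an arbitrary command,
# so the Cuda and OpenCL builds can be timed the same way on the same node.
time_cmd() {
  local start end
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  end=$(date +%s)
  echo $(( end - start ))
}

# Usage sketch (binary names and inputs are placeholders):
#   time_cmd ./bin/autodock_gpu_128wi_cuda --ffile receptor.maps.fld --lfile ligand.pdbqt
#   time_cmd ./bin/autodock_gpu_128wi_ocl  --ffile receptor.maps.fld --lfile ligand.pdbqt
```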

atillack commented 10 months ago

@daylight-00 Thank you, and yes, OpenCL is about 5-15% faster in our own testing on the same hardware (RTX A5000). Newer versions should narrow the gap a bit, since Cuda now requests a smaller chunk of memory based on the actual memory needed rather than the maximums, similar to OpenCL - so if this isn't the current develop branch, it may be worthwhile to test again.

I suspect the remaining difference comes from shared-memory variables being pre-allocated at compile time (OpenCL) versus dynamically allocated at runtime (Cuda) - other than that, the Cuda and OpenCL paths use exactly the same algorithms and, as much as possible, even the same implementations...

Since OpenCL exists on Nvidia and many more devices (all the way to Android), that's ultimately good news though :-)

atillack commented 10 months ago

Found the culprit: it looks like I wrote that Cuda was faster in our README.md about three years ago. That was probably true at the time, before I merged the integer gradient from Cuda into OpenCL as well. I'll fix README.md by taking that sentence out.

daylight-00 commented 10 months ago

Thank you for your answer. I did use the develop branch, though. I'm using AutoDock-GPU on a cluster with various types and numbers of GPUs (A5000, A6000, 3090, ...), and I wonder if there could be a problem if I run it on a different node than the one I compiled on.

atillack commented 10 months ago

For Cuda this should only be an issue if you compile for the wrong architecture(s) - for the 3090/A5000/A6000 you want to compile with TARGETS="86".
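As a sketch, the mapping from these GPU models to the TARGETS value can be written down explicitly. The helper name below is hypothetical, and the make invocation in the comment assumes the DEVICE/TARGETS build variables mentioned in this thread; only the "86" mapping for these three cards is taken from the comment above.

```shell
# Hypothetical helper: map an NVIDIA GPU model to the compute-capability
# value passed to the AutoDock-GPU build via TARGETS=.
cc_for_gpu() {
  case "$1" in
    *3090*|*A5000*|*A6000*) echo 86 ;;       # Ampere cards named in this thread
    *)                      echo unknown ;;  # extend per your cluster's hardware
  esac
}

# Build sketch, assuming the Makefile's DEVICE/TARGETS variables:
#   make DEVICE=CUDA TARGETS="$(cc_for_gpu 'RTX A5000')"
```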

One more thing: I would only compare overall runtimes on the same machine, as the kernel runtime performance timers, even when placed at the same location in the code, may still include different tasks depending on what Cuda and OpenCL do at kernel cleanup time.

atillack commented 10 months ago

I just realized that PR #233 should close the Cuda performance gap a bit more, as it contains the code to allocate the same amount of memory as OpenCL...