ccsb-scripps / AutoDock-GPU

AutoDock for GPUs and other accelerators
https://ccsb.scripps.edu/autodock
GNU General Public License v2.0

void setup_gpu_for_docking error #147

Open · Vardanos opened this issue 3 years ago

Vardanos commented 3 years ago

Hello. Sometimes AutoDock-GPU exits with the following error:

autodock_gpu: ./host/src/performdocking.cpp:234: void setup_gpu_for_docking(GpuData&, GpuTempData&):
Assertion `0' failed.

We are unable to reproduce the error for the same molecule, even with multiple repetitions. It started appearing after a version change and occurs at random, on multiple GPUs (RTX 3080, RTX 2080 Ti, K80) and with different CUDA versions (11.1, 10.2). We would be grateful if you could help us. Thanks in advance.

atillack commented 3 years ago

@Vardanos Which version of the code are you running?

Based on what you wrote, it seems like you are running the same binary on multiple very different GPUs. Please make sure that you compiled for each of your GPUs' compute capabilities (you can use make DEVICE=CUDA NUMWI=128 TARGETS="37 75", with your respective compute capabilities).
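
For the three cards mentioned (RTX 3080, RTX 2080 Ti, K80), per-architecture builds would look roughly like this; the compute capabilities come from NVIDIA's published tables, and NUMWI=128 is only an example work-group size:

# one binary per GPU architecture (sketch; adjust NUMWI and targets as needed)
make DEVICE=CUDA NUMWI=128 TARGETS="86"   # GeForce RTX 3080    (compute capability 8.6)
make DEVICE=CUDA NUMWI=128 TARGETS="75"   # GeForce RTX 2080 Ti (compute capability 7.5)
make DEVICE=CUDA NUMWI=128 TARGETS="37"   # Tesla K80           (compute capability 3.7)

A single binary covering all three cards can also be built by listing every target at once, e.g. TARGETS="37 75 86".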

Vardanos commented 3 years ago

@atillack Sorry for the late reply.

Which version of the code are you running?

We are running the latest version available in this repository. We have also tried compiling binaries from releases v1.3 and v1.4.

running the same binary on multiple very different GPUs

We actually compile for each GPU separately, using the respective compute capabilities. After further debugging it seems to be stable with CUDA 10.1. We tracked GPU memory usage with CUDA 11.1, and it looks like memory keeps growing. Here is the log of GPU memory usage over time, where you can see that memory.used [MiB] increases over time and is only freed after the error occurs.

utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
0 %, 2 %, 10015 MiB, 9627 MiB, 388 MiB
0 %, 2 %, 10015 MiB, 9627 MiB, 388 MiB
8 %, 0 %, 10015 MiB, 9083 MiB, 932 MiB
7 %, 0 %, 10015 MiB, 8377 MiB, 1638 MiB
....
100 %, 0 %, 10015 MiB, 3696 MiB, 6319 MiB
100 %, 0 %, 10015 MiB, 3698 MiB, 6317 MiB
100 %, 0 %, 10015 MiB, 3698 MiB, 6317 MiB
....
0 %, 0 %, 10015 MiB, 1315 MiB, 8700 MiB
0 %, 0 %, 10015 MiB, 1315 MiB, 8700 MiB
0 %, 0 %, 10015 MiB, 9645 MiB, 370 MiB
0 %, 0 %, 10015 MiB, 9645 MiB, 370 MiB
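
(A log in this column format can be produced with an nvidia-smi query loop along the following lines; the 5-second polling interval is only an example:)

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used \
           --format=csv -l 5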

We saw that AutoDock-GPU is documented to support CUDA v8.0, v9.0, and v10.0, so perhaps it's the CUDA version that is causing the memory leak. We will post an update once we have investigated further and made sure nothing else is eating GPU memory.

Thanks for your assistance!

atillack commented 3 years ago

@Vardanos Thank you for the update. We did run on CUDA 10.2 and 11.0 internally (and I still need to fix the README.md mentioning CUDA 8, as we actually need CUDA >= 9 ...).

This looks like memory either not being properly freed on the GPU or being continuously allocated; neither is supposed to happen.

When running a job with 10,000 ligands (using the filelist feature) I do not see this kind of accumulation with the OpenCL code path (make DEVICE=OCLGPU on CUDA-enabled systems), so in the interim you could test whether that works for you as well.

Which command line options did you use to run the jobs (e.g. filelist, device settings, etc.)?

Vardanos commented 3 years ago

@atillack thanks for your reply.

Which command line options did you use to run the jobs (e.g. filelist, device settings, etc.)?

We are using the following command for running AutoDock GPU:

./autodock_128wi \
    --ffile <protein>.maps.fld \
    --lfile <ligand>.pdbqt \
    --nrun 50
We are not using --filelist for batch processing, as it is not applicable to our use case; each ligand is docked in its own separate run.
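
In practice this means independent, back-to-back invocations of the command above, roughly like the following sketch (the directory layout and binary name are illustrative, not our exact setup):

# dock every ligand in its own AutoDock-GPU process (illustrative paths)
for lig in ligands/*.pdbqt; do
    ./autodock_128wi --ffile protein.maps.fld --lfile "$lig" --nrun 50
done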

...OpenCL code path (make DEVICE=OCLGPU on Cuda-enabled systems) so in the interim you could test if this works for you as well.

For the OpenCL version, we tested a build made with make DEVICE=OCLGPU NUMWI=128. After docking ~500 ligands sequentially, we got the following error (clCreateContext fails with code -6, which corresponds to CL_OUT_OF_HOST_MEMORY):

AutoDock-GPU version: v2.0-156-g30f42c21eff6cfc0a960e368a740c4116567ef42-dirty

Running 1 docking calculation

Kernel source used for development:      ./device/calcenergy.cl                  
Kernel string used for building:         ./host/inc/stringify.h                  
Kernel compilation flags:                 -I ./device -I ./common -DN128WI   -cl-mad-enable
OpenCL device:                           GeForce RTX 3080
Error: clCreateContext() -6

atillack commented 3 years ago

@Vardanos When you have a moment, please post July 2021's lottery numbers as we're currently on version 1.4 while your version string shows v2.0 :-D

Joking aside, the behavior you're observing (memory accumulation across successive, separate program invocations) has me a bit stumped right now. I have run 10,000-ligand dockings like you did, in successive runs, using the current OpenCL and CUDA code with version 1.4.3 (on an old GTX 1080 with driver 450.66 and CUDA 11.0), and unfortunately cannot reproduce this. I also don't see any forgotten memory deallocations in the code, which I would have suspected if you had used the --filelist feature. Of course, this doesn't mean there aren't any bugs in our code ...

One thing I am wondering about is whether your runs crash on the same ligand/protein system, and whether what you're seeing could be a system that is too big to fit into GPU memory: based on the limits on the number of atoms and the grid sizes, the maximum memory the code can currently request is about 16 GB.

In your CUDA output, at the beginning of each run, there should be a line telling you how much memory is available on the device before memory is allocated for the given system. Here is an example from my testing for this issue:

AutoDock-GPU version: 1.4.3-1-gdb7968b17241f947da6e30d9f2ae4aefa26be518

Running 1 docking calculation

Cuda device: GeForce GTX 1080
Available memory on device: 8010 MB (total: 8119 MB)

Could you post the one you're seeing when it crashes?

atillack commented 2 years ago

@Vardanos Version 1.5 is out, which contains a major rewrite of many aspects of the code, including some changes in the area that crashed for you. While I can't conclusively say this will fix your issue, as I am unfortunately unable to reproduce it, there is a good chance it might :-)