ccsb-scripps / AutoDock-GPU

AutoDock for GPUs and other accelerators
https://ccsb.scripps.edu/autodock
GNU General Public License v2.0

CUDA_VISIBLE_DEVICES not being used to set device number #119

Closed abazabaaa closed 3 years ago

abazabaaa commented 3 years ago

Hi,

I am using the software on a cluster with 2 GPUs per node. Prior to startup, CUDA_VISIBLE_DEVICES is set to 0 or 1 depending on which GPU is currently in use.

My jobs consistently fail on nodes where another job is present. Below is an example of the output. All of the files produced are core dumps.

In this case CUDA_VISIBLE_DEVICES = 1

How does AutoDock-GPU pick a device? I see that devnum is set to 1 by default and counting begins at 1, whereas device numbers for NVIDIA GPUs almost always start at 0.

Is there a way to set devnum from the value of CUDA_VISIBLE_DEVICES?

CUDA_VISIBLE_DEVICES 1
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
nvidia-smi
Wed Feb 3 17:49:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   50C    P0    85W / 250W |   8797MiB / 12198MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:D8:00.0 Off |                    2 |
| N/A   23C    P0    24W / 250W |      1MiB / 12198MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    269000      C   ...testing/gandlf_mine/venv10.2/bin/python  8787MiB |
+-----------------------------------------------------------------------------+

AutoDock-GPU version: v1.3
cudaDeviceSetLimit failed out of memory
AutoDock-GPU version: v1.3
cudaDeviceSetLimit failed out of memory
AutoDock-GPU version: v1.3
cudaDeviceSetLimit failed out of memory
AutoDock-GPU version: v1.3
cudaDeviceSetLimit failed out of memory
AutoDock-GPU version: v1.3

atillack commented 3 years ago

@abazabaaa The command line option to use is -devnum, which starts counting at 1 - one more than what nvidia-smi shows - so if you want device 0 it would be -devnum 1, and for device 1 it would be -devnum 2.

As for CUDA_VISIBLE_DEVICES - that's an interesting one, as from the program's perspective all devices are visible. The trick we chose to make it work is that when you don't use the -devnum command line option, it is actually not choosing device 0 but whichever device the driver has as the current device ... In other words, CUDA_VISIBLE_DEVICES=1,0 ./autodock_gpu_xxxwi -ffile etc ... should work ...

I will test on our cards with CUDA_VISIBLE_DEVICES; for now, I would recommend using -devnum 2 if you can.
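For reference, here is a minimal sketch (not AD-GPU's actual source) of how a 1-based -devnum maps onto the CUDA runtime's 0-based device indices, which are themselves relative to whatever CUDA_VISIBLE_DEVICES exposes:

```cpp
// Hypothetical sketch of 1-based device selection on top of the CUDA runtime.
// The runtime only enumerates the GPUs listed in CUDA_VISIBLE_DEVICES, so
// index 0 here is the *first visible* device, not necessarily physical GPU 0.
#include <cuda_runtime.h>
#include <cstdio>

bool select_device(int devnum /* 1-based, as on the AD-GPU command line */) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA devices visible to this process.\n");
        return false;
    }
    int idx = devnum - 1;               // convert to the runtime's 0-based index
    if (idx < 0 || idx >= count) {
        fprintf(stderr, "-devnum %d is out of range (1..%d).\n", devnum, count);
        return false;
    }
    cudaError_t err = cudaSetDevice(idx);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed: %s\n", idx, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

With CUDA_VISIBLE_DEVICES=1 the runtime exposes only physical GPU 1, as device index 0, which is why -devnum 1 (or omitting -devnum entirely) should land on it in that case.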

abazabaaa commented 3 years ago

Thanks.

We have identified another potential issue and ruled out CUDA_VISIBLE_DEVICES. I have tested this extensively and found numerous nodes that succeed with device numbers defined by CUDA_VISIBLE_DEVICES.

Instead, it seems to be an issue with ECC errors on specific cards. If you look at nvidia-smi, you will see that card 2 has Volatile Uncorr. ECC = 2 in addition to showing 1 MiB / 12198 MiB of memory usage.

When AutoDock-GPU requests memory, how much does it ask for? Is it possible that it is requesting the total amount on the card, and the difference (1 MiB) means it ends up with less than requested? Can one hardcode the memory requests and limit this process (in the interest of debugging)? This seems like it could lead to seg faults and memory dumps. The output from AutoDock-GPU is an out-of-memory error.

atillack commented 3 years ago

@abazabaaa Very interesting. Depending on the maps and the ligand's number of atoms, AD-GPU uses at most 16 GB of memory - for more typically sized systems this is usually much less and closer to 1 GB - but what's causing your crashes seems to be the cudaDeviceSetLimit call at performdocking.cpp.Cuda:134, which tries to reserve a Fifo buffer of (at most) 8 GB. The thing is, though, this looks like left-over debugging code to me - when not debugging on the GPU it's not really needed.

With the ECC error you're seeing it's possible commenting this line (and its error check) out may fix your issue - at the same time, depending on how your driver allocates memory you may also get "unlucky" and end up in a "bad" memory location again by chance.

I just tested our test set with lines 134 and 135 commented out in performdocking.cpp.Cuda which worked fine. I will continue testing and will likely add a fix in an upcoming PR - in the interim, please test and see if this might already work for you :-)
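For reference, the two statements at lines 134 and 135 (quoted verbatim further down in this thread) would simply be commented out, e.g.:

```cpp
// Disabled for testing: the printf FIFO reservation is only needed when
// debugging kernels that use device-side printf.
// status = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 200000000ull);
// RTERROR(status, "cudaDeviceSetLimit failed");
```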

abazabaaa commented 3 years ago

@atillack Thanks! I have a list of nodes with the impacted GPUs. I will run a few trials and report back. Happy to dig up any additional information if it would help you. Just let me know.

abazabaaa commented 3 years ago

@atillack Unfortunately commenting out those lines of code did not help.

My jobs generally do not take more than 2-3 GB of VRAM, so it is a bit puzzling why this would be an issue.

Any other thoughts on things to try?

atillack commented 3 years ago

@abazabaaa Is there a different error message now that you commented out these two lines (shown here just to be sure)?

status = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 200000000ull);
RTERROR(status, "cudaDeviceSetLimit failed");

It looks like your card reported two uncorrectable memory errors. While this looks like it's probably small enough to not be responsible you could reset the counter to zero with nvidia-smi -i 1 -p 0 and see if it increases after you run. If it does then this could indicate an ongoing issue with the card itself - but again, 2 errors is likely nothing to worry about...

As you are on a compute cluster, which is likely set to EXCLUSIVE mode for each GPU (I believe that's what "E. Process" in your nvidia-smi output means), it could be a case of your process ending up on the wrong GPU (the one that already has a process on it). How are you currently running? I would try both without -devnum and with -devnum 1 (this should pick the first one in the CUDA_VISIBLE_DEVICES list). I realize this is in contradiction to what I wrote above - I was under the impression this is a workstation (not reading carefully - mea culpa) where you are in control of setting CUDA_VISIBLE_DEVICES and that you could run without it being set (in which case -devnum 2 should work).
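As a quick diagnostic (a standalone sketch, not part of AD-GPU), you could also print exactly which physical GPUs the process can see under a given CUDA_VISIBLE_DEVICES setting and match the PCI bus IDs against your nvidia-smi output:

```cpp
// List the devices the CUDA runtime actually exposes to this process.
// Compile e.g. with: nvcc list_devices.cu -o list_devices   (hypothetical file name)
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // pciBusID/pciDeviceID correspond to the Bus-Id column in nvidia-smi.
        printf("runtime index %d: %s (PCI %02x:%02x.0)\n",
               i, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```

With CUDA_VISIBLE_DEVICES=1 this should list a single device whose bus ID matches GPU 1 (D8:00.0) in the output above.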

atillack commented 3 years ago

@abazabaaa One other thing to try in a case like this is to get hold of one node in "interactive mode" - i.e., launch a shell through the job scheduler that you can use to run things from manually. There, I would try to run on each GPU individually with -devnum, and if that works, try running concurrently on both devices - next would be to test with CUDA_VISIBLE_DEVICES to see which GPU you end up on without -devnum and which with -devnum 1 ... The 1 MiB / 12198 MiB means the GPU has all of its memory available - even on later drivers nvidia-smi seems to start counting at 1 (like our Cuda code, must be infectious :-D). In any case, unless there's an actual hardware fault or a node that needs to be rebooted (this sometimes happens too), my best bet is that it's a case of ending up on the wrong GPU ;-)

abazabaaa commented 3 years ago

@atillack I am actually not able to operate the nodes in interactive mode. The interactive nodes we have access to only have 1 GPU.

I am not inclined to think this is a device number error, as I am submitting 35 parallel jobs that generally hit busy nodes (meaning GPU 0 is taken) and these jobs always succeed. I have a list of nodes that have errors associated with their memory, and so far these are the only jobs that fail. There are a total of 100 P100 GPUs on the compute cluster and all core dumps are logged -- AutoDock-GPU is the only program in the last month to register a core dump, so I think it is specific to how the program is dealing with GPU memory. That being said, I am by no means an expert when it comes to debugging CUDA processes... just ruling out the possibilities.

I am now getting a new type of error, so there is progress. Thoughts?

nvidia-smi
Sat Feb 13 17:26:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   57C    P0   133W / 250W |  11785MiB / 12198MiB |     87%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:D8:00.0 Off |                    2 |
| N/A   22C    P0    24W / 250W |      1MiB / 12198MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1196      C   python                                     11775MiB |
+-----------------------------------------------------------------------------+

AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951-dirty

CUDA Setup time 0.093031s
cData.pKerconst_interintra: failed to allocate GPU memory.
out of memory
AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951-dirty

abazabaaa commented 3 years ago

@atillack Here are two examples where the program runs correctly for two configurations: one has a job running on card 2 and the other on card 1. No Volatile Uncorr. ECC errors on either card.

CUDA_VISIBLE_DEVICES 1
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
nvidia-smi
Sat Feb 13 17:26:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   50C    P0   116W / 250W |  11785MiB / 12198MiB |     87%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   22C    P0    24W / 250W |      0MiB / 12198MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    198073      C   python                                     11775MiB |
+-----------------------------------------------------------------------------+

AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951-dirty

CUDA Setup time 0.142740s (Thread 0 is setting up Job 0)

Running Job #0: Local-search chosen method is: Solis-Wets (sw)

Rest of Setup time 0.014424s

CUDA_VISIBLE_DEVICES 0
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
nvidia-smi
Sat Feb 13 17:26:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   21C    P0    24W / 250W |      0MiB / 12198MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   32C    P0    47W / 250W |   7149MiB / 12198MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     28503      C   python                                      7139MiB |
+-----------------------------------------------------------------------------+

AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951-dirty

CUDA Setup time 0.106279s (Thread 0 is setting up Job 0)

Running Job #0: Local-search chosen method is: Solis-Wets (sw)

atillack commented 3 years ago

@abazabaaa Seems like there is a strong correlation with the ECC errors being present - and it's still bailing at the first allocation of memory. Hmm. This time it's trying to allocate about 3.3 kB (the size of cData.pKerconst_interintra is 256 * 13 bytes), which very likely exists ...

In my experience, besides faulty hardware, this typically boils down to one of three things: 1) the driver, 2) the Cuda version used for compiling vs. the one on the machine, or 3) the machine needing a reboot.

It does look like Cuda 9 was used for compiling while Cuda 10.2 is the driver's version - I see that this is also the case with the working example, so the mismatch does seem to work, but since a module unload cuda ; module load cuda/10.2 (or your favorite module system's commands) is probably easier than the other two options, it might be worth a shot.

Additionally, although it's very likely the OpenCL version (make DEVICE=OCLGPU NUMWI=128) will face the same issue, there's a chance it may work (also, currently, it's a little bit faster than Cuda). I don't think CUDA_VISIBLE_DEVICES affects OpenCL visibility, so you'll probably have to set the device yourself using your favorite shell math, like so: -devnum $((CUDA_VISIBLE_DEVICES+1)) (this should work with bash).

abazabaaa commented 3 years ago

@atillack

"It does look like Cuda 9 was used for compiling and Cuda 10.2 is the driver's version - I also see that that's the case with the working example so it looks like it's working, but since it's probably easiest to do a module unload cuda ; module load cuda/10.2 (or your favorite module system's commands) rather than the other two it might be worth a shot."

Is it possible to compile with Cuda 10.2? I tried once or twice but got an error and didn't try again. I will see if I can get that to work.

I will also compile the OpenCL version and see how that looks.

atillack commented 3 years ago

@abazabaaa Yes, we use it with GCC 6.3.0 and Cuda 10.0, 10.2, and 11.0 here on some of our machines :-)

If you see a fatbinary_ (something) error when compiling with newer Cuda versions, it means you need to point the include and library directories to where Cuda 10.2 (or newer) is installed, using GPU_INCLUDE_PATH (in our case it's set to /opt/applications/cuda/10.2/include) and GPU_LIBRARY_PATH (in our case set to /opt/applications/cuda/10.2/lib64) - you can typically find the paths with module show cuda/10.2, or sometimes the environment variable CUDA_PATH is set by the module system (which nvcc will also tell you).

(you'll also usually want GPU_INCLUDE_PATH and GPU_LIBRARY_PATH set for OpenCL, but since for OpenCL neither the headers nor the library interface change between Cuda versions it's usually not sensitive to Cuda version changes)

abazabaaa commented 3 years ago

@atillack So I tried recompiling, and it did not appear to solve the issue. I do, however, have a new error with OpenCL, so maybe it is more informative. How does the program specify how much memory it wants? Is there a way for me to change the code such that it asks for less memory or for smaller blocks?

CUDA_VISIBLE_DEVICES 1
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
nvidia-smi
Tue Feb 16 14:27:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   63C    P0   169W / 250W |  11785MiB / 12198MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:D8:00.0 Off |                    2 |
| N/A   23C    P0    24W / 250W |      1MiB / 12198MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    252660      C   python                                     11775MiB |
+-----------------------------------------------------------------------------+

AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951

Kernel source used for development: ./device/calcenergy.cl
Kernel string used for building: ./host/inc/stringify.h
Kernel compilation flags: -I ./device -I ./common -DN128WI -cl-mad-enable
OpenCL device: Tesla P100-PCIE-12GB
Error: clCreateContext() -9999

AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951

Kernel source used for development: ./device/calcenergy.cl
Kernel string used for building: ./host/inc/stringify.h
Kernel compilation flags: -I ./device -I ./common -DN128WI -cl-mad-enable
OpenCL device: Tesla P100-PCIE-12GB
Error: clCreateContext() -9999

AutoDock-GPU version: v1.3-17-g18266b7fea8774b06faa92537a9997b0957b4951

abazabaaa commented 3 years ago

@atillack It seems this error is specific to Nvidia:

Code    Vendor    Function(s)               Description
-9999   NVidia    clEnqueueNDRangeKernel    Illegal read or write to a buffer

atillack commented 3 years ago

@abazabaaa It seems those cards really don't want to run ... The OpenCL code fails at the point it tries to get an OpenCL context (in other words, tries to get the device) which shouldn't fail with -9999 (the code hasn't asked for any memory at this point).
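For what it's worth, a minimal standalone context-creation check (a hedged sketch, not AD-GPU's host code, assuming the first GPU of the first OpenCL platform is the card in question) looks roughly like this - if even this fails with -9999 on that node, the problem sits below AutoDock-GPU:

```cpp
// Minimal OpenCL context-creation check. If the device or driver is in a bad
// state, clCreateContext can fail here with a vendor-specific code such as
// -9999 before any buffers have been requested.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    err = clGetPlatformIDs(1, &platform, nullptr);
    if (err != CL_SUCCESS) { printf("clGetPlatformIDs failed: %d\n", err); return 1; }

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    if (err != CL_SUCCESS) { printf("clGetDeviceIDs failed: %d\n", err); return 1; }

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    if (err != CL_SUCCESS) { printf("clCreateContext failed: %d\n", err); return 1; }

    printf("Context created successfully.\n");
    clReleaseContext(ctx);
    return 0;
}
```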

That it does fail there, in my opinion, points towards the driver missing a beat or the GPU being in a weird state. I think it is very likely that a node restart could fix this. You could also try using nvidia-smi to reset the device before running with nvidia-smi --gpu-reset -i 1, but you'll likely have insufficient permissions for that - your sysadmin should be able to do so, however.

abazabaaa commented 3 years ago

@atillack I don't think the reset is the issue -- that didn't help.

Our cluster is heavily used for medical image analysis and we have some contacts at NVIDIA. We ran this by them and got the following response. I am a bit new to CUDA... does any of this make sense?

Thanks for the introduction Nicola.

Hello Spyros and Mark,

I'm a medical imaging solution architect located north of Philadelphia in Chalfont, PA, and will be happy to help with the issue you've described below. Regarding the question about recommended way to query available memory, I recommend using the cudaMemGetInfo() function call to retrieve the amount of available memory for new allocations. The available memory value should account for retired memory pages, as well as memory used by the CUDA runtime system.

Also, if possible, please make sure your code is checking for error returns from CUDA memory allocation calls to avoid segmentation faults. If you find that memory allocations are failing, but are within the available memory value, this may indicate a software bug. In that case, I can help with getting the issue resolved.

Best regards,
Bob
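For concreteness, Bob's two recommendations - query free memory with cudaMemGetInfo() and check every allocation's return code - boil down to a pattern like the following sketch (illustrative only, not AD-GPU's code; the ~3.3 kB size is just the one discussed above):

```cpp
// Query free/total device memory with cudaMemGetInfo(), then attempt a small
// allocation and check its return code, as recommended above.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
    if (err != cudaSuccess) {
        printf("cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device memory: %zu MiB free of %zu MiB total\n",
           free_b >> 20, total_b >> 20);

    const size_t nbytes = 256 * 13;   // ~3.3 kB, the allocation that failed above
    void* d_buf = nullptr;
    err = cudaMalloc(&d_buf, nbytes);
    if (err != cudaSuccess) {
        // Failing here while well below the reported free memory would be the
        // "software bug" case described in the email; failing on a tiny request
        // with plenty of free memory points at driver/device state instead.
        printf("cudaMalloc(%zu bytes) failed: %s\n", nbytes, cudaGetErrorString(err));
        return 1;
    }
    printf("Allocation of %zu bytes succeeded.\n", nbytes);
    cudaFree(d_buf);
    return 0;
}
```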
abazabaaa commented 3 years ago

@atillack If you prefer we can move this to email and get directly connected with Bob. He seems more than willing to help solve the issue. We are pretty motivated to debug this, as we like the software you have created : )

atillack commented 3 years ago

@abazabaaa Thanks for liking the code :-) I am happy to keep things public on here unless of course you'd like to have an additional side-channel (my email is my username here at scripps.edu).

I just PR'ed a bit of code that adds the device name and number, as well as the available memory, to the output for Cuda (PR #122). Maybe this can help narrow things down here a bit. We are already checking and outputting the error codes of every single Cuda function call (including memory allocation).

The weird thing about what you're seeing is that it fails early - before much of anything is allocated. In the OpenCL version it even failed at trying to allocate the first 3.3 kB ... This is rather strange, particularly in conjunction with you mentioning that it runs on other nodes just fine. All of these things to me normally point toward the driver or the hardware.

abazabaaa commented 3 years ago

@atillack Thanks, I will try out PR #122 and see how things go. I am a bit new to using GitHub, so what is the process for getting the updated code and running it?

I think we will get to the bottom of it... It is hard to wrap my head around. The thing that makes me believe there is an issue in the code, and not a driver/node/card issue, is that these cards are in heavy use -- they are running Python-based DL modules that use most of the available memory day in and day out, and I am the only one with issues. I will try the new changes you have made in PR #122 (I will see if I can figure out how to get that version and compile it) and then report back.

Here is an additional person from NVIDIA commenting:

I second Bob's recommendations to use cuMemGetInfo() to check free memory and to always check CUDA return codes for errors to avoid seg faults. Please let us know if this does not resolve the issue. Regards, Brad

Brad Palmer
Senior Solutions Architect
Higher Education and Research
NVIDIA

atillack commented 3 years ago

@abazabaaa The PR has been merged - so all you'll need to do is git pull. The memory reporting lines are at performdocking.cpp.Cuda lines 147 and 148 - in the same function (which does all the memory allocation) you could copy & paste them before each allocation to get the available amount of memory.
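The snippet to paste would be roughly along these lines (an assumption on my part, not necessarily the exact lines from PR #122); the braces keep the local variables from clashing if you paste it more than once in the same function:

```cpp
{   // hypothetical reporting snippet: log remaining device memory before the next allocation
    size_t mem_free = 0, mem_total = 0;
    cudaMemGetInfo(&mem_free, &mem_total);
    printf("Before next allocation: %zu MiB free of %zu MiB total\n",
           mem_free >> 20, mem_total >> 20);
}
```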

atillack commented 3 years ago

(one more note: as you likely edited performdocking.cpp.Cuda you may additionally have to do a git checkout -- host/src/performdocking.cpp.Cuda from AD-GPU's main folder)