NVIDIA / cuda-python

CUDA Python Low-level Bindings
https://nvidia.github.io/cuda-python/

cudart.cudaSetDevice allocates memory on GPU other than target #20

Closed QuiteAFoxtrot closed 2 years ago

QuiteAFoxtrot commented 2 years ago

cuda-python 11.6.1, CUDA Toolkit 11.2, Ubuntu Linux

If you run something like the following on a multi-GPU machine:

from cuda import cuda, cudart

device_num = 5
err, = cuda.cuInit(0)
err, device = cuda.cuDeviceGet(device_num)
err, cuda_context = cuda.cuCtxCreate(0, device)
err, = cudart.cudaSetDevice(device)

The call to cudart.cudaSetDevice will properly set your device to '5', but it will also allocate ~305 MB of memory on device 0 (or whichever device is 0th in the ordering produced by CUDA_VISIBLE_DEVICES). I suspect this issue (possibly in the C CUDA runtime underneath?) may be the root of many downstream problems in libraries like TensorFlow and PyTorch, where a user selects a device but still gets allocations on other devices.

305 MB may not sound like much, but I'm running a program on an NVIDIA DGX with 16 GPUs and 64 worker processes, so 64 * 305 MB ≈ 19 GB of unusable space gets allocated on GPU 0, which crashes the program. I cannot simply set CUDA_VISIBLE_DEVICES to work around this, because the workers communicate with their parent process via shared GPU memory (cuIpcMemHandle), and the parent process needs access to all GPUs. Additionally, the worker processes perform data augmentation on one GPU while writing output to another GPU with a different device ID.
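The parent/worker handoff described above can be sketched roughly as follows. This is a hedged sketch, not the reporter's actual code: it assumes cuda-python's driver-API IPC calls (cuMemAlloc, cuIpcGetMemHandle, cuIpcOpenMemHandle), elides the actual cross-process transport of the handle, and is guarded so it degrades gracefully when cuda-python or a GPU is absent.

```python
# Sketch of sharing a device allocation between processes via CUDA IPC.
# The handle would normally be sent to a *different* process (e.g. over
# a multiprocessing pipe); opening it in the exporting process fails.
try:
    from cuda import cuda
    HAVE_CUDA = True
except ImportError:
    HAVE_CUDA = False

def export_buffer(nbytes):
    """Allocate on the current device and export an IPC handle (parent side)."""
    err, dptr = cuda.cuMemAlloc(nbytes)
    err, handle = cuda.cuIpcGetMemHandle(dptr)
    return dptr, handle  # send `handle` to a worker process

def import_buffer(handle):
    """Map a parent allocation into this process (worker side)."""
    err, peer_ptr = cuda.cuIpcOpenMemHandle(
        handle, cuda.CUipcMem_flags.CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS)
    return peer_ptr

if HAVE_CUDA:
    err, = cuda.cuInit(0)
    # ...create a context on the desired device, then export_buffer(...)...
```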

I am investigating a workaround that avoids calling cudart.cudaSetDevice entirely, but without that call I cannot use the pointer returned by cuda.cuMemAlloc to create a PyTorch tensor. When I do call cudart.cudaSetDevice, the pointer works as expected.

vzhurba01 commented 2 years ago

Thanks for the report! I've pushed a fix (a6511d5) to the main branch; it can be installed from source.

The PyPI/conda packages will receive the fix in the next release.

FYI, in your code snippet you likely want to pass device_num instead of device to cudaSetDevice. device_num is a device ordinal, whereas device is a device handle, and their integer representations may not always match. (In fact, cudaSetDevice internally calls cuDeviceGet on the passed ordinal to obtain a device handle.)
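Putting that advice together with the original snippet, the corrected call order might look like the sketch below: the driver API takes the handle, the runtime API takes the ordinal. It is guarded so it imports cleanly when cuda-python or a GPU is unavailable.

```python
try:
    from cuda import cuda, cudart
    HAVE_CUDA = True
except ImportError:
    HAVE_CUDA = False

DEVICE_NUM = 5  # device *ordinal*, as enumerated by the CUDA runtime

if HAVE_CUDA:
    err, = cuda.cuInit(0)
    err, device = cuda.cuDeviceGet(DEVICE_NUM)  # ordinal -> device handle
    err, ctx = cuda.cuCtxCreate(0, device)      # driver API takes the handle
    err, = cudart.cudaSetDevice(DEVICE_NUM)     # runtime API takes the ordinal
```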

QuiteAFoxtrot commented 2 years ago

Great, thank you! A question regarding your device_num advice - does that hold up under CUDA_VISIBLE_DEVICES? For example, if CUDA_VISIBLE_DEVICES = 4,5,6,7 and I set device_num=2, presumably that actually gives me a pointer to device 6 - so if I passed a "2" into cuda.cuCtxCreate, will that also properly map to device 6?

I've also noticed some strange behavior with cuda.cuDeviceCanAccessPeer - namely if you pass in the same device twice it reports that access is not possible (which I interpret as, it can't map its own memory?). Is that intentional behavior? If so, would you like me to open another issue?

vzhurba01 commented 2 years ago

> does that hold up under CUDA_VISIBLE_DEVICES?

It holds up. Your example works exactly as you describe.
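Concretely, the remapping behaves the same for both APIs, so your example could be sketched like this (illustrative values; the environment variable must be set before the driver is initialized, and the block is guarded for machines without cuda-python or a GPU):

```python
import os
# Restrict visibility *before* any CUDA call; illustrative device list.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "4,5,6,7")

try:
    from cuda import cuda, cudart
    HAVE_CUDA = True
except ImportError:
    HAVE_CUDA = False

ORDINAL = 2  # third *visible* device -> physical GPU 6 under this mapping

if HAVE_CUDA:
    err, = cuda.cuInit(0)
    # Driver and runtime both see the same remapped ordering, so
    # ordinal 2 refers to physical GPU 6 in both calls below.
    err, handle = cuda.cuDeviceGet(ORDINAL)
    err, = cudart.cudaSetDevice(ORDINAL)
```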

> which I interpret as, it can't map its own memory?

That's expected: peer-to-self access is disallowed.
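A small guarded sketch of the behavior discussed, assuming cuda-python's cuDeviceCanAccessPeer (distinct devices may report 1 when linked by NVLink/PCIe; the same device twice reports 0 by design):

```python
try:
    from cuda import cuda
    HAVE_CUDA = True
except ImportError:
    HAVE_CUDA = False

def can_peer(a, b):
    """Return 1 if device handle `a` can access `b`'s memory, else 0."""
    err, ok = cuda.cuDeviceCanAccessPeer(a, b)
    return ok if err == cuda.CUresult.CUDA_SUCCESS else 0

if HAVE_CUDA:
    err, = cuda.cuInit(0)
    err, count = cuda.cuDeviceGetCount()
    if err == cuda.CUresult.CUDA_SUCCESS and count >= 2:
        err, dev0 = cuda.cuDeviceGet(0)
        err, dev1 = cuda.cuDeviceGet(1)
        pair = can_peer(dev0, dev1)  # may be 1 for linked peers
        same = can_peer(dev0, dev0)  # peer-to-self: reports 0
```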