NVIDIA / cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
https://docs.nvidia.com/cuda/cuquantum/
BSD 3-Clause "New" or "Revised" License

Unable to run cuQuantum on GH200 (NVIDIA Grace Hopper) arm64 Docker #139

Closed: giladqm closed this issue 1 month ago

giladqm commented 1 month ago

(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ nvidia-smi
Tue May 21 07:58:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |  00000009:01:00.0 Off  |                  On  |
| N/A   23C    P0            62W / 900W   |     5MiB / 97871MiB    |     N/A     Default  |
|                                         |                        |             Enabled  |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                             |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                    |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory  |
|        ID   ID                                                               Usage       |
|=========================================================================================|
|  No running processes found                                                              |
+-----------------------------------------------------------------------------------------+

(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:10:07_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ python simon.py
Secret string = [0 1 1]
CUDA error: system not yet initialized device_management.h 91

giladqm commented 1 month ago

The Docker image is nvcr.io/nvidia/cuquantum-appliance:24.03-arm64.

mtjrider commented 1 month ago

@giladqm thanks for posting your issue. After an initial look into this, I can confirm a few things:

With CUDA driver 535, I cannot reproduce the issue you reported with that example in the Appliance container.

More generally, your issue seems to be related to an improperly configured MIG setup on the node in question.
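
As a quick host-side sanity check, you can inspect the MIG state directly with nvidia-smi. A sketch (the exact output depends on your configuration, and the mig subcommands generally require root):

nvidia-smi -L              # should list the GPU and, beneath it, any MIG devices
sudo nvidia-smi mig -lgi   # should list at least one GPU instance when MIG is enabled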

To confirm, please launch an interactive session of the cuQuantum Appliance container with this command:

docker run --runtime nvidia --gpus all --rm -it nvcr.io/nvidia/cuquantum-appliance:24.03-arm64

Next, execute this command:

python -c "import cupy as cp; dev = cp.cuda.Device(0); cp.cuda.runtime.setDevice(dev); print(f'default device id: {cp.cuda.Device().id}'); empty = cp.empty((2,2)); print(f'memory alloc. by pool: {cp.get_default_memory_pool().used_bytes()}'); print(f'num. dev. detected by cupy: {cp.cuda.runtime.getDeviceCount()}')"

Because of how quotation marks get escaped when commands are passed through Docker, I recommend running the command above inside an interactive session. As a Python file, it looks like this:

import cupy as cp

# Select GPU 0 explicitly; .id passes the integer index the runtime API expects.
dev = cp.cuda.Device(0)
cp.cuda.runtime.setDevice(dev.id)
print(f'default device id: {cp.cuda.Device().id}')

# Allocate a small array so the default memory pool reports nonzero usage.
empty = cp.empty((2, 2))
print(f'memory alloc. by pool: {cp.get_default_memory_pool().used_bytes()}')
print(f'num. dev. detected by cupy: {cp.cuda.runtime.getDeviceCount()}')
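
If the container can see the GPU, a healthy run of this snippet prints something like the following (the byte count is indicative only: CuPy's default memory pool rounds small allocations up, so a 2x2 float64 array typically reports 512 bytes):

default device id: 0
memory alloc. by pool: 512
num. dev. detected by cupy: 1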

If these commands fail, it means CuPy inside the Appliance container cannot detect any GPUs, which points to a problem with the system configuration itself.

Can you run these commands and share the output with us?

Thanks, Matthew J

giladqm commented 1 month ago

The issue was with the MIG setup. My colleague fixed it:

(base) nikola@gracehopper:~$  sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 7g.96gb          0        0          0:8     |
+-------------------------------------------------------+
(base) nikola@gracehopper:~$ echo $CUDA_VISIBLE_DEVICES

(base) nikola@gracehopper:~$ nvidia-smi -L
GPU 0: NVIDIA GH200 480GB (UUID: GPU-d8731c65-c898-919e-74c9-286b27400dac)
  MIG 7g.96gb     Device  0: (UUID: MIG-7baedeb1-c0d7-53ba-9926-2e341a42b470)
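
For anyone who lands here with the same "system not yet initialized" error: the fix amounts to enabling MIG mode and then creating a GPU instance plus its compute instance. A hedged sketch of the usual sequence (profile ID 0 corresponds to 7g.96gb on this GH200, as in the listing above; confirm the IDs on your own system with -lgip):

sudo nvidia-smi -i 0 -mig 1    # enable MIG mode on GPU 0 (may require a GPU reset)
sudo nvidia-smi mig -lgip      # list the GPU instance profiles this GPU supports
sudo nvidia-smi mig -cgi 0 -C  # create a GPU instance for profile 0 plus a matching compute instance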