NVIDIA / cuQuantum

Home for cuQuantum Python & NVIDIA cuQuantum SDK C++ samples
https://docs.nvidia.com/cuda/cuquantum/
BSD 3-Clause "New" or "Revised" License

Unable to run cuQuantum on GH200 (NVIDIA Grace Hopper) arm64 Docker #139

Closed: giladqm closed this issue 1 month ago

giladqm commented 1 month ago

(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ nvidia-smi
Tue May 21 07:58:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |  00000009:01:00.0 Off  |                  On  |
| N/A   23C    P0            62W / 900W   |     5MiB / 97871MiB    |     N/A     Default  |
|                                         |                        |             Enabled  |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                             |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                    |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory  |
|        ID   ID                                                               Usage       |
|=========================================================================================|
|  No running processes found                                                              |
+-----------------------------------------------------------------------------------------+

(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:10:07_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ python simon.py
Secret string = [0 1 1]
CUDA error: system not yet initialized device_management.h 91

giladqm commented 1 month ago

The Docker image is nvcr.io/nvidia/cuquantum-appliance:24.03-arm64.

mtjrider commented 1 month ago

@giladqm thanks for posting your issue. After an initial look into this, I can confirm a few things:

With CUDA driver 535, I cannot reproduce the issue you reported with that example in the Appliance container.

More generally, your issue seems to be related to an improperly configured MIG setup on the node in question.
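
As a quick host-side sanity check, you can inspect the MIG state directly with nvidia-smi. A sketch (the exact output depends on your configuration, and the mig subcommands generally require root):

nvidia-smi -L              # should list the GPU and, beneath it, any MIG devices
sudo nvidia-smi mig -lgi   # should list at least one GPU instance when MIG is enabled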

To confirm, please launch an interactive session of the cuQuantum Appliance container with this command:

docker run --runtime nvidia --gpus all --rm -it nvcr.io/nvidia/cuquantum-appliance:24.03-arm64

Next, execute this command:

python -c "import cupy as cp; dev = cp.cuda.Device(0); cp.cuda.runtime.setDevice(dev); print(f'default device id: {cp.cuda.Device().id}'); empty = cp.empty((2,2)); print(f'memory alloc. by pool: {cp.get_default_memory_pool().used_bytes()}'); print(f'num. dev. detected by cupy: {cp.cuda.runtime.getDeviceCount()}')"

Because of how quotation marks get escaped when commands are passed through Docker, I recommend running the command above inside an interactive session. As a Python file, it looks like this:

import cupy as cp

# Select GPU 0 explicitly; .id passes the integer index the runtime API expects.
dev = cp.cuda.Device(0)
cp.cuda.runtime.setDevice(dev.id)
print(f'default device id: {cp.cuda.Device().id}')

# Allocate a small array so the default memory pool reports nonzero usage.
empty = cp.empty((2, 2))
print(f'memory alloc. by pool: {cp.get_default_memory_pool().used_bytes()}')
print(f'num. dev. detected by cupy: {cp.cuda.runtime.getDeviceCount()}')
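
If the container can see the GPU, a healthy run of this snippet prints something like the following (the byte count is indicative only: CuPy's default memory pool rounds small allocations up, so a 2x2 float64 array typically reports 512 bytes):

default device id: 0
memory alloc. by pool: 512
num. dev. detected by cupy: 1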

If these commands fail, it means CuPy inside the Appliance container cannot detect any GPUs, which points to a problem with the system configuration itself.

Can you run these commands and share the output with us?

Thanks, Matthew J

giladqm commented 1 month ago

The issue was with the MIG setup. My colleague fixed it:

(base) nikola@gracehopper:~$  sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 7g.96gb          0        0          0:8     |
+-------------------------------------------------------+
(base) nikola@gracehopper:~$ echo $CUDA_VISIBLE_DEVICES

(base) nikola@gracehopper:~$ nvidia-smi -L
GPU 0: NVIDIA GH200 480GB (UUID: GPU-d8731c65-c898-919e-74c9-286b27400dac)
  MIG 7g.96gb     Device  0: (UUID: MIG-7baedeb1-c0d7-53ba-9926-2e341a42b470)
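
For anyone who lands here with the same "system not yet initialized" error: the fix amounts to enabling MIG mode and then creating a GPU instance plus its compute instance. A hedged sketch of the usual sequence (profile ID 0 corresponds to 7g.96gb on this GH200, as in the listing above; confirm the IDs on your own system with -lgip):

sudo nvidia-smi -i 0 -mig 1    # enable MIG mode on GPU 0 (may require a GPU reset)
sudo nvidia-smi mig -lgip      # list the GPU instance profiles this GPU supports
sudo nvidia-smi mig -cgi 0 -C  # create a GPU instance for profile 0 plus a matching compute instance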