Closed giladqm closed 1 month ago
The Docker image is: nvcr.io/nvidia/cuquantum-appliance:24.03-arm64
@giladqm thanks for posting your issue. After an initial look into this, I can confirm a few things:
With CUDA driver 535, I cannot reproduce the issue you reported when running the example you mention from the Appliance container.
More generally, your issue seems to be related to an improperly configured MIG setup on the node in question.
To confirm, please launch an interactive session of the cuQuantum Appliance container with this command:
docker run --runtime nvidia --gpus all --rm -it nvcr.io/nvidia/cuquantum-appliance:24.03-arm64
Next, execute this command:
python -c "import cupy as cp; dev = cp.cuda.Device(0); cp.cuda.runtime.setDevice(dev); print(f'default device id: {cp.cuda.Device().id}'); empty = cp.empty((2,2)); print(f'memory alloc. by pool: {cp.get_default_memory_pool().used_bytes()}'); print(f'num. dev. detected by cupy: {cp.cuda.runtime.getDeviceCount()}')"
Because of how Docker escapes quotation marks, I recommend executing the command above in an interactive session. In a Python file, it looks like this:
import cupy as cp
dev = cp.cuda.Device(0)
cp.cuda.runtime.setDevice(dev)
print(f'default device id: {cp.cuda.Device().id}')
empty = cp.empty((2,2))
print(f'memory alloc. by pool: {cp.get_default_memory_pool().used_bytes()}')
print(f'num. dev. detected by cupy: {cp.cuda.runtime.getDeviceCount()}')
If these commands fail, it implies that cupy in the Appliance container cannot detect GPUs on your system, indicating a problem with the system.
Can you run these commands and share the output with us?
Thanks, Matthew J
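For anyone who wants to paste a single result back into the thread, the checks above can be wrapped so that a failure is reported as text instead of a traceback. This is only a minimal sketch: the function name `check_gpu_visibility` and the report keys are hypothetical, and it selects the device with `Device.use()` rather than `cp.cuda.runtime.setDevice`.

```python
def check_gpu_visibility(device_id=0):
    """Run the CuPy checks from this thread and collect the results in a dict.

    Hypothetical helper: the name and the report keys are illustrative only.
    """
    report = {"ok": False, "error": None}
    try:
        # Imported lazily so that a missing or broken CuPy install is reported too.
        import cupy as cp

        cp.cuda.Device(device_id).use()  # make device_id the current device
        report["device_id"] = cp.cuda.Device().id
        empty = cp.empty((2, 2))  # small allocation that goes through the default pool
        report["pool_used_bytes"] = cp.get_default_memory_pool().used_bytes()
        report["device_count"] = cp.cuda.runtime.getDeviceCount()
        report["ok"] = True
    except Exception as exc:  # e.g. ImportError or a CUDA runtime error
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    print(check_gpu_visibility())
```

If `ok` is `False`, the `error` entry holds the exception type and message, which is the part worth sharing here.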
The issue was with the MIG setup. My colleague fixed it:
(base) nikola@gracehopper:~$ sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 7g.96gb          0        0          0:8     |
+-------------------------------------------------------+
(base) nikola@gracehopper:~$ echo $CUDA_VISIBLE_DEVICES
(base) nikola@gracehopper:~$ nvidia-smi -L
GPU 0: NVIDIA GH200 480GB (UUID: GPU-d8731c65-c898-919e-74c9-286b27400dac)
MIG 7g.96gb Device 0: (UUID: MIG-7baedeb1-c0d7-53ba-9926-2e341a42b470)
(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ nvidia-smi
Tue May 21 07:58:14 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 480GB             On  |   00000009:01:00.0 Off |                   On |
| N/A   23C    P0             62W /  900W |       5MiB /  97871MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:10:07_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
(cuquantum-24.03) cuquantum@12c74b885e0a:~/examples$ python simon.py
Secret string = [0 1 1]
CUDA error: system not yet initialized device_management.h 91
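To compare what the host and the container each see, it can help to turn the `nvidia-smi -L` listing into structured data and check whether any MIG instance is listed. The sketch below parses the listing shown earlier in this comment. The function name `parse_nvidia_smi_list` and the regular expressions are assumptions based on that sample output, not an official format specification.

```python
import re

# Sample text copied from the `nvidia-smi -L` output in this thread.
SAMPLE = """\
GPU 0: NVIDIA GH200 480GB (UUID: GPU-d8731c65-c898-919e-74c9-286b27400dac)
  MIG 7g.96gb Device 0: (UUID: MIG-7baedeb1-c0d7-53ba-9926-2e341a42b470)
"""

# Patterns inferred from the sample above (an assumption, not a documented grammar).
GPU_RE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-f-]+)\)$")
MIG_RE = re.compile(r"^MIG (\S+)\s+Device\s+(\d+): \(UUID: (MIG-[0-9a-f-]+)\)$")


def parse_nvidia_smi_list(text):
    """Return one dict per GPU, each with a list of the MIG devices under it."""
    gpus = []
    for raw in text.splitlines():
        line = raw.strip()
        gpu = GPU_RE.match(line)
        if gpu:
            gpus.append({
                "index": int(gpu.group(1)),
                "name": gpu.group(2),
                "uuid": gpu.group(3),
                "mig": [],
            })
            continue
        mig = MIG_RE.match(line)
        if mig and gpus:
            # A MIG line belongs to the most recently listed GPU.
            gpus[-1]["mig"].append({
                "profile": mig.group(1),
                "device": int(mig.group(2)),
                "uuid": mig.group(3),
            })
    return gpus


if __name__ == "__main__":
    for gpu in parse_nvidia_smi_list(SAMPLE):
        print(gpu["index"], gpu["name"], [m["uuid"] for m in gpu["mig"]])
```

Running the same parser on the output of `nvidia-smi -L` captured inside the container would show whether the MIG instance is visible there as well.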