Hi there, I need your help :)
I installed two RTX A4000 cards in my Dell T5820, which is running RHEL 8.6. After installing the NVIDIA driver and CUDA, I can run nvidia-smi correctly on the host:
[xxx ~]$ nvidia-smi
Wed Jun  7 22:34:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:65:00.0  On |                  Off |
| 41%   30C    P8    11W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:B3:00.0 Off |                  Off |
| 41%   30C    P8     8W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    0   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
|    1   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    1   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
+-----------------------------------------------------------------------------+
[xxx ~]$ nvidia-smi -i 0
Wed Jun  7 22:38:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:65:00.0  On |                  Off |
| 41%   29C    P8    10W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    0   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
+-----------------------------------------------------------------------------+
[xxx ~]$ nvidia-smi -i 1
Wed Jun  7 22:38:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA RTX A4000    Off  | 00000000:B3:00.0  On |                  Off |
| 41%   30C    P8     8W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    1   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
+-----------------------------------------------------------------------------+
[xxx ~]$
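For completeness, these are the host-side checks I can run to confirm the driver itself is healthy (just a sketch of the commands, output not pasted here; the UUIDs shown are placeholders):
# confirm the kernel modules are loaded
lsmod | grep nvidia
# kernel module / driver version
cat /proc/driver/nvidia/version
# list both boards with their UUIDs
nvidia-smi -L
# expected roughly:
#   GPU 0: NVIDIA RTX A4000 (UUID: GPU-xxxxxxxx-....)
#   GPU 1: NVIDIA RTX A4000 (UUID: GPU-yyyyyyyy-....)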
But when I run nvidia-smi inside a container, one of the devices reports an error:
[xxx ~]$ docker run --rm --runtime=nvidia --gpus '"device=1"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========
CUDA Version 11.4.0
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Wed Jun  7 14:26:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:B3:00.0  On |                  Off |
| 41%   33C    P8     8W / 140W |    135MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
[xxx ~]$ docker run --rm --runtime=nvidia --gpus '"device=0"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========
CUDA Version 11.4.0
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Unable to determine the device handle for GPU0000:65:00.0: Unknown Error
I tried some other CUDA images, but got the same failure:
Unable to determine the device handle for GPU0000:65:00.0: Unknown Error
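For what it's worth, the runs above select GPUs by index; as far as I know the NVIDIA Container Toolkit also accepts the UUIDs reported by nvidia-smi -L, so a container can be pinned to a specific board independent of enumeration order (sketch only, the UUID below is a placeholder):
# by index, as above
docker run --rm --runtime=nvidia --gpus '"device=0"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
# by UUID instead (placeholder UUID)
docker run --rm --runtime=nvidia --gpus '"device=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi
# or expose both cards
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi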
So I started an interactive shell in the container and ran the diagnostics there; these are the errors I got:
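(For reference, I get the root@... shell below roughly like this; same image as above, other flags from memory:)
docker run --rm -it --runtime=nvidia --gpus '"device=0"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 bash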
root@1d79e4242c2b:/# nvidia-smi -L
Unable to determine the device handle for gpu 0000:65:00.0: Unknown Error
root@1d79e4242c2b:/#
root@1d79e4242c2b:/# nvidia-debugdump --dumpall
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
root@1d79e4242c2b:/# nvidia-debugdump --list
Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
root@1d79e4242c2b:/#
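For what it's worth, these are the extra checks I plan to run next to narrow it down (sketch only, output not included here):
# inside the container: confirm which device nodes the toolkit mounted
ls -l /dev/nvidia*
# on the host: look for NVRM / Xid messages around the time of the failure
dmesg | grep -iE 'nvrm|xid'
# on the host: full per-GPU state for the board that fails in the container
nvidia-smi -q -i 0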
Any ideas about this?
Thanks a lot!