NVIDIA / nvidia-container-runtime

NVIDIA container runtime

Unable to determine the device handle for GPU0000:65:00.0: Unknown Error -- only from container #189

Closed: ztz0223 closed this issue 1 year ago

ztz0223 commented 1 year ago

Hi there,

I need your help :)

I installed two RTX A4000 graphics cards in my Dell T5820, which is running RHEL 8.6. After installing the NVIDIA driver and CUDA, I can run nvidia-smi on the host without any problems:

[xxx ~]$ nvidia-smi
Wed Jun  7 22:34:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:65:00.0  On |                  Off |
| 41%   30C    P8    11W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    Off  | 00000000:B3:00.0 Off |                  Off |
| 41%   30C    P8     8W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    0   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
|    1   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    1   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
+-----------------------------------------------------------------------------+

[xxx ~]$ nvidia-smi -i 0
Wed Jun  7 22:38:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:65:00.0  On |                  Off |
| 41%   29C    P8    10W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    0   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
+-----------------------------------------------------------------------------+
[xxx ~]$ nvidia-smi -i 1
Wed Jun  7 22:38:17 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   1  NVIDIA RTX A4000    Off  | 00000000:B3:00.0  On |                  Off |
| 41%   30C    P8     8W / 140W |     93MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      5301      G   /usr/libexec/Xorg                  40MiB |
|    1   N/A  N/A      5432      G   /usr/bin/gnome-shell               51MiB |
+-----------------------------------------------------------------------------+
[xxx ~]$
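
For reference, the index-to-bus-ID mapping on the host can be double-checked by listing the GPUs with their UUIDs; I am only sketching the commands here and omitting their output (index 0 is the card at 00000000:65:00.0 and index 1 the card at 00000000:B3:00.0, as shown above):

[xxx ~]$ nvidia-smi -L
[xxx ~]$ nvidia-smi --query-gpu=index,uuid,pci.bus_id --format=csv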

But when I run nvidia-smi inside a container, one of the devices reports an error:

[xxx ~]$ docker run --rm --runtime=nvidia --gpus '"device=1"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 11.4.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

Wed Jun  7 14:26:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:B3:00.0  On |                  Off |
| 41%   33C    P8     8W / 140W |    135MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
[xxx ~]$ docker run --rm --runtime=nvidia --gpus '"device=0"' nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 11.4.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

Unable to determine the device handle for GPU0000:65:00.0: Unknown Error

I tried some other CUDA images, but they all fail the same way: Unable to determine the device handle for GPU0000:65:00.0: Unknown Error
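
To rule out a host-versus-container index-mapping problem, the GPU can also be selected by UUID instead of index; the UUID below is only a placeholder and has to be replaced with the value printed by nvidia-smi -L on the host:

[xxx ~]$ nvidia-smi -L    # note the UUID of the GPU at 00000000:65:00.0
[xxx ~]$ docker run --rm --runtime=nvidia \
      --gpus '"device=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"' \
      nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi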

So I started a shell inside the container and ran the diagnostic tools directly; they report these errors:

root@1d79e4242c2b:/# nvidia-smi -L
Unable to determine the device handle for gpu 0000:65:00.0: Unknown Error
root@1d79e4242c2b:/#
root@1d79e4242c2b:/# nvidia-debugdump --dumpall
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpGpuComponent() failed, return code: 0x3e7
ERROR: internal_dumpNvLogComponent() failed, return code: 0x3e7
root@1d79e4242c2b:/# nvidia-debugdump --list
Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error
root@1d79e4242c2b:/#
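
For comparison, the same GPU keeps responding on the host, so it is probably also worth checking the host kernel log for NVRM/Xid messages about the device at 0000:65:00.0; again just the commands as a sketch, output not captured:

[xxx ~]$ sudo dmesg | grep -iE "nvrm|xid"
[xxx ~]$ nvidia-smi -q -i 0 | head -n 30    # full query of GPU 0 on the host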

Any ideas about this?

Thanks a lot!

elezar commented 1 year ago

Closing this as a duplicate of https://github.com/NVIDIA/nvidia-container-toolkit/issues/69