NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: detection error: nvml error: unknown error: unknown. #632

Crema-new closed this issue 3 months ago

Crema-new commented 3 months ago

I am hitting the same issue as #416.

Error output:

root@inspur:/home/devops/ais.stat# docker-compose up -d
Creating network "aisstat_default" with the default driver
Creating aisstat_pg_1    ... done
Creating aisstat_htdet_1 ... error

ERROR: for aisstat_htdet_1  Cannot start service htdet: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: device error: 2: unknown device: unknown

ERROR: for htdet  Cannot start service htdet: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: device error: 2: unknown device: unknown
ERROR: Encountered errors while bringing up the project.

Same Ubuntu version:

root@inspur:/home/devops/ais.stat# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
root@inspur:/home/devops/ais.stat# uname -r
5.15.0-117-generic

Lower Docker version:

Client: Docker Engine - Community
 Version:    27.1.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 23.0.1

daemon.json


root@inspur:/home/devops/ais.stat# cat /etc/docker/daemon.json
{
    "data-root": "/data/docker",
    "log-opts": {
        "max-file": "5",
        "max-size": "50m"
    },
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }

}

NVIDIA info:

root@inspur:/home/devops/ais.stat# nvidia-smi
Wed Aug  7 11:04:29 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:B1:00.0 Off |                    0 |
| N/A   66C    P0             32W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

root@inspur:/home/devops/ais.stat# nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0807 03:05:50.709582 20669 nvc.c:376] initializing library context (version=1.13.5, build=66607bd046341f7aad7de80a9f022f122d1f2fce)
I0807 03:05:50.709625 20669 nvc.c:350] using root /
I0807 03:05:50.709631 20669 nvc.c:351] using ldcache /etc/ld.so.cache
I0807 03:05:50.709634 20669 nvc.c:352] using unprivileged user 65534:65534
I0807 03:05:50.709656 20669 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0807 03:05:50.709834 20669 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0807 03:05:50.715497 20670 nvc.c:278] loading kernel module nvidia
I0807 03:05:50.715726 20670 nvc.c:282] running mknod for /dev/nvidiactl
I0807 03:05:50.715813 20670 nvc.c:286] running mknod for /dev/nvidia0
I0807 03:05:50.715878 20670 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0807 03:05:50.722432 20670 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0807 03:05:50.722524 20670 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0807 03:05:50.724188 20670 nvc.c:296] loading kernel module nvidia_uvm
I0807 03:05:50.724216 20670 nvc.c:300] running mknod for /dev/nvidia-uvm
I0807 03:05:50.724262 20670 nvc.c:305] loading kernel module nvidia_modeset
I0807 03:05:50.724280 20670 nvc.c:309] running mknod for /dev/nvidia-modeset
I0807 03:05:50.724555 20671 rpc.c:71] starting driver rpc service
I0807 03:05:52.436201 20682 rpc.c:71] starting nvcgo rpc service
I0807 03:05:52.437660 20669 nvc_info.c:798] requesting driver information with ''
I0807 03:05:52.438626 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.550.107.02
I0807 03:05:52.438780 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.550.107.02
I0807 03:05:52.438823 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.550.107.02
I0807 03:05:52.438854 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.550.107.02
I0807 03:05:52.438888 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.107.02
I0807 03:05:52.438942 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.550.107.02
I0807 03:05:52.438983 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.107.02
I0807 03:05:52.439028 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.550.107.02
I0807 03:05:52.439103 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.107.02
I0807 03:05:52.439153 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.107.02
I0807 03:05:52.439226 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.550.107.02
I0807 03:05:52.439275 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.107.02
I0807 03:05:52.439351 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.107.02
I0807 03:05:52.439399 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.550.107.02
I0807 03:05:52.439447 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.550.107.02
I0807 03:05:52.439498 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.550.107.02
I0807 03:05:52.439546 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.550.107.02
I0807 03:05:52.439618 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.550.107.02
I0807 03:05:52.439690 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.550.107.02
I0807 03:05:52.439743 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.107.02
I0807 03:05:52.439816 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.107.02
I0807 03:05:52.439891 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.550.107.02
I0807 03:05:52.440089 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.550.107.02
I0807 03:05:52.440136 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.550.107.02
I0807 03:05:52.440267 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.550.107.02
I0807 03:05:52.440319 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.550.107.02
I0807 03:05:52.440371 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.550.107.02
I0807 03:05:52.440421 20669 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.550.107.02
W0807 03:05:52.440452 20669 nvc_info.c:402] missing library libnvidia-nscq.so
W0807 03:05:52.440460 20669 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W0807 03:05:52.440466 20669 nvc_info.c:402] missing library libnvidia-compiler.so
W0807 03:05:52.440471 20669 nvc_info.c:402] missing library libnvidia-ifr.so
W0807 03:05:52.440477 20669 nvc_info.c:402] missing library libnvidia-cbl.so
W0807 03:05:52.440484 20669 nvc_info.c:406] missing compat32 library libnvidia-ml.so
W0807 03:05:52.440491 20669 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W0807 03:05:52.440500 20669 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W0807 03:05:52.440507 20669 nvc_info.c:406] missing compat32 library libcuda.so
W0807 03:05:52.440515 20669 nvc_info.c:406] missing compat32 library libcudadebugger.so
W0807 03:05:52.440525 20669 nvc_info.c:406] missing compat32 library libnvidia-opencl.so
W0807 03:05:52.440533 20669 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so
W0807 03:05:52.440540 20669 nvc_info.c:406] missing compat32 library libnvidia-ptxjitcompiler.so
W0807 03:05:52.440547 20669 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W0807 03:05:52.440554 20669 nvc_info.c:406] missing compat32 library libnvidia-allocator.so
W0807 03:05:52.440561 20669 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W0807 03:05:52.440570 20669 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W0807 03:05:52.440579 20669 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W0807 03:05:52.440589 20669 nvc_info.c:406] missing compat32 library libnvidia-nvvm.so
W0807 03:05:52.440596 20669 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W0807 03:05:52.440607 20669 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so
W0807 03:05:52.440615 20669 nvc_info.c:406] missing compat32 library libnvidia-encode.so
W0807 03:05:52.440626 20669 nvc_info.c:406] missing compat32 library libnvidia-opticalflow.so
W0807 03:05:52.440636 20669 nvc_info.c:406] missing compat32 library libnvcuvid.so
W0807 03:05:52.440646 20669 nvc_info.c:406] missing compat32 library libnvidia-eglcore.so
W0807 03:05:52.440654 20669 nvc_info.c:406] missing compat32 library libnvidia-glcore.so
W0807 03:05:52.440662 20669 nvc_info.c:406] missing compat32 library libnvidia-tls.so
W0807 03:05:52.440672 20669 nvc_info.c:406] missing compat32 library libnvidia-glsi.so
W0807 03:05:52.440680 20669 nvc_info.c:406] missing compat32 library libnvidia-fbc.so
W0807 03:05:52.440690 20669 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W0807 03:05:52.440698 20669 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W0807 03:05:52.440706 20669 nvc_info.c:406] missing compat32 library libnvoptix.so
W0807 03:05:52.440713 20669 nvc_info.c:406] missing compat32 library libGLX_nvidia.so
W0807 03:05:52.440721 20669 nvc_info.c:406] missing compat32 library libEGL_nvidia.so
W0807 03:05:52.440728 20669 nvc_info.c:406] missing compat32 library libGLESv2_nvidia.so
W0807 03:05:52.440736 20669 nvc_info.c:406] missing compat32 library libGLESv1_CM_nvidia.so
W0807 03:05:52.440744 20669 nvc_info.c:406] missing compat32 library libnvidia-glvkspirv.so
W0807 03:05:52.440751 20669 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I0807 03:05:52.441128 20669 nvc_info.c:302] selecting /usr/bin/nvidia-smi
I0807 03:05:52.441157 20669 nvc_info.c:302] selecting /usr/bin/nvidia-debugdump
I0807 03:05:52.441184 20669 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced
I0807 03:05:52.441228 20669 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control
I0807 03:05:52.441254 20669 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server
W0807 03:05:52.441371 20669 nvc_info.c:428] missing binary nv-fabricmanager
I0807 03:05:52.441444 20669 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/550.107.02/gsp_ga10x.bin
I0807 03:05:52.441453 20669 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/550.107.02/gsp_tu10x.bin
I0807 03:05:52.441492 20669 nvc_info.c:561] listing device /dev/nvidiactl
I0807 03:05:52.441500 20669 nvc_info.c:561] listing device /dev/nvidia-uvm
I0807 03:05:52.441508 20669 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I0807 03:05:52.441517 20669 nvc_info.c:561] listing device /dev/nvidia-modeset
W0807 03:05:52.441554 20669 nvc_info.c:352] missing ipc path /var/run/nvidia-persistenced/socket
W0807 03:05:52.441585 20669 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W0807 03:05:52.441607 20669 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I0807 03:05:52.441615 20669 nvc_info.c:854] requesting device information with ''
I0807 03:05:52.448609 20669 nvc_info.c:745] listing device /dev/nvidia0 (GPU-1476c0c7-a0b5-00b3-182e-0301f182b162 at 00000000:b1:00.0)
NVRM version:   550.107.02
CUDA version:   12.4
Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-1476c0c7-a0b5-00b3-182e-0301f182b162
Bus Location:   00000000:b1:00.0
Architecture:   7.5
I0807 03:05:52.448661 20669 nvc.c:434] shutting down library context
I0807 03:05:52.448695 20682 rpc.c:95] terminating nvcgo rpc service
I0807 03:05:52.449144 20669 rpc.c:135] nvcgo rpc service terminated successfully
I0807 03:05:52.826819 20671 rpc.c:95] terminating driver rpc service
I0807 03:05:52.826935 20669 rpc.c:135] driver rpc service terminated successfully

Permissions of the device nodes:

root@inspur:/home/devops/ais.stat# ls -al /dev/nv*
crw-rw-rw- 1 root root 195,   0 Aug  7 09:46 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Aug  7 09:46 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug  7 09:46 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508,   0 Aug  7 09:46 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508,   1 Aug  7 09:46 /dev/nvidia-uvm-tools
crw------- 1 root root  10, 144 Aug  7 09:43 /dev/nvram

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root root     80 Aug  7 09:46 .
drwxr-xr-x 20 root root   4340 Aug  7 10:56 ..
cr--------  1 root root 511, 1 Aug  7 09:46 nvidia-cap1
cr--r--r--  1 root root 511, 2 Aug  7 09:46 nvidia-cap2

I've tried reinstalling the driver several times, but I still get the same error.

Crema-new commented 3 months ago

I found that nvidia-docker itself works without any error:

root@inspur:/home/devops/ais.stat# docker run --runtime=nvidia f5ac1ad505db nvidia-smi

Wed Aug  7 06:46:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:B1:00.0 Off |                    0 |
| N/A   72C    P0              32W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

So I figured the problem must be in docker-compose or its configuration. After checking "docker-compose.yml", I found that the application was assigned two GPUs, while the machine actually has only one.
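For anyone who lands here with the same "unknown device" error, here is a minimal Python sketch of the check that would have caught the mismatch: compare the GPU indices a service requests with the indices the host reports. It assumes the host list comes from the text printed by `nvidia-smi -L`. The function names and the requested list `["0", "2"]` are hypothetical, since the actual compose file is not shown in this thread.

```python
import re


def available_gpu_indices(nvidia_smi_list_output):
    """Collect GPU indices from the text printed by `nvidia-smi -L`."""
    return set(re.findall(r"^GPU (\d+):", nvidia_smi_list_output, flags=re.MULTILINE))


def unknown_devices(requested, available):
    """Return the requested device indices that the host does not have."""
    return [dev for dev in requested if dev not in available]


# One Tesla T4, matching the single GPU reported on the host in this issue.
host_output = "GPU 0: Tesla T4 (UUID: GPU-1476c0c7-a0b5-00b3-182e-0301f182b162)\n"
available = available_gpu_indices(host_output)

# Hypothetical request: the real values from docker-compose.yml are not shown here.
print(unknown_devices(["0", "2"], available))  # ['2']
print(unknown_devices(["0"], available))       # []
```

An empty result means every requested index exists on the host. Any index that is returned is one the runtime would most likely reject when it tries to start the container.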