NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: detection error: nvml error: unknown error #416

Closed: shuoshadow closed this issue 7 months ago

shuoshadow commented 7 months ago
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown.
root@ubuntu:~# uname -a
Linux ubuntu 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@ubuntu:~# uname -r
5.15.0-100-generic
root@ubuntu:~# cat /etc/issue
Ubuntu 22.04.4 LTS \n \l
docker version
Client: Docker Engine - Community
 Version:           25.0.4
 API version:       1.44
 Go version:        go1.21.8
 Git commit:        1a576c5
 Built:             Wed Mar  6 16:32:12 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.4
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       061aa95
  Built:            Wed Mar  6 16:32:12 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 nvidia:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
root@ubuntu:~# nvidia-smi 
Mon Mar 18 10:11:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  ERR!                            Off| 00000000:01:00.0 Off |                  N/A |
| 30%   27C    P0               53W / 320W|      0MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@ubuntu:~# nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0318 10:12:52.766616 11514 nvc.c:393] initializing library context (version=1.14.6, build=d2eb0afe86f0b643e33624ee64f065dd60e952d4)
I0318 10:12:52.766720 11514 nvc.c:364] using root /
I0318 10:12:52.766731 11514 nvc.c:365] using ldcache /etc/ld.so.cache
I0318 10:12:52.766739 11514 nvc.c:366] using unprivileged user 65534:65534
I0318 10:12:52.766769 11514 nvc.c:410] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0318 10:12:52.766987 11514 nvc.c:412] dxcore initialization failed, continuing assuming a non-WSL environment
I0318 10:12:52.768666 11515 nvc.c:278] loading kernel module nvidia
I0318 10:12:52.768936 11515 nvc.c:282] running mknod for /dev/nvidiactl
I0318 10:12:52.769035 11515 nvc.c:286] running mknod for /dev/nvidia0
I0318 10:12:52.769087 11515 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0318 10:12:52.775334 11515 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0318 10:12:52.775398 11515 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0318 10:12:52.776083 11515 nvc.c:301] loading kernel module nvidia_uvm
I0318 10:12:52.776091 11515 nvc.c:305] running mknod for /dev/nvidia-uvm
I0318 10:12:52.776134 11515 nvc.c:310] loading kernel module nvidia_modeset
I0318 10:12:52.776140 11515 nvc.c:314] running mknod for /dev/nvidia-modeset
I0318 10:12:52.776433 11516 rpc.c:71] starting driver rpc service
I0318 10:12:53.213861 11520 rpc.c:71] starting nvcgo rpc service
I0318 10:12:53.215349 11514 nvc_info.c:797] requesting driver information with ''
I0318 10:12:53.217414 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.530.30.02
I0318 10:12:53.217561 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.530.30.02
I0318 10:12:53.217639 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.530.30.02
I0318 10:12:53.217688 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.530.30.02
I0318 10:12:53.217737 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.530.30.02
I0318 10:12:53.217805 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.530.30.02
I0318 10:12:53.217874 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.530.30.02
I0318 10:12:53.217921 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.530.30.02
I0318 10:12:53.217989 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.530.30.02
I0318 10:12:53.218036 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.530.30.02
I0318 10:12:53.218104 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.530.30.02
I0318 10:12:53.218149 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.530.30.02
I0318 10:12:53.218199 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.530.30.02
I0318 10:12:53.218247 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.530.30.02
I0318 10:12:53.218319 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.530.30.02
I0318 10:12:53.218391 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.530.30.02
I0318 10:12:53.218444 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.530.30.02
I0318 10:12:53.218494 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.530.30.02
I0318 10:12:53.218567 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.530.30.02
I0318 10:12:53.218639 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.530.30.02
I0318 10:12:53.218878 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.530.30.02
I0318 10:12:53.218927 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.530.30.02
I0318 10:12:53.219055 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.530.30.02
I0318 10:12:53.219111 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.530.30.02
I0318 10:12:53.219169 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.530.30.02
I0318 10:12:53.219227 11514 nvc_info.c:175] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.530.30.02
W0318 10:12:53.219262 11514 nvc_info.c:401] missing library libnvidia-nscq.so
W0318 10:12:53.219272 11514 nvc_info.c:401] missing library libnvidia-gpucomp.so
W0318 10:12:53.219283 11514 nvc_info.c:401] missing library libnvidia-fatbinaryloader.so
W0318 10:12:53.219293 11514 nvc_info.c:401] missing library libnvidia-pkcs11.so
W0318 10:12:53.219302 11514 nvc_info.c:401] missing library libnvidia-pkcs11-openssl3.so
W0318 10:12:53.219312 11514 nvc_info.c:401] missing library libnvidia-ifr.so
W0318 10:12:53.219321 11514 nvc_info.c:401] missing library libnvidia-cbl.so
W0318 10:12:53.219329 11514 nvc_info.c:405] missing compat32 library libnvidia-ml.so
W0318 10:12:53.219348 11514 nvc_info.c:405] missing compat32 library libnvidia-cfg.so
W0318 10:12:53.219358 11514 nvc_info.c:405] missing compat32 library libnvidia-nscq.so
W0318 10:12:53.219366 11514 nvc_info.c:405] missing compat32 library libcuda.so
W0318 10:12:53.219377 11514 nvc_info.c:405] missing compat32 library libcudadebugger.so
W0318 10:12:53.219388 11514 nvc_info.c:405] missing compat32 library libnvidia-opencl.so
W0318 10:12:53.219397 11514 nvc_info.c:405] missing compat32 library libnvidia-gpucomp.so
W0318 10:12:53.219407 11514 nvc_info.c:405] missing compat32 library libnvidia-ptxjitcompiler.so
W0318 10:12:53.219415 11514 nvc_info.c:405] missing compat32 library libnvidia-fatbinaryloader.so
W0318 10:12:53.219425 11514 nvc_info.c:405] missing compat32 library libnvidia-allocator.so
W0318 10:12:53.219433 11514 nvc_info.c:405] missing compat32 library libnvidia-compiler.so
W0318 10:12:53.219444 11514 nvc_info.c:405] missing compat32 library libnvidia-pkcs11.so
W0318 10:12:53.219451 11514 nvc_info.c:405] missing compat32 library libnvidia-pkcs11-openssl3.so
W0318 10:12:53.219460 11514 nvc_info.c:405] missing compat32 library libnvidia-nvvm.so
W0318 10:12:53.219469 11514 nvc_info.c:405] missing compat32 library libnvidia-ngx.so
W0318 10:12:53.219477 11514 nvc_info.c:405] missing compat32 library libvdpau_nvidia.so
W0318 10:12:53.219486 11514 nvc_info.c:405] missing compat32 library libnvidia-encode.so
W0318 10:12:53.219494 11514 nvc_info.c:405] missing compat32 library libnvidia-opticalflow.so
W0318 10:12:53.219505 11514 nvc_info.c:405] missing compat32 library libnvcuvid.so
W0318 10:12:53.219517 11514 nvc_info.c:405] missing compat32 library libnvidia-eglcore.so
W0318 10:12:53.219526 11514 nvc_info.c:405] missing compat32 library libnvidia-glcore.so
W0318 10:12:53.219532 11514 nvc_info.c:405] missing compat32 library libnvidia-tls.so
W0318 10:12:53.219540 11514 nvc_info.c:405] missing compat32 library libnvidia-glsi.so
W0318 10:12:53.219548 11514 nvc_info.c:405] missing compat32 library libnvidia-fbc.so
W0318 10:12:53.219556 11514 nvc_info.c:405] missing compat32 library libnvidia-ifr.so
W0318 10:12:53.219565 11514 nvc_info.c:405] missing compat32 library libnvidia-rtcore.so
W0318 10:12:53.219576 11514 nvc_info.c:405] missing compat32 library libnvoptix.so
W0318 10:12:53.219584 11514 nvc_info.c:405] missing compat32 library libGLX_nvidia.so
W0318 10:12:53.219592 11514 nvc_info.c:405] missing compat32 library libEGL_nvidia.so
W0318 10:12:53.219604 11514 nvc_info.c:405] missing compat32 library libGLESv2_nvidia.so
W0318 10:12:53.219612 11514 nvc_info.c:405] missing compat32 library libGLESv1_CM_nvidia.so
W0318 10:12:53.219621 11514 nvc_info.c:405] missing compat32 library libnvidia-glvkspirv.so
W0318 10:12:53.219627 11514 nvc_info.c:405] missing compat32 library libnvidia-cbl.so
I0318 10:12:53.220044 11514 nvc_info.c:301] selecting /usr/bin/nvidia-smi
I0318 10:12:53.220075 11514 nvc_info.c:301] selecting /usr/bin/nvidia-debugdump
I0318 10:12:53.220105 11514 nvc_info.c:301] selecting /usr/bin/nvidia-persistenced
I0318 10:12:53.220156 11514 nvc_info.c:301] selecting /usr/bin/nvidia-cuda-mps-control
I0318 10:12:53.220189 11514 nvc_info.c:301] selecting /usr/bin/nvidia-cuda-mps-server
W0318 10:12:53.220361 11514 nvc_info.c:427] missing binary nv-fabricmanager
I0318 10:12:53.220454 11514 nvc_info.c:487] listing firmware path /lib/firmware/nvidia/530.30.02/gsp_ga10x.bin
I0318 10:12:53.220465 11514 nvc_info.c:487] listing firmware path /lib/firmware/nvidia/530.30.02/gsp_tu10x.bin
I0318 10:12:53.220512 11514 nvc_info.c:560] listing device /dev/nvidiactl
I0318 10:12:53.220522 11514 nvc_info.c:560] listing device /dev/nvidia-uvm
I0318 10:12:53.220531 11514 nvc_info.c:560] listing device /dev/nvidia-uvm-tools
I0318 10:12:53.220541 11514 nvc_info.c:560] listing device /dev/nvidia-modeset
W0318 10:12:53.220587 11514 nvc_info.c:351] missing ipc path /var/run/nvidia-persistenced/socket
W0318 10:12:53.220625 11514 nvc_info.c:351] missing ipc path /var/run/nvidia-fabricmanager/socket
W0318 10:12:53.220651 11514 nvc_info.c:351] missing ipc path /tmp/nvidia-mps
I0318 10:12:53.220661 11514 nvc_info.c:853] requesting device information with ''
nvidia-container-cli: detection error: nvml error: unknown error
I0318 10:12:53.226723 11514 nvc.c:452] shutting down library context
I0318 10:12:53.226862 11520 rpc.c:95] terminating nvcgo rpc service
I0318 10:12:53.227618 11514 rpc.c:135] nvcgo rpc service terminated successfully
I0318 10:12:53.296640 11516 rpc.c:95] terminating driver rpc service
I0318 10:12:53.296725 11514 rpc.c:135] driver rpc service terminated successfully
elezar commented 7 months ago

@shuoshadow looking at the nvidia-smi output on the host, it seems as if there is a problem with the driver installation.

Could you please confirm the permissions of the device nodes:

ls -al /dev/nv*

And confirm that running sudo nvidia-smi gives different results (specifically that the name does not show ERR!).
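
For anyone else hitting this error, the two checks above can be scripted roughly as follows. This is only a sketch: `check_smi_output` is a hypothetical helper name, not part of any NVIDIA tool, and it flags `ERR!` in any field of the `nvidia-smi` table, not just the GPU name.

```shell
#!/bin/sh
# Sketch only: check_smi_output is a hypothetical helper, not an NVIDIA tool.
# Reads nvidia-smi table output on stdin. Prints "ERR" and returns 1 if any
# field shows ERR!, otherwise prints "OK" and returns 0.
check_smi_output() {
    if grep -q 'ERR!'; then
        echo "ERR"
        return 1
    fi
    echo "OK"
}

# Device node permissions, as requested above (quiet if no nodes exist).
ls -al /dev/nv* 2>/dev/null || echo "no /dev/nv* device nodes found"

# Only query the driver when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi | check_smi_output
fi
```

On the host in this report, the `ERR!` in the name column would make the helper print `ERR`.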

shuoshadow commented 7 months ago

@elezar Thank you for your reply. Here is the output of those commands. Is there any problem? [image]

shuoshadow commented 7 months ago

I upgraded the driver to version 550.54.14, and that solved the problem. [image]
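
To confirm that an upgrade like this actually took effect, the running driver version can be compared against the one reported to work here. This is a minimal sketch: `version_ge` is a hypothetical helper, it assumes GNU `sort -V` is available, and 550.54.14 is simply the version from this thread, not a documented minimum.

```shell
#!/bin/sh
# Sketch only: version_ge is a hypothetical helper for illustration.
# Succeeds when dotted version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the running driver against the version reported to work in this thread.
if command -v nvidia-smi >/dev/null 2>&1; then
    current=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_ge "$current" "550.54.14"; then
        echo "driver $current is at or above 550.54.14"
    else
        echo "driver $current is older than 550.54.14"
    fi
fi
```

If the GPU name still shows ERR! after the upgrade, the version check alone will not reveal it, so it is worth rerunning plain `nvidia-smi` as well.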