NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-docker suddenly stop working OCI runtime create failed nvml error #292

Open jzhang82119 opened 3 years ago

jzhang82119 commented 3 years ago

Today my nvidia-docker commands stopped working, and I don't know what the problem is.

docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: nvml error: unknown error: unknown.

NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1
Kernel 5.4.0-73-generic Ubuntu 18.04.5 LTS
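
For anyone comparing setups, the versions above can be gathered with standard commands (assuming an Ubuntu host; output will differ per machine):

# driver and CUDA version as reported by the host
nvidia-smi
# kernel version
uname -r
# distribution release
lsb_release -a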

jzhang82119 commented 3 years ago

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: nvml error: unknown error: unknown.

nvidia-docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: nvml error: unknown error: unknown.

jzhang82119 commented 3 years ago

nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0528 01:01:28.078540 17674 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0528 01:01:28.078613 17674 nvc.c:346] using root /
I0528 01:01:28.078621 17674 nvc.c:347] using ldcache /etc/ld.so.cache
I0528 01:01:28.078626 17674 nvc.c:348] using unprivileged user 65534:65534
I0528 01:01:28.078662 17674 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0528 01:01:28.078780 17674 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0528 01:01:28.082736 17675 nvc.c:274] loading kernel module nvidia
I0528 01:01:28.082970 17675 nvc.c:278] running mknod for /dev/nvidiactl
I0528 01:01:28.083014 17675 nvc.c:282] running mknod for /dev/nvidia0
I0528 01:01:28.083043 17675 nvc.c:282] running mknod for /dev/nvidia1
I0528 01:01:28.083066 17675 nvc.c:282] running mknod for /dev/nvidia2
I0528 01:01:28.083088 17675 nvc.c:282] running mknod for /dev/nvidia3
I0528 01:01:28.083108 17675 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0528 01:01:28.085774 17675 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0528 01:01:28.085923 17675 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0528 01:01:28.089810 17675 nvc.c:292] loading kernel module nvidia_uvm
I0528 01:01:28.089841 17675 nvc.c:296] running mknod for /dev/nvidia-uvm
I0528 01:01:28.089938 17675 nvc.c:301] loading kernel module nvidia_modeset
I0528 01:01:28.090008 17675 nvc.c:305] running mknod for /dev/nvidia-modeset
I0528 01:01:28.090360 17676 driver.c:101] starting driver service
I0528 01:01:47.378772 17674 nvc_info.c:676] requesting driver information with ''
I0528 01:01:47.380902 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.455.23.05
I0528 01:01:47.381134 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.455.23.05
I0528 01:01:47.381226 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.455.23.05
I0528 01:01:47.381269 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.455.23.05
I0528 01:01:47.381314 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.23.05
I0528 01:01:47.381374 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.455.23.05
I0528 01:01:47.381432 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.455.23.05
I0528 01:01:47.381467 17674 nvc_info.c:171] skipping /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.450.51.06
I0528 01:01:47.381506 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.455.23.05
I0528 01:01:47.381540 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.455.23.05
I0528 01:01:47.381588 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.455.23.05
I0528 01:01:47.381637 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.455.23.05
I0528 01:01:47.381672 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.455.23.05
I0528 01:01:47.381706 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.455.23.05
I0528 01:01:47.381740 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.455.23.05
I0528 01:01:47.381791 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.455.23.05
I0528 01:01:47.381842 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.455.23.05
I0528 01:01:47.381877 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.455.23.05
I0528 01:01:47.381908 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.455.23.05
I0528 01:01:47.381952 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.455.23.05
I0528 01:01:47.381991 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.455.23.05
I0528 01:01:47.382049 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.455.23.05
I0528 01:01:47.382433 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.455.23.05
I0528 01:01:47.382638 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.455.23.05
I0528 01:01:47.382677 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.455.23.05
I0528 01:01:47.382712 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.455.23.05
I0528 01:01:47.382750 17674 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.455.23.05
W0528 01:01:47.382779 17674 nvc_info.c:350] missing library libnvidia-nscq.so
W0528 01:01:47.382787 17674 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0528 01:01:47.382802 17674 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0528 01:01:47.382810 17674 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0528 01:01:47.382817 17674 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0528 01:01:47.382824 17674 nvc_info.c:354] missing compat32 library libcuda.so
W0528 01:01:47.382831 17674 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0528 01:01:47.382839 17674 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0528 01:01:47.382850 17674 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0528 01:01:47.382859 17674 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0528 01:01:47.382866 17674 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0528 01:01:47.382873 17674 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0528 01:01:47.382880 17674 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0528 01:01:47.382888 17674 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0528 01:01:47.382894 17674 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0528 01:01:47.382902 17674 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0528 01:01:47.382909 17674 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0528 01:01:47.382916 17674 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0528 01:01:47.382923 17674 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0528 01:01:47.382931 17674 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0528 01:01:47.382940 17674 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0528 01:01:47.382947 17674 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0528 01:01:47.382956 17674 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0528 01:01:47.382965 17674 nvc_info.c:354] missing compat32 library libnvoptix.so
W0528 01:01:47.382974 17674 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0528 01:01:47.382982 17674 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0528 01:01:47.382989 17674 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0528 01:01:47.383004 17674 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0528 01:01:47.383015 17674 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0528 01:01:47.383027 17674 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0528 01:01:47.383290 17674 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0528 01:01:47.383311 17674 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0528 01:01:47.383331 17674 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0528 01:01:47.383364 17674 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0528 01:01:47.383383 17674 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W0528 01:01:47.383421 17674 nvc_info.c:376] missing binary nv-fabricmanager
I0528 01:01:47.383449 17674 nvc_info.c:438] listing device /dev/nvidiactl
I0528 01:01:47.383456 17674 nvc_info.c:438] listing device /dev/nvidia-uvm
I0528 01:01:47.383464 17674 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0528 01:01:47.383470 17674 nvc_info.c:438] listing device /dev/nvidia-modeset
W0528 01:01:47.383503 17674 nvc_info.c:321] missing ipc /var/run/nvidia-persistenced/socket
W0528 01:01:47.383525 17674 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0528 01:01:47.383543 17674 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0528 01:01:47.383551 17674 nvc_info.c:733] requesting device information with ''
nvidia-container-cli: detection error: nvml error: unknown error
I0528 01:01:47.389646 17674 nvc.c:423] shutting down library context
I0528 01:01:48.385932 17676 driver.c:163] terminating driver service
I0528 01:01:48.386385 17674 driver.c:203] driver service terminated successfully

Kernel version 5.4.0-73-generic

Server:
 Engine:
  Version: 20.10.2
  API version: 1.41 (minimum version 1.12)
  Go version: go1.13.8
  Git commit: 20.10.2-0ubuntu1~18.04.2
  Built: Mon Mar 29 19:27:41 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.3.3-0ubuntu1~18.04.4
  GitCommit:
 runc:
  Version: spec: 1.0.2-dev
  GitCommit:
 docker-init:
  Version: 0.19.0
  GitCommit:

jzhang82119 commented 3 years ago

I have reinstalled docker/nvidia-docker, but that did not fix the error.

Still getting the same error.

nvidia-docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: nvml error: unknown error: unknown.
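
For anyone else attempting the same thing, a reinstall on Ubuntu 18.04 might look roughly like the following (this assumes the nvidia-docker2 packaging from NVIDIA's apt repository; adjust to match your install):

# reinstall the container toolkit packages
sudo apt-get install --reinstall nvidia-docker2 nvidia-container-toolkit
# restart the Docker daemon so the runtime hook configuration is reloaded
sudo systemctl restart docker
# re-run the failing test
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi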

cantenna commented 3 years ago

I'm experiencing the exact same error here as well, running the same versions. Hope this gets sorted soon!

jzhang82119 commented 3 years ago

still no update. sad.

jzhang82119 commented 3 years ago

It happens on different versions of the NVIDIA driver as well: 460, 462, 470.

Reinstalling the NVIDIA driver and Docker does not resolve this issue.
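
If reinstalling doesn't help, it may be worth confirming that NVML itself still works on the host, independent of any container. A rough check on an Ubuntu host might be:

# nvidia-smi uses NVML; if this fails, the problem is on the host, not in Docker
nvidia-smi
# look for driver / kernel-module mismatch messages
dmesg | grep -i nvrm
# confirm which libnvidia-ml the dynamic loader sees
ldconfig -p | grep libnvidia-ml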

elezar commented 3 years ago

@jzhang82119 @cantenna sorry for the late response. Could you also include the output of nvidia-smi on the host? Is persistence mode enabled on the devices?

@jzhang82119 you mentioned that it was working before. Was there some system update that you executed before you started seeing this behaviour?
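
For reference, both of those can be checked from the host, for example (Ubuntu paths assumed):

# show per-GPU persistence mode
nvidia-smi --query-gpu=index,persistence_mode --format=csv
# enable persistence mode on all GPUs if it is off
sudo nvidia-smi -pm 1
# see whether a recent (possibly unattended) update touched the driver or kernel packages
grep -iE 'nvidia|linux-image' /var/log/dpkg.log | tail -n 20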

OlegJakushkin commented 1 year ago

@elezar Have a similar issue, described here with logs