NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown. #183

Open chunniunai220ml opened 4 years ago

chunniunai220ml commented 4 years ago

I have configured Docker 19.03.6 and nvidia-docker successfully. But when I test:

docker run --gpus all nvidia/cuda:10.0-base nvidia-smi

I get these errors:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\n\\"\"": unknown.

Then I checked nvidia-container-cli, and it seems to show no error: sudo nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0226 06:26:25.224982 78809 nvc.c:281] initializing library context (version=1.0.2, build=ff40da533db929bf515aca59ba4c701a65a35e6b)
I0226 06:26:25.225050 78809 nvc.c:255] using root /
I0226 06:26:25.225061 78809 nvc.c:256] using ldcache /etc/ld.so.cache
I0226 06:26:25.225071 78809 nvc.c:257] using unprivileged user 65534:65534
I0226 06:26:25.230611 78810 nvc.c:191] loading kernel module nvidia
I0226 06:26:25.230931 78810 nvc.c:203] loading kernel module nvidia_uvm
I0226 06:26:25.231053 78810 nvc.c:211] loading kernel module nvidia_modeset
I0226 06:26:25.231436 78811 driver.c:133] starting driver service
I0226 06:26:25.356687 78809 nvc_info.c:434] requesting driver information with ''
I0226 06:26:25.356983 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.418.87.00
I0226 06:26:25.357280 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.418.87.00
I0226 06:26:25.357333 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.87.00
I0226 06:26:25.357441 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.418.87.00
I0226 06:26:25.357512 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.87.00
I0226 06:26:25.357559 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.87.00
I0226 06:26:25.357629 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.418.87.00
I0226 06:26:25.357711 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.418.87.00
I0226 06:26:25.357760 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.418.87.00
I0226 06:26:25.357806 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.418.87.00
I0226 06:26:25.357868 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.87.00
I0226 06:26:25.357928 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.418.87.00
I0226 06:26:25.358002 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.418.87.00
I0226 06:26:25.358053 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.87.00
I0226 06:26:25.358108 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.87.00
I0226 06:26:25.358179 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.418.87.00
I0226 06:26:25.358606 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.418.87.00
I0226 06:26:25.358847 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.418.87.00
I0226 06:26:25.358902 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.418.87.00
I0226 06:26:25.358951 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.418.87.00
I0226 06:26:25.359001 78809 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.418.87.00
W0226 06:26:25.359039 78809 nvc_info.c:303] missing compat32 library libnvidia-ml.so
W0226 06:26:25.359047 78809 nvc_info.c:303] missing compat32 library libnvidia-cfg.so
W0226 06:26:25.359056 78809 nvc_info.c:303] missing compat32 library libcuda.so
W0226 06:26:25.359066 78809 nvc_info.c:303] missing compat32 library libnvidia-opencl.so
W0226 06:26:25.359076 78809 nvc_info.c:303] missing compat32 library libnvidia-ptxjitcompiler.so
W0226 06:26:25.359086 78809 nvc_info.c:303] missing compat32 library libnvidia-fatbinaryloader.so
W0226 06:26:25.359097 78809 nvc_info.c:303] missing compat32 library libnvidia-compiler.so
W0226 06:26:25.359107 78809 nvc_info.c:303] missing compat32 library libvdpau_nvidia.so
W0226 06:26:25.359117 78809 nvc_info.c:303] missing compat32 library libnvidia-encode.so
W0226 06:26:25.359128 78809 nvc_info.c:303] missing compat32 library libnvidia-opticalflow.so
W0226 06:26:25.359138 78809 nvc_info.c:303] missing compat32 library libnvcuvid.so
W0226 06:26:25.359149 78809 nvc_info.c:303] missing compat32 library libnvidia-eglcore.so
W0226 06:26:25.359159 78809 nvc_info.c:303] missing compat32 library libnvidia-glcore.so
W0226 06:26:25.359169 78809 nvc_info.c:303] missing compat32 library libnvidia-tls.so
W0226 06:26:25.359177 78809 nvc_info.c:303] missing compat32 library libnvidia-glsi.so
W0226 06:26:25.359186 78809 nvc_info.c:303] missing compat32 library libnvidia-fbc.so
W0226 06:26:25.359194 78809 nvc_info.c:303] missing compat32 library libnvidia-ifr.so
W0226 06:26:25.359203 78809 nvc_info.c:303] missing compat32 library libGLX_nvidia.so
W0226 06:26:25.359212 78809 nvc_info.c:303] missing compat32 library libEGL_nvidia.so
W0226 06:26:25.359220 78809 nvc_info.c:303] missing compat32 library libGLESv2_nvidia.so
W0226 06:26:25.359253 78809 nvc_info.c:303] missing compat32 library libGLESv1_CM_nvidia.so
I0226 06:26:25.359527 78809 nvc_info.c:229] selecting /usr/bin/nvidia-smi
I0226 06:26:25.359560 78809 nvc_info.c:229] selecting /usr/bin/nvidia-debugdump
I0226 06:26:25.359585 78809 nvc_info.c:229] selecting /usr/bin/nvidia-persistenced
I0226 06:26:25.359608 78809 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-control
I0226 06:26:25.359632 78809 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-server
I0226 06:26:25.359667 78809 nvc_info.c:366] listing device /dev/nvidiactl
I0226 06:26:25.359676 78809 nvc_info.c:366] listing device /dev/nvidia-uvm
I0226 06:26:25.359687 78809 nvc_info.c:366] listing device /dev/nvidia-uvm-tools
I0226 06:26:25.359697 78809 nvc_info.c:366] listing device /dev/nvidia-modeset
W0226 06:26:25.359731 78809 nvc_info.c:274] missing ipc /var/run/nvidia-persistenced/socket
W0226 06:26:25.359753 78809 nvc_info.c:274] missing ipc /tmp/nvidia-mps
I0226 06:26:25.359763 78809 nvc_info.c:490] requesting device information with ''
I0226 06:26:25.366457 78809 nvc_info.c:520] listing device /dev/nvidia0 (GPU-03bb5927-ceaa-4166-ff1e-1d58a8cbf883 at 00000000:05:00.0)
I0226 06:26:25.373129 78809 nvc_info.c:520] listing device /dev/nvidia1 (GPU-26602c4d-2069-84f3-3bc9-5d943fb3bdb4 at 00000000:06:00.0)
I0226 06:26:25.380167 78809 nvc_info.c:520] listing device /dev/nvidia2 (GPU-0687efee-81a2-537e-d7fe-3a5694aceb29 at 00000000:85:00.0)
I0226 06:26:25.387215 78809 nvc_info.c:520] listing device /dev/nvidia3 (GPU-4c95eb5b-8940-562c-742f-2078cb3a02eb at 00000000:86:00.0)
NVRM version: 418.87.00
CUDA version: 10.1

Device Index: 0 Device Minor: 0 Model: Tesla K80 Brand: Tesla GPU UUID: GPU-03bb5927-ceaa-4166-ff1e-1d58a8cbf883 Bus Location: 00000000:05:00.0 Architecture: 3.7

Device Index: 1 Device Minor: 1 Model: Tesla K80 Brand: Tesla GPU UUID: GPU-26602c4d-2069-84f3-3bc9-5d943fb3bdb4 Bus Location: 00000000:06:00.0 Architecture: 3.7

Device Index: 2 Device Minor: 2 Model: Tesla K80 Brand: Tesla GPU UUID: GPU-0687efee-81a2-537e-d7fe-3a5694aceb29 Bus Location: 00000000:85:00.0 Architecture: 3.7

Device Index: 3 Device Minor: 3 Model: Tesla K80 Brand: Tesla GPU UUID: GPU-4c95eb5b-8940-562c-742f-2078cb3a02eb Bus Location: 00000000:86:00.0 Architecture: 3.7

I0226 06:26:25.387330 78809 nvc.c:318] shutting down library context
I0226 06:26:25.388428 78811 driver.c:192] terminating driver service
I0226 06:26:25.440777 78809 driver.c:233] driver service terminated successfully

Is the NVIDIA driver version too low? In fact, 418.87.00 is the version the official NVIDIA site recommends. Also, how can I update the driver via apt instead of manually with the driver .run file? I don't know how to make it work. Can anyone help me?
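
For what it's worth, a minimal sketch of switching from the .run installer to a packaged driver on Ubuntu (the package name and driver branch below are only examples; ubuntu-drivers reports what actually fits your GPU):

sudo /usr/bin/nvidia-uninstall        # removes a driver previously installed with the .run file, if present
sudo apt update
ubuntu-drivers devices                # lists the packaged driver versions recommended for the detected GPU
sudo apt install nvidia-driver-440    # example branch only; pick one of the recommended packages
sudo reboot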

chunniunai220ml commented 4 years ago

I also reinstalled the NVIDIA driver with NVIDIA-Linux-x86_64-440.33.01.run and hit the same error.

soheilade commented 4 years ago

Same problem here on:


Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:    18.04
Codename:   bionic

My Docker is version 19.03.6, build 369ce74a3c, and I installed the NVIDIA driver from here. When I run sudo nvidia-container-cli -k -d /dev/tty info, the output is:

I0228 09:13:49.695833 1120 nvc.c:281] initializing library context (version=1.0.7, build=b71f87c04b8eca8a16bf60995506c35c937347d9)
I0228 09:13:49.695933 1120 nvc.c:255] using root /
I0228 09:13:49.695948 1120 nvc.c:256] using ldcache /etc/ld.so.cache
I0228 09:13:49.695958 1120 nvc.c:257] using unprivileged user 65534:65534
I0228 09:13:49.696847 1121 nvc.c:191] loading kernel module nvidia
E0228 09:13:50.186352 1121 nvc.c:193] could not load kernel module nvidia
I0228 09:13:50.186425 1121 nvc.c:203] loading kernel module nvidia_uvm
E0228 09:13:50.628481 1121 nvc.c:205] could not load kernel module nvidia_uvm
I0228 09:13:50.628508 1121 nvc.c:211] loading kernel module nvidia_modeset
E0228 09:13:51.064044 1121 nvc.c:213] could not load kernel module nvidia_modeset
I0228 09:13:51.064251 1129 driver.c:133] starting driver service
I0228 09:13:51.066557 1120 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: cuda error: unknown error

The output of my attempt to run docker run --gpus all nvidia/cuda:10.0-base nvidia-smi is as follows:

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/defdc438de52aef6ec0266539ea834320a9580f75bac6b71cfd2d2e3c999aae9/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled 

any idea?

chunniunai220ml commented 4 years ago

@soheilade have you solved the problem?

soheilade commented 4 years ago

Yeah, try reinstalling the NVIDIA driver from here, then run this Docker command to launch the CARLA server in a container:

docker run -p 2000-2002:2000-2002 --rm -d -it -e NVIDIA_VISIBLE_DEVICES=0 --runtime nvidia carlasim/carla:0.9.5 ./CarlaUE4.sh /Game/Maps/Town01

RenaudWasTaken commented 4 years ago

This points to an error with the driver. Can you install the CUDA samples on the host machine and try to run, for example, deviceQuery?
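
For reference, a rough sketch of building and running deviceQuery from the CUDA samples (this assumes the samples were installed with the toolkit under /usr/local/cuda/samples; on newer CUDA versions the samples live in a separate repository, so the path may differ):

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make        # builds deviceQuery with the toolkit's nvcc
./deviceQuery    # should print the detected GPUs and end with "Result = PASS"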

chunniunai220ml commented 4 years ago

@RenaudWasTaken I think I have installed the driver successfully; I can use TensorFlow 1.14.0 on the host machine. I ran the following commands:

1. cat /proc/driver/nvidia/version shows:

NVRM version: NVIDIA UNIX x86_64 Kernel Module 440.33.01 Wed Nov 13 00:00:22 UTC 2019 GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)

2. sudo dpkg --list | grep nvidia-* shows:

iU  libnvidia-container-tools   1.0.7-1  amd64  NVIDIA container runtime library (command-line tools)
iU  libnvidia-container1:amd64  1.0.7-1  amd64  NVIDIA container runtime library

3. Running deviceQuery shows:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla K80" CUDA Driver Version / Runtime Version 10.2 / 9.0 CUDA Capability Major/Minor version number: 3.7 Total amount of global memory: 11441 MBytes (11996954624 bytes) (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores GPU Max Clock rate: 824 MHz (0.82 GHz) Memory Clock rate: 2505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 1572864 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Enabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: No Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K80" CUDA Driver Version / Runtime Version 10.2 / 9.0 CUDA Capability Major/Minor version number: 3.7 Total amount of global memory: 11441 MBytes (11996954624 bytes) (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores GPU Max Clock rate: 824 MHz (0.82 GHz) Memory Clock rate: 2505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 1572864 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Enabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: No Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 6 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "Tesla K80" CUDA Driver Version / Runtime Version 10.2 / 9.0 CUDA Capability Major/Minor version number: 3.7 Total amount of global memory: 11441 MBytes (11996954624 bytes) (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores GPU Max Clock rate: 824 MHz (0.82 GHz) Memory Clock rate: 2505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 1572864 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Enabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: No Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 133 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "Tesla K80" CUDA Driver Version / Runtime Version 10.2 / 9.0 CUDA Capability Major/Minor version number: 3.7 Total amount of global memory: 11441 MBytes (11996954624 bytes) (13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores GPU Max Clock rate: 824 MHz (0.82 GHz) Memory Clock rate: 2505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 1572864 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Enabled Device supports Unified Addressing (UVA): Yes Supports Cooperative Kernel Launch: No Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 134 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : Yes
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 9.0, NumDevs = 4 Result = PASS

What's wrong in this information? I can't find anything.

RenaudWasTaken commented 4 years ago

OK, nothing wrong with CUDA. The other two things that might help are:

  1. building and running one of the driver-API CUDA samples (e.g. vectorAddDrv), and
  2. running sudo nvidia-bug-report.sh and sharing the generated log.

chunniunai220ml commented 4 years ago

  1. In fact, I do not know how to use vectorAddDrv. I did cd /usr/local/cuda/samples/0_Simple/vectorAddDrv, then sudo make, which generated vectorAddDrv* (see the sketch after this comment for how to run it).
  2. sudo nvidia-bug-report.sh generated nvidia-bug-report.log.gz, with some errors such as: ff:15.2 System peripheral [0880]: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 Integrated Memory Controller 0 Channel 2 ERROR Registers [8086:2fb6] (rev 02)

I think this is not a critical error. What information should I look at?
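
For completeness, a hedged sketch of running the vectorAddDrv sample that was built above (same assumption as before, i.e. the samples live under /usr/local/cuda/samples):

cd /usr/local/cuda/samples/0_Simple/vectorAddDrv
sudo make        # produces the vectorAddDrv binary (already done above)
./vectorAddDrv   # exercises the CUDA driver API directly and verifies the vector-add result

If this runs cleanly on the host while the container still fails, the problem is more likely in the container runtime stack than in the driver itself.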

chunniunai220ml commented 4 years ago

@RenaudWasTaken The problem has not been solved for me. Can you give me further help?

ReyRen commented 4 years ago

Same error occurs for me~~~heh

harendracmaps commented 4 years ago

I am facing the same problem as well.

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

tx2-01:~$ uname -a

Linux jetson-tx2-01 4.9.140-tegra #1 SMP PREEMPT Mon Aug 12 21:29:52 PDT 2019 aarch64 aarch64 aarch64 GNU/Linux

tx2-01:~$ sudo nvidia-container-cli -k -d /dev/tty info [sudo] password for civilmaps:

-- WARNING, the following logs are for debugging purposes only --

I0609 06:28:32.004669 8657 nvc.c:281] initializing library context (version=1.1.1, build=e5d6156aba457559979597c8e3d22c5d8d0622db)
I0609 06:28:32.004901 8657 nvc.c:255] using root /
I0609 06:28:32.004930 8657 nvc.c:256] using ldcache /etc/ld.so.cache
I0609 06:28:32.004947 8657 nvc.c:257] using unprivileged user 65534:65534
W0609 06:28:32.005415 8657 nvc.c:171] failed to detect NVIDIA devices
I0609 06:28:32.005723 8658 nvc.c:191] loading kernel module nvidia
E0609 06:28:32.006013 8658 nvc.c:193] could not load kernel module nvidia
I0609 06:28:32.006037 8658 nvc.c:203] loading kernel module nvidia_uvm
E0609 06:28:32.006142 8658 nvc.c:205] could not load kernel module nvidia_uvm
I0609 06:28:32.006161 8658 nvc.c:211] loading kernel module nvidia_modeset
E0609 06:28:32.006259 8658 nvc.c:213] could not load kernel module nvidia_modeset
I0609 06:28:32.007119 8659 driver.c:101] starting driver service
E0609 06:28:32.009737 8659 driver.c:161] could not start driver service: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
I0609 06:28:32.010706 8657 driver.c:196] driver service terminated successfully
nvidia-container-cli: initialization error: driver error: failed to process request

tx2-01:~$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled 

paldana-ISI commented 4 years ago

@harendracmaps Have you solved your issue? I'm having the same exact error except I'm running it on an NVIDIA Xavier AGX

Running on the following specs:

nvidia@x02:~$ uname -a
Linux x02 4.9.140-tegra #1 SMP PREEMPT Mon Dec 9 22:52:02 PST 2019 aarch64 aarch64 aarch64 GNU/Linux
$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 3.1, GCID: 18186506, BOARD: t186ref, EABI: aarch64, DATE: Tue Dec 10 07:03:07 UTC 2019
nvidia@x02:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Mon_Mar_11_22:13:24_CDT_2019
Cuda compilation tools, release 10.0, V10.0.326
nvidia@x02:~$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version               Architecture          Description
+++-================================-=====================-=====================-======================================================================
un  libgldispatch0-nvidia            <none>                <none>                (no description available)
ii  libnvidia-container-tools        1.2.0-1               arm64                 NVIDIA container runtime library (command-line tools)
ii  libnvidia-container0:arm64       0.9.0~beta.1          arm64                 NVIDIA container runtime library
ii  libnvidia-container1:arm64       1.2.0-1               arm64                 NVIDIA container runtime library
un  nvidia-304                       <none>                <none>                (no description available)
un  nvidia-340                       <none>                <none>                (no description available)
un  nvidia-384                       <none>                <none>                (no description available)
un  nvidia-common                    <none>                <none>                (no description available)
ii  nvidia-container-csv-cuda        10.0.326-1            arm64                 Jetpack CUDA CSV file
ii  nvidia-container-csv-cudnn       7.6.3.28-1+cuda10.0   arm64                 Jetpack CUDNN CSV file
ii  nvidia-container-csv-tensorrt    6.0.1.10-1+cuda10.0   arm64                 Jetpack TensorRT CSV file
ii  nvidia-container-csv-visionworks 1.6.0.500n            arm64                 Jetpack VisionWorks CSV file
ii  nvidia-container-runtime         3.1.0-1               arm64                 NVIDIA container runtime
un  nvidia-container-runtime-hook    <none>                <none>                (no description available)
ii  nvidia-container-toolkit         1.2.1-1               arm64                 NVIDIA container runtime hook
un  nvidia-cuda-dev                  <none>                <none>                (no description available)
un  nvidia-docker                    <none>                <none>                (no description available)
ii  nvidia-docker2                   2.2.0-1               all                   nvidia-docker CLI wrapper
ii  nvidia-l4t-3d-core               32.3.1-20191209230245 arm64                 NVIDIA GL EGL Package
ii  nvidia-l4t-apt-source            32.3.1-20191209230245 arm64                 NVIDIA L4T apt source list debian package
ii  nvidia-l4t-bootloader            32.3.1-20191209230245 arm64                 NVIDIA Bootloader Package
ii  nvidia-l4t-camera                32.3.1-20191209230245 arm64                 NVIDIA Camera Package
ii  nvidia-l4t-ccp-t186ref           32.3.1-20191209230245 arm64                 NVIDIA Compatibility Checking Package
ii  nvidia-l4t-configs               32.3.1-20191209230245 arm64                 NVIDIA configs debian package
ii  nvidia-l4t-core                  32.3.1-20191209230245 arm64                 NVIDIA Core Package
ii  nvidia-l4t-cuda                  32.3.1-20191209230245 arm64                 NVIDIA CUDA Package
ii  nvidia-l4t-firmware              32.3.1-20191209230245 arm64                 NVIDIA Firmware Package
ii  nvidia-l4t-graphics-demos        32.3.1-20191209230245 arm64                 NVIDIA graphics demo applications
ii  nvidia-l4t-gstreamer             32.3.1-20191209230245 arm64                 NVIDIA GST Application files
ii  nvidia-l4t-init                  32.3.1-20191209230245 arm64                 NVIDIA Init debian package
ii  nvidia-l4t-initrd                32.3.1-20191209230245 arm64                 NVIDIA initrd debian package
ii  nvidia-l4t-jetson-io             32.3.1-20191209230245 arm64                 NVIDIA Jetson.IO debian package
ii  nvidia-l4t-jetson-multimedia-api 32.3.1-20191209230245 arm64                 NVIDIA Jetson Multimedia API is a collection of lower-level APIs that
ii  nvidia-l4t-kernel                4.9.140-tegra-32.3.1- arm64                 NVIDIA Kernel Package
ii  nvidia-l4t-kernel-dtbs           4.9.140-tegra-32.3.1- arm64                 NVIDIA Kernel DTB Package
ii  nvidia-l4t-kernel-headers        4.9.140-tegra-32.3.1- arm64                 NVIDIA Linux Tegra Kernel Headers Package
ii  nvidia-l4t-multimedia            32.3.1-20191209230245 arm64                 NVIDIA Multimedia Package
ii  nvidia-l4t-multimedia-utils      32.3.1-20191209230245 arm64                 NVIDIA Multimedia Package
ii  nvidia-l4t-oem-config            32.3.1-20191209230245 arm64                 NVIDIA OEM-Config Package
ii  nvidia-l4t-tools                 32.3.1-20191209230245 arm64                 NVIDIA Public Test Tools Package
ii  nvidia-l4t-wayland               32.3.1-20191209230245 arm64                 NVIDIA Wayland Package
ii  nvidia-l4t-weston                32.3.1-20191209230245 arm64                 NVIDIA Weston Package
ii  nvidia-l4t-x11                   32.3.1-20191209230245 arm64                 NVIDIA X11 Package
ii  nvidia-l4t-xusb-firmware         32.3.1-20191209230245 arm64                 NVIDIA USB Firmware Package
un  nvidia-libopencl1-dev            <none>                <none>                (no description available)
un  nvidia-prime                     <none>                <none>                (no description available)

mildsunrise commented 3 years ago

if you get this error on a Jetson board:

could not start driver service: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory

Then it means you've installed nvidia-container-toolkit from the official repos (https://nvidia.github.io/nvidia-docker). nvidia-container-toolkit does not support Jetson right now, but there is a beta version in the jetpack repos that does. Remove the nvidia-docker repo, then reinstall nvidia-container-runtime and nvidia-jetpack.
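
Roughly, on a JetPack 4.x system that switch looks like the sketch below (the exact file name under /etc/apt/sources.list.d/ depends on how the nvidia-docker repo was added):

sudo rm /etc/apt/sources.list.d/nvidia-docker.list    # drop the nvidia.github.io repo entry
sudo apt update
sudo apt install --reinstall nvidia-container-runtime nvidia-jetpack    # pull the Jetson builds from the L4T apt repo
sudo systemctl restart docker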

tekh commented 3 years ago

Thank you!!

itzk-sgh commented 3 years ago

Upgrading nvidia-docker to nvidia-docker2-2.5.0 solved the problem perfectly.

CUDA Version: 11.0 docker-ce: 19.03.7 nvidia-docker2-2.5.0-1
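
For anyone wanting to reproduce that upgrade, a minimal sketch (this assumes the nvidia-docker apt repository is already configured; the pinned version is just the one reported to work here):

sudo apt update
sudo apt install nvidia-docker2=2.5.0-1       # or: sudo apt install --only-upgrade nvidia-docker2
sudo systemctl restart docker                 # restart the daemon so the updated runtime hook is picked up
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi    # quick sanity check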

ghost commented 3 years ago

@mildsunrise It seems nvidia-docker supports Jetson, but I am still getting this error even with nvidia-docker2-2.5.0-1.

mildsunrise commented 3 years ago

What makes you think nvidia-docker supports Jetson? The FAQ still says you need the SDK Manager (a.k.a. the Jetson repos). You need the version of nvidia-docker2 that comes with the Jetson repos, not the nvidia-docker one.

ghost commented 3 years ago

@mildsunrise Ah, by "Jetson repos" you mean JetPack, don't you? So that might very well be my issue. I am using the stock kernel ConnectTech provides, and I presumed that because it had L4T 32.4.4 installed it also had JetPack 4.2.2 installed, but I think I need to reflash it because the manufacturer probably does just a minimal install for QA purposes.

https://github.com/NVIDIA/nvidia-docker/wiki/NVIDIA-Container-Runtime-on-Jetson

AlexAshs commented 3 years ago

I am getting the same error with nvidia-container-toolkit/bionic,now 1.5.1-1 amd64 under Ubuntu Server 20.04 LTS, running headless. I installed the NVIDIA drivers via the .run file downloaded from the official NVIDIA page, and nvidia-smi is working, as is the hashcat benchmark.

However, when I run docker run --rm --gpus all nvidia/cuda:11.1-base nvidia-smi with docker-ce/focal,now 5:20.10.7~3-0~ubuntu-focal amd64, I get the aforementioned error docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.

I already tried newer drivers from the repositories up until version 470, but nothing worked.

Any ideas?

TheMarshalMole commented 3 years ago

It seems that NVIDIA continues to ignore Linux support.

elezar commented 3 years ago

@AlexAshs sorry for the delay in getting back to you. Would you mind creating a new ticket and including the debug output from /var/log/nvidia-container-toolkit.log? This logging can be enabled by uncommenting the #debug= line in the nvidia-container-cli section of the /etc/nvidia-container-runtime/config.toml file.

The reason I ask for a new issue is that this one has gotten quite long and seems to contain a mix of issues related to Jetson platforms and others that have been marked as fixed.
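
For reference, the relevant section of /etc/nvidia-container-runtime/config.toml looks roughly like this once the debug line mentioned above is uncommented (the surrounding keys are the usual defaults and may differ between versions):

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"

After reproducing the failure, the log can then be inspected with, for example, sudo tail -n 100 /var/log/nvidia-container-toolkit.log.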

AlexAshs commented 3 years ago

@elezar No worries, I was just trying things out; since this is my first dedicated GPU, nothing is in production just yet :D I have posted my issue here: Containers with gpus not starting up. I really don't post issues often, I prefer finding solutions first, so if something is missing or the title is bad, just let me know and I can provide what is necessary to tackle this.

TheMarshalMole commented 3 years ago

I have the exact same problem.

Configuration: Host: Windows 10 with WSL2, with CUDA installed.

Error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.

Command: docker run --gpus all --cpus 2 --name test -it pytorch/pytorch

Hardware: NVIDIA GeForce GTX 1660 Ti

Any solution to this problem?

AlexAshs commented 3 years ago

Are you installing updates from the Windows Insider Dev channel? It seems to be a requirement for this setup to work.

TheMarshalMole commented 3 years ago

@AlexAshs I am not signed up for the Insider program. Could you tell me which update is necessary? I will install it manually.

AlexAshs commented 3 years ago

@TheMarshalMole I found this guide, that should make things easier for you: https://www.forecr.io/blogs/installation/nvidia-docker-installation-for-ubuntu-in-wsl-2
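
In rough outline, the WSL 2 part of such a setup looks like the sketch below, run inside the Ubuntu distro (these are the historical nvidia-docker repository URLs, which may have moved since, and the CUDA-capable driver itself must be installed on the Windows side):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-docker2
sudo service docker restart    # WSL distros typically have no systemd, so use the service wrapper
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi    # sanity check from inside WSL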

anajar2198 commented 3 years ago

I installed my GPU driver from Software & Updates >> Additional Drivers and it solved my problem.

jinmiaoluo commented 1 year ago

My local environment is as follows:

a virtual machine (KVM + QEMU + libvirtd) running Arch Linux, accessing the host's RTX 3090 graphics card through PCI passthrough.

Programs in the virtual machine access the graphics card via Docker, resulting in the error mentioned in the title.

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

imorti commented 1 year ago

I just downloaded the Triton Inference Server repo and ran into the same error on a Mac, Ventura 13.4.1 (c).

Ran (as per instructions): docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.06-py3 tritonserver --model-repository=/models

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.