NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Follow official wiki but cannot run nvidia/cuda:11.0-base docker after running nvidia/driver:460.32.03-ubuntu16.04 #184

Open junwang-wish opened 2 years ago

junwang-wish commented 2 years ago

1. Issue or feature description

Cannot run nvidia/cuda:11.0-base docker after running nvidia/driver:460.32.03-ubuntu16.04 on Ubuntu 16.04 x86_64:

junwang@dgxone01:~$ sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\\\\nnvidia-container-cli: initialization error: change root failed: no such file or directory\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled

2. Steps to reproduce the issue

Installing according to the official wiki (https://docs.nvidia.com/datacenter/cloud-native/driver-containers/overview.html) leads to the error above:

sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"

sudo update-initramfs -u

sudo apt-get dist-upgrade # optional

sudo reboot

sudo docker run --name nvidia-driver -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  -v /var/log:/var/log \
  --restart=unless-stopped \
  nvidia/driver:460.32.03-ubuntu16.04 # after reboot

sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi # after starting nvidia driver container

3. Information to attach (optional if deemed irrelevant)

-- WARNING, the following logs are for debugging purposes only --

I0917 05:14:00.494137 40446 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I0917 05:14:00.494198 40446 nvc.c:350] using root /
I0917 05:14:00.494204 40446 nvc.c:351] using ldcache /etc/ld.so.cache
I0917 05:14:00.494214 40446 nvc.c:352] using unprivileged user 6520:500
I0917 05:14:00.494255 40446 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0917 05:14:00.494328 40446 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0917 05:14:00.498912 40447 nvc.c:273] failed to set inheritable capabilities
W0917 05:14:00.498968 40447 nvc.c:274] skipping kernel modules load due to failure
I0917 05:14:00.499450 40448 rpc.c:71] starting driver rpc service
I0917 05:14:00.503598 40449 rpc.c:71] starting nvcgo rpc service
I0917 05:14:00.513472 40446 nvc_info.c:766] requesting driver information with ''
I0917 05:14:00.517445 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/vdpau/libvdpau_nvidia.so.384.183
I0917 05:14:00.517973 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/tls/libnvidia-tls.so.384.183
I0917 05:14:00.518290 40446 nvc_info.c:175] skipping /usr/lib/nvidia-384/libnvidia-tls.so.384.183
I0917 05:14:00.518994 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-ptxjitcompiler.so.384.183
I0917 05:14:00.519746 40446 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.384.183
I0917 05:14:00.519844 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-ml.so.384.183
I0917 05:14:00.520493 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-ifr.so.384.183
I0917 05:14:00.521268 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-glsi.so.384.183
I0917 05:14:00.521921 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-glcore.so.384.183
I0917 05:14:00.522505 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-fbc.so.384.183
I0917 05:14:00.522602 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-fatbinaryloader.so.384.183
I0917 05:14:00.523201 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-encode.so.384.183
I0917 05:14:00.523850 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-eglcore.so.384.183
I0917 05:14:00.524466 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-compiler.so.384.183
I0917 05:14:00.524599 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvidia-cfg.so.384.183
I0917 05:14:00.525252 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libnvcuvid.so.384.183
I0917 05:14:00.525457 40446 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.384.183
I0917 05:14:00.526100 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libGLX_nvidia.so.384.183
I0917 05:14:00.526727 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libGLESv2_nvidia.so.384.183
I0917 05:14:00.527292 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libGLESv1_CM_nvidia.so.384.183
I0917 05:14:00.527896 40446 nvc_info.c:173] selecting /usr/lib/nvidia-384/libEGL_nvidia.so.384.183
I0917 05:14:00.530021 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/vdpau/libvdpau_nvidia.so.384.183
I0917 05:14:00.530449 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/tls/libnvidia-tls.so.384.183
I0917 05:14:00.530833 40446 nvc_info.c:175] skipping /usr/lib32/nvidia-384/libnvidia-tls.so.384.183
I0917 05:14:00.531433 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-ptxjitcompiler.so.384.183
I0917 05:14:00.532065 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-ml.so.384.183
I0917 05:14:00.532783 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-ifr.so.384.183
I0917 05:14:00.533423 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-glsi.so.384.183
I0917 05:14:00.533999 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-glcore.so.384.183
I0917 05:14:00.534516 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-fbc.so.384.183
I0917 05:14:00.535107 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-fatbinaryloader.so.384.183
I0917 05:14:00.535738 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-encode.so.384.183
I0917 05:14:00.536380 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-eglcore.so.384.183
I0917 05:14:00.537056 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-compiler.so.384.183
I0917 05:14:00.537739 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvidia-cfg.so.384.183
I0917 05:14:00.538381 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libnvcuvid.so.384.183
I0917 05:14:00.539044 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libGLX_nvidia.so.384.183
I0917 05:14:00.539547 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libGLESv2_nvidia.so.384.183
I0917 05:14:00.540008 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libGLESv1_CM_nvidia.so.384.183
I0917 05:14:00.540588 40446 nvc_info.c:173] selecting /usr/lib32/nvidia-384/libEGL_nvidia.so.384.183
W0917 05:14:00.540674 40446 nvc_info.c:399] missing library libnvidia-nscq.so
W0917 05:14:00.540683 40446 nvc_info.c:399] missing library libcudadebugger.so
W0917 05:14:00.540707 40446 nvc_info.c:399] missing library libnvidia-allocator.so
W0917 05:14:00.540714 40446 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0917 05:14:00.540722 40446 nvc_info.c:399] missing library libnvidia-ngx.so
W0917 05:14:00.540726 40446 nvc_info.c:399] missing library libnvidia-opticalflow.so
W0917 05:14:00.540731 40446 nvc_info.c:399] missing library libnvidia-rtcore.so
W0917 05:14:00.540736 40446 nvc_info.c:399] missing library libnvoptix.so
W0917 05:14:00.540740 40446 nvc_info.c:399] missing library libnvidia-glvkspirv.so
W0917 05:14:00.540745 40446 nvc_info.c:399] missing library libnvidia-cbl.so
W0917 05:14:00.540749 40446 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0917 05:14:00.540754 40446 nvc_info.c:403] missing compat32 library libcuda.so
W0917 05:14:00.540759 40446 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0917 05:14:00.540766 40446 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0917 05:14:00.540771 40446 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0917 05:14:00.540775 40446 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0917 05:14:00.540780 40446 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0917 05:14:00.540785 40446 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0917 05:14:00.540790 40446 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0917 05:14:00.540794 40446 nvc_info.c:403] missing compat32 library libnvoptix.so
W0917 05:14:00.540799 40446 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0917 05:14:00.540804 40446 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0917 05:14:00.541464 40446 nvc_info.c:299] selecting /usr/lib/nvidia-384/bin/nvidia-smi
I0917 05:14:00.541522 40446 nvc_info.c:299] selecting /usr/lib/nvidia-384/bin/nvidia-debugdump
I0917 05:14:00.541559 40446 nvc_info.c:299] selecting /usr/lib/nvidia-384/bin/nvidia-persistenced
I0917 05:14:00.541613 40446 nvc_info.c:299] selecting /usr/lib/nvidia-384/bin/nvidia-cuda-mps-control
I0917 05:14:00.541665 40446 nvc_info.c:299] selecting /usr/lib/nvidia-384/bin/nvidia-cuda-mps-server
W0917 05:14:00.541734 40446 nvc_info.c:425] missing binary nv-fabricmanager
W0917 05:14:00.541903 40446 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/384.183/gsp.bin
W0917 05:14:00.541938 40446 nvc_info.c:323] missing device /dev/nvidia-uvm-tools
I0917 05:14:00.541944 40446 nvc_info.c:529] listing device /dev/nvidiactl
I0917 05:14:00.541949 40446 nvc_info.c:529] listing device /dev/nvidia-uvm
I0917 05:14:00.541953 40446 nvc_info.c:529] listing device /dev/nvidia-modeset
I0917 05:14:00.541980 40446 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0917 05:14:00.542000 40446 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0917 05:14:00.542025 40446 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0917 05:14:00.542031 40446 nvc_info.c:822] requesting device information with ''
I0917 05:14:00.549426 40446 nvc_info.c:713] listing device /dev/nvidia0 (GPU-2efa8c8c-851c-3266-76c1-578547b1cfe5 at 00000000:06:00.0)
I0917 05:14:00.556631 40446 nvc_info.c:713] listing device /dev/nvidia1 (GPU-f15a69e0-3256-e324-96b7-8139169979a0 at 00000000:07:00.0)
I0917 05:14:00.564062 40446 nvc_info.c:713] listing device /dev/nvidia2 (GPU-b6c1f25b-5a9d-0ce6-1e96-1a6a0d26c436 at 00000000:0a:00.0)
I0917 05:14:00.572068 40446 nvc_info.c:713] listing device /dev/nvidia3 (GPU-d5a6c4ee-9c48-06c9-e4fe-5d7bbace6a71 at 00000000:0b:00.0)
I0917 05:14:00.579678 40446 nvc_info.c:713] listing device /dev/nvidia4 (GPU-61368161-da63-1181-4073-381358a8cb7e at 00000000:85:00.0)
I0917 05:14:00.587339 40446 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c0780f96-4e2f-fdb6-ec6d-f2db29e07add at 00000000:86:00.0)
I0917 05:14:00.595415 40446 nvc_info.c:713] listing device /dev/nvidia6 (GPU-dae39496-c66f-e636-a442-9f249615bcdf at 00000000:89:00.0)
I0917 05:14:00.603459 40446 nvc_info.c:713] listing device /dev/nvidia7 (GPU-6c7e6417-39d6-0540-417d-6de2548ad1bf at 00000000:8a:00.0)
NVRM version: 384.183
CUDA version: 9.0

Device Index: 0
Device Minor: 0
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-2efa8c8c-851c-3266-76c1-578547b1cfe5
Bus Location: 00000000:06:00.0
Architecture: 7.0

Device Index: 1
Device Minor: 1
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-f15a69e0-3256-e324-96b7-8139169979a0
Bus Location: 00000000:07:00.0
Architecture: 7.0

Device Index: 2
Device Minor: 2
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-b6c1f25b-5a9d-0ce6-1e96-1a6a0d26c436
Bus Location: 00000000:0a:00.0
Architecture: 7.0

Device Index: 3
Device Minor: 3
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-d5a6c4ee-9c48-06c9-e4fe-5d7bbace6a71
Bus Location: 00000000:0b:00.0
Architecture: 7.0

Device Index: 4
Device Minor: 4
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-61368161-da63-1181-4073-381358a8cb7e
Bus Location: 00000000:85:00.0
Architecture: 7.0

Device Index: 5
Device Minor: 5
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-c0780f96-4e2f-fdb6-ec6d-f2db29e07add
Bus Location: 00000000:86:00.0
Architecture: 7.0

Device Index: 6
Device Minor: 6
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-dae39496-c66f-e636-a442-9f249615bcdf
Bus Location: 00000000:89:00.0
Architecture: 7.0

Device Index: 7
Device Minor: 7
Model: Tesla V100-SXM2-16GB
Brand: Tesla
GPU UUID: GPU-6c7e6417-39d6-0540-417d-6de2548ad1bf
Bus Location: 00000000:8a:00.0
Architecture: 7.0
I0917 05:14:00.603619 40446 nvc.c:434] shutting down library context
I0917 05:14:00.603692 40449 rpc.c:95] terminating nvcgo rpc service
I0917 05:14:00.604356 40446 rpc.c:135] nvcgo rpc service terminated successfully
I0917 05:14:00.607329 40448 rpc.c:95] terminating driver rpc service
I0917 05:14:00.607498 40446 rpc.c:135] driver rpc service terminated successfully


- [x] Kernel version from `uname -a`
```shell
junwang@dgxone01:~$ uname -a
Linux dgxone01.i.wish.com 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
```

elezar commented 2 years ago

@junwang-wish if you are using the driver container, you need to set the root in your /etc/nvidia-container-runtime/config.toml.

Since you are launching the driver container with:

-v /run/nvidia:/run/nvidia:shared \

This means that you need to set root to:

root = "/run/nvidia/driver"
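To confirm which root the toolkit configuration actually carries, the value can be read back from the file. The following is a minimal sketch, not part of the toolkit: the function name `get_cli_root` is hypothetical, and it assumes the simple uncommented `root = "..."` form on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nvidia-container-toolkit):
# print the first uncommented `root = "..."` value found in a config file.
get_cli_root() {
  # Default path is the config file named in this thread.
  local config="${1:-/etc/nvidia-container-runtime/config.toml}"
  # Lines starting with '#' do not match, so commented-out entries are skipped.
  sed -n 's/^[[:space:]]*root[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$config" | head -n 1
}
```

Calling `get_cli_root` with no argument reads /etc/nvidia-container-runtime/config.toml; an empty result suggests that `root` is unset or still commented out.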
junwang-wish commented 2 years ago

Thanks @elezar. However, my config.toml already sets the correct path (shown below), and the same error still occurs:

junwang@dgxone01:~$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
junwang@dgxone01:~$ sudo docker run --name nvidia-driver -d --privileged --pid=host \
>   -v /run/nvidia:/run/nvidia:shared \
>   -v /var/log:/var/log \
>   --restart=unless-stopped \
>   nvidia/driver:460.32.03-ubuntu16.04
c6d34d32d0af0c47b48545e50b55b9c9d3baa1946e8461b0712c957bc71802fc
junwang@dgxone01:~$ sudo docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\\\\nnvidia-container-cli: initialization error: change root failed: no such file or directory\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
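
Since the error is "change root failed: no such file or directory", one thing worth checking is whether the configured root exists on the host and has been populated by the driver container. This is a guess at a useful diagnostic, not a confirmed cause. The sketch below uses a hypothetical helper name, `driver_root_ready`, and assumes the path /run/nvidia/driver from the config above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the driver root exists and is non-empty.
driver_root_ready() {
  local root="${1:-/run/nvidia/driver}"
  if [ ! -d "$root" ]; then
    echo "missing: $root"
    return 1
  fi
  # `ls -A` lists entries other than . and ..; empty output means nothing is mounted there.
  if [ -z "$(ls -A "$root" 2>/dev/null)" ]; then
    echo "empty: $root"
    return 1
  fi
  echo "populated: $root"
}

# Related checks on the driver container itself (name taken from the commands above):
#   sudo docker ps --filter name=nvidia-driver
#   sudo docker logs nvidia-driver
```

If the helper reports `missing` or `empty`, the driver container may not have finished setting up (or may have exited), which its logs should show.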