NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

The latest nvidia-container-toolkit causes an inconsistent CUDA version and Error 804. #291

Open · gemfield opened this issue 3 years ago

gemfield commented 3 years ago

1. Issue or feature description

The docker image gemfield/homepod:2.0-pro (Dockerfile: https://github.com/DeepVAC/MLab/blob/main/docker/homepod/Dockerfile.pro) installs the official PyTorch conda package on top of nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04.

With the following nvidia-container-toolkit version:

gemfield@ai01:~$ dpkg -l | grep container | grep nvidia
ii  libnvidia-container-tools                     1.3.0-1                                        amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                    1.3.0-1                                        amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                      3.4.0-1                                        amd64        NVIDIA container runtime
ii  nvidia-container-toolkit                      1.3.0-1                                        amd64        NVIDIA container runtime hook

nvidia-smi reports CUDA version 11.2 and torch.cuda.is_available() returns True in the docker container.

After updating nvidia-container-toolkit to:

gemfield@ai01:~$ dpkg -l | grep nvidia | grep container
ii  libnvidia-container-tools                 1.4.0-1                           amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                1.4.0-1                           amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                  3.5.0-1                           amd64        NVIDIA container runtime
ii  nvidia-container-toolkit                  1.5.1-1                           amd64        NVIDIA container runtime hook

nvidia-smi reports CUDA version 11.3 and torch.cuda.is_available() throws "Error 804: forward compatibility was attempted on non supported HW" in the container.

Both scenarios use the same hardware (GTX 1080 Ti), OS (Ubuntu 20.04), and NVIDIA driver (460.80).

2. Steps to reproduce the issue

Use docker run -it --gpus all --rm gemfield/homepod:2.0-pro bash
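
Inside the container, the mismatch can then be checked with something like the following (a quick check, not from the original report; the torch import assumes the PyTorch bundled in the image):

python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
nvidia-smi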

3. Information

kernel:

gemfield@ai01:~$ uname -a
Linux ai01 5.4.0-77-generic #86-Ubuntu SMP Thu Jun 17 02:35:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

host driver:

gemfield@ai01:~$ nvidia-smi
Tue Jun 29 16:26:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 33%   33C    P8     9W / 250W |      6MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

docker version:

gemfield@ai01:~$ docker version
Client:
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.2-0ubuntu1~20.04.2
 Built:             Tue Mar 30 21:24:57 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.2-0ubuntu1~20.04.2
  Built:            Mon Mar 29 19:10:09 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4-0ubuntu1~20.04.2
  GitCommit:        
 nvidia:
  Version:          1.0.0~rc95-0ubuntu1~20.04.1
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:

NVIDIA container library version:

gemfield@ai01:~$ nvidia-container-cli -V
version: 1.4.0
build date: 2021-04-24T14:25+00:00
build revision: 704a698b7a0ceec07a48e56c37365c741718c2df
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

related issue: https://github.com/DeepVAC/MLab/issues/43

So,

  1. Is this behaviour expected, or is it a bug?
  2. If expected, what should I do to fix this issue in gemfield/homepod:2.0-pro (Dockerfile: https://github.com/DeepVAC/MLab/blob/main/docker/homepod/Dockerfile.pro)?
elezar commented 3 years ago

Thanks for reporting this @gemfield. I don't think this is expected behaviour but I will look into it.

Would it be possible to attach a debug log from launching a container? See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#generating-debugging-logs
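
For reference, the debug log is usually enabled by uncommenting the debug entries in /etc/nvidia-container-runtime/config.toml (a minimal excerpt; exact defaults may differ between toolkit versions):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"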

gemfield commented 3 years ago

@elezar Here is the log with nvidia-container-toolkit==1.5.1-1 and libnvidia-container-tools==1.4.0-1:

gemfield@ai01:/gemfield/hostpv$ cat /var/log/nvidia-container-toolkit.log

-- WARNING, the following logs are for debugging purposes only --

I0629 12:10:31.001378 930305 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0629 12:10:31.001469 930305 nvc.c:346] using root /
I0629 12:10:31.001483 930305 nvc.c:347] using ldcache /etc/ld.so.cache
I0629 12:10:31.001494 930305 nvc.c:348] using unprivileged user 65534:65534
I0629 12:10:31.001523 930305 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0629 12:10:31.001717 930305 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I0629 12:10:31.003740 930312 nvc.c:274] loading kernel module nvidia
I0629 12:10:31.004043 930312 nvc.c:278] running mknod for /dev/nvidiactl
I0629 12:10:31.004094 930312 nvc.c:282] running mknod for /dev/nvidia0
I0629 12:10:31.004129 930312 nvc.c:282] running mknod for /dev/nvidia1
I0629 12:10:31.004162 930312 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I0629 12:10:31.012668 930312 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0629 12:10:31.012820 930312 nvc.c:214] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0629 12:10:31.015665 930312 nvc.c:292] loading kernel module nvidia_uvm
I0629 12:10:31.015790 930312 nvc.c:296] running mknod for /dev/nvidia-uvm
I0629 12:10:31.015888 930312 nvc.c:301] loading kernel module nvidia_modeset
I0629 12:10:31.016016 930312 nvc.c:305] running mknod for /dev/nvidia-modeset
I0629 12:10:31.016283 930313 driver.c:101] starting driver service
I0629 12:10:31.020636 930305 nvc_container.c:388] configuring container with 'compute utility supervised'
I0629 12:10:31.021014 930305 nvc_container.c:236] selecting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libcuda.so.465.19.01
I0629 12:10:31.021113 930305 nvc_container.c:236] selecting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libnvidia-ptxjitcompiler.so.465.19.01
I0629 12:10:31.021292 930305 nvc_container.c:408] setting pid to 930291
I0629 12:10:31.021309 930305 nvc_container.c:409] setting rootfs to /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged
I0629 12:10:31.021322 930305 nvc_container.c:410] setting owner to 0:0
I0629 12:10:31.021335 930305 nvc_container.c:411] setting bins directory to /usr/bin
I0629 12:10:31.021348 930305 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu
I0629 12:10:31.021360 930305 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu
I0629 12:10:31.021373 930305 nvc_container.c:414] setting cudart directory to /usr/local/cuda
I0629 12:10:31.021385 930305 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0629 12:10:31.021398 930305 nvc_container.c:416] setting mount namespace to /proc/930291/ns/mnt
I0629 12:10:31.021411 930305 nvc_container.c:418] setting devices cgroup to /sys/fs/cgroup/devices/docker/796b3d7686d9596acb54485478b6126d1429acaa5607e9a8a47d17eb004545ab
I0629 12:10:31.021430 930305 nvc_info.c:676] requesting driver information with ''
I0629 12:10:31.023521 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.80
I0629 12:10:31.023615 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.80
I0629 12:10:31.023691 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.80
I0629 12:10:31.023770 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.80
I0629 12:10:31.023870 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.80
I0629 12:10:31.023970 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.80
I0629 12:10:31.024044 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.80
I0629 12:10:31.024127 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.80
I0629 12:10:31.024228 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.80
I0629 12:10:31.024330 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.80
I0629 12:10:31.024401 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.80
I0629 12:10:31.024470 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.80
I0629 12:10:31.024539 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.80
I0629 12:10:31.024641 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.80
I0629 12:10:31.024743 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.80
I0629 12:10:31.024813 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.80
I0629 12:10:31.024883 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.80
I0629 12:10:31.024982 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.80
I0629 12:10:31.025053 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.80
I0629 12:10:31.025155 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.80
I0629 12:10:31.025465 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.80
I0629 12:10:31.025645 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.80
I0629 12:10:31.025716 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.80
I0629 12:10:31.025789 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.80
I0629 12:10:31.025867 930305 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.80
W0629 12:10:31.025957 930305 nvc_info.c:350] missing library libnvidia-nscq.so
W0629 12:10:31.025977 930305 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0629 12:10:31.025990 930305 nvc_info.c:350] missing library libvdpau_nvidia.so
W0629 12:10:31.026002 930305 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0629 12:10:31.026015 930305 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0629 12:10:31.026027 930305 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0629 12:10:31.026040 930305 nvc_info.c:354] missing compat32 library libcuda.so
W0629 12:10:31.026052 930305 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0629 12:10:31.026065 930305 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0629 12:10:31.026077 930305 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0629 12:10:31.026090 930305 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0629 12:10:31.026102 930305 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0629 12:10:31.026114 930305 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0629 12:10:31.026127 930305 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0629 12:10:31.026139 930305 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0629 12:10:31.026151 930305 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0629 12:10:31.026164 930305 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0629 12:10:31.026176 930305 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0629 12:10:31.026189 930305 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0629 12:10:31.026201 930305 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0629 12:10:31.026213 930305 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0629 12:10:31.026226 930305 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0629 12:10:31.026238 930305 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0629 12:10:31.026250 930305 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0629 12:10:31.026263 930305 nvc_info.c:354] missing compat32 library libnvoptix.so
W0629 12:10:31.026283 930305 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0629 12:10:31.026295 930305 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0629 12:10:31.026308 930305 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0629 12:10:31.026320 930305 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0629 12:10:31.026333 930305 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0629 12:10:31.026345 930305 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0629 12:10:31.026804 930305 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0629 12:10:31.026840 930305 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0629 12:10:31.026874 930305 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0629 12:10:31.026929 930305 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0629 12:10:31.026964 930305 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
W0629 12:10:31.027162 930305 nvc_info.c:376] missing binary nv-fabricmanager
I0629 12:10:31.027207 930305 nvc_info.c:438] listing device /dev/nvidiactl
I0629 12:10:31.027219 930305 nvc_info.c:438] listing device /dev/nvidia-uvm
I0629 12:10:31.027230 930305 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0629 12:10:31.027241 930305 nvc_info.c:438] listing device /dev/nvidia-modeset
I0629 12:10:31.027286 930305 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0629 12:10:31.027326 930305 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0629 12:10:31.027356 930305 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0629 12:10:31.027369 930305 nvc_info.c:733] requesting device information with ''
I0629 12:10:31.033849 930305 nvc_info.c:623] listing device /dev/nvidia0 (GPU-66be7464-6dee-a588-1ad6-b95e626f907b at 00000000:4b:00.0)
I0629 12:10:31.040214 930305 nvc_info.c:623] listing device /dev/nvidia1 (GPU-a92e2821-9637-a638-a485-4cb36f5f3ee1 at 00000000:4c:00.0)
I0629 12:10:31.040327 930305 nvc_mount.c:344] mounting tmpfs at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/proc/driver/nvidia
I0629 12:10:31.040943 930305 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/bin/nvidia-smi
I0629 12:10:31.041067 930305 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/bin/nvidia-debugdump
I0629 12:10:31.041161 930305 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/bin/nvidia-persistenced
I0629 12:10:31.041265 930305 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/bin/nvidia-cuda-mps-control
I0629 12:10:31.041364 930305 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/bin/nvidia-cuda-mps-server
I0629 12:10:31.041657 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.80
I0629 12:10:31.041763 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.80
I0629 12:10:31.041860 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libcuda.so.460.80
I0629 12:10:31.041997 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.80
I0629 12:10:31.042114 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.80
I0629 12:10:31.042221 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.80
I0629 12:10:31.042320 930305 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.80 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.80
I0629 12:10:31.042372 930305 nvc_mount.c:524] creating symlink /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0629 12:10:31.042580 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libcuda.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01
I0629 12:10:31.042688 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libnvidia-ptxjitcompiler.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01
I0629 12:10:31.042930 930305 nvc_mount.c:239] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/run/nvidia-persistenced/socket
I0629 12:10:31.043039 930305 nvc_mount.c:208] mounting /dev/nvidiactl at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/dev/nvidiactl
I0629 12:10:31.043088 930305 nvc_mount.c:499] whitelisting device node 195:255
I0629 12:10:31.043190 930305 nvc_mount.c:208] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/dev/nvidia-uvm
I0629 12:10:31.043232 930305 nvc_mount.c:499] whitelisting device node 235:0
I0629 12:10:31.043318 930305 nvc_mount.c:208] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/dev/nvidia-uvm-tools
I0629 12:10:31.043359 930305 nvc_mount.c:499] whitelisting device node 235:1
I0629 12:10:31.043469 930305 nvc_mount.c:208] mounting /dev/nvidia0 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/dev/nvidia0
I0629 12:10:31.043630 930305 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:4b:00.0 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/proc/driver/nvidia/gpus/0000:4b:00.0
I0629 12:10:31.043676 930305 nvc_mount.c:499] whitelisting device node 195:0
I0629 12:10:31.043778 930305 nvc_mount.c:208] mounting /dev/nvidia1 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/dev/nvidia1
I0629 12:10:31.043921 930305 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:4c:00.0 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/proc/driver/nvidia/gpus/0000:4c:00.0
I0629 12:10:31.043963 930305 nvc_mount.c:499] whitelisting device node 195:1
I0629 12:10:31.044003 930305 nvc_ldcache.c:360] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged
I0629 12:10:31.170223 930305 nvc.c:423] shutting down library context
I0629 12:10:31.171158 930313 driver.c:163] terminating driver service
I0629 12:10:31.171814 930305 driver.c:203] driver service terminated successfully

Thanks.

gemfield commented 3 years ago

With nvidia-container-toolkit == 1.3.0-1:

gemfield@ai02:~$ cat /var/log/nvidia-container-toolkit.log                   

-- WARNING, the following logs are for debugging purposes only --

I0629 12:23:01.770400 601381 nvc.c:282] initializing library context (version=1.3.0, build=16315ebdf4b9728e899f615e208b50c41d7a5d15)
I0629 12:23:01.770499 601381 nvc.c:256] using root /
I0629 12:23:01.770515 601381 nvc.c:257] using ldcache /etc/ld.so.cache
I0629 12:23:01.770529 601381 nvc.c:258] using unprivileged user 65534:65534
I0629 12:23:01.770559 601381 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0629 12:23:01.770769 601381 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I0629 12:23:01.773340 601386 nvc.c:192] loading kernel module nvidia
I0629 12:23:01.773805 601386 nvc.c:204] loading kernel module nvidia_uvm
I0629 12:23:01.774102 601386 nvc.c:212] loading kernel module nvidia_modeset
I0629 12:23:01.774603 601387 driver.c:101] starting driver service
I0629 12:23:01.778819 601381 nvc_container.c:364] configuring container with 'compute utility supervised'
I0629 12:23:01.783208 601381 nvc_container.c:384] setting pid to 601370
I0629 12:23:01.783241 601381 nvc_container.c:385] setting rootfs to /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged
I0629 12:23:01.783255 601381 nvc_container.c:386] setting owner to 0:0
I0629 12:23:01.783269 601381 nvc_container.c:387] setting bins directory to /usr/bin
I0629 12:23:01.783283 601381 nvc_container.c:388] setting libs directory to /usr/lib/x86_64-linux-gnu
I0629 12:23:01.783296 601381 nvc_container.c:389] setting libs32 directory to /usr/lib/i386-linux-gnu
I0629 12:23:01.783309 601381 nvc_container.c:390] setting cudart directory to /usr/local/cuda
I0629 12:23:01.783323 601381 nvc_container.c:391] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0629 12:23:01.783336 601381 nvc_container.c:392] setting mount namespace to /proc/601370/ns/mnt
I0629 12:23:01.783350 601381 nvc_container.c:394] setting devices cgroup to /sys/fs/cgroup/devices/docker/b2530c09ad0e06e16d67a31d5962bf2853f41789b19b6adc7dde5161ad2e4f34
I0629 12:23:01.783372 601381 nvc_info.c:680] requesting driver information with ''
I0629 12:23:01.786508 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.460.80
I0629 12:23:01.786779 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.80
I0629 12:23:01.787260 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.460.80
I0629 12:23:01.787783 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.80
I0629 12:23:01.788337 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.460.80
I0629 12:23:01.788867 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.80
I0629 12:23:01.789353 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.460.80
I0629 12:23:01.789439 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.80
I0629 12:23:01.790034 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.460.80
I0629 12:23:01.790557 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.460.80
I0629 12:23:01.791036 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.80
I0629 12:23:01.791512 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.80
I0629 12:23:01.792037 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.460.80
I0629 12:23:01.792531 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.460.80
I0629 12:23:01.793060 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.80
I0629 12:23:01.793661 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.80
I0629 12:23:01.793753 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.80
I0629 12:23:01.794303 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.460.80
I0629 12:23:01.794748 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.80
I0629 12:23:01.795268 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.460.80
I0629 12:23:01.795581 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.460.80
I0629 12:23:01.795865 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.460.80
I0629 12:23:01.796398 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.460.80
I0629 12:23:01.796788 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.460.80
I0629 12:23:01.797313 601381 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.460.80
W0629 12:23:01.797369 601381 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0629 12:23:01.797386 601381 nvc_info.c:350] missing library libvdpau_nvidia.so
W0629 12:23:01.797399 601381 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0629 12:23:01.797413 601381 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0629 12:23:01.797427 601381 nvc_info.c:354] missing compat32 library libcuda.so
W0629 12:23:01.797440 601381 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0629 12:23:01.797454 601381 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0629 12:23:01.797467 601381 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0629 12:23:01.797481 601381 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0629 12:23:01.797494 601381 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0629 12:23:01.797508 601381 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0629 12:23:01.797521 601381 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0629 12:23:01.797535 601381 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0629 12:23:01.797548 601381 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0629 12:23:01.797562 601381 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0629 12:23:01.797575 601381 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0629 12:23:01.797587 601381 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0629 12:23:01.797600 601381 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0629 12:23:01.797614 601381 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0629 12:23:01.797627 601381 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0629 12:23:01.797641 601381 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0629 12:23:01.797654 601381 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0629 12:23:01.797667 601381 nvc_info.c:354] missing compat32 library libnvoptix.so
W0629 12:23:01.797681 601381 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0629 12:23:01.797694 601381 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0629 12:23:01.797707 601381 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0629 12:23:01.797720 601381 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0629 12:23:01.797733 601381 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0629 12:23:01.797746 601381 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0629 12:23:01.798199 601381 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0629 12:23:01.798240 601381 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0629 12:23:01.798278 601381 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0629 12:23:01.798317 601381 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0629 12:23:01.798355 601381 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
I0629 12:23:01.798409 601381 nvc_info.c:438] listing device /dev/nvidiactl
I0629 12:23:01.798424 601381 nvc_info.c:438] listing device /dev/nvidia-uvm
I0629 12:23:01.798437 601381 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0629 12:23:01.798451 601381 nvc_info.c:438] listing device /dev/nvidia-modeset
I0629 12:23:01.798503 601381 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W0629 12:23:01.798550 601381 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0629 12:23:01.798565 601381 nvc_info.c:745] requesting device information with ''
I0629 12:23:01.811183 601381 nvc_info.c:628] listing device /dev/nvidia1 (GPU-22515840-d7f0-d7cb-f55c-8a8c5b159e24 at 00000000:01:00.0)
I0629 12:23:01.817646 601381 nvc_info.c:628] listing device /dev/nvidia0 (GPU-7acfc40f-92a6-2d54-2cd5-87db3b17dab7 at 00000000:02:00.0)
I0629 12:23:01.817808 601381 nvc_mount.c:344] mounting tmpfs at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/proc/driver/nvidia
I0629 12:23:01.820007 601381 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/bin/nvidia-smi
I0629 12:23:01.820375 601381 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/bin/nvidia-debugdump
I0629 12:23:01.821187 601381 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/bin/nvidia-persistenced
I0629 12:23:01.821317 601381 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/bin/nvidia-cuda-mps-control
I0629 12:23:01.821427 601381 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/bin/nvidia-cuda-mps-server
I0629 12:23:01.822969 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.80
I0629 12:23:01.823235 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.460.80
I0629 12:23:01.823468 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libcuda.so.460.80
I0629 12:23:01.823720 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.460.80
I0629 12:23:01.823949 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.80
I0629 12:23:01.824179 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.460.80
I0629 12:23:01.824408 601381 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.80 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.460.80
I0629 12:23:01.824462 601381 nvc_mount.c:524] creating symlink /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0629 12:23:01.826289 601381 nvc_mount.c:239] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/run/nvidia-persistenced/socket
I0629 12:23:01.826395 601381 nvc_mount.c:208] mounting /dev/nvidiactl at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/dev/nvidiactl
I0629 12:23:01.826457 601381 nvc_mount.c:499] whitelisting device node 195:255
I0629 12:23:01.826548 601381 nvc_mount.c:208] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/dev/nvidia-uvm
I0629 12:23:01.826586 601381 nvc_mount.c:499] whitelisting device node 236:0
I0629 12:23:01.826663 601381 nvc_mount.c:208] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/dev/nvidia-uvm-tools
I0629 12:23:01.826700 601381 nvc_mount.c:499] whitelisting device node 236:1
I0629 12:23:01.826798 601381 nvc_mount.c:208] mounting /dev/nvidia1 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/dev/nvidia1
I0629 12:23:01.826939 601381 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0629 12:23:01.826980 601381 nvc_mount.c:499] whitelisting device node 195:1
I0629 12:23:01.827066 601381 nvc_mount.c:208] mounting /dev/nvidia0 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/dev/nvidia0
I0629 12:23:01.827193 601381 nvc_mount.c:412] mounting /proc/driver/nvidia/gpus/0000:02:00.0 at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged/proc/driver/nvidia/gpus/0000:02:00.0
I0629 12:23:01.827234 601381 nvc_mount.c:499] whitelisting device node 195:0
I0629 12:23:01.827284 601381 nvc_ldcache.c:359] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/0e4d7573777b632e390be3c3761776672ae75356d029ef5bef711a46ced8884b/merged
I0629 12:23:02.789252 601381 nvc.c:337] shutting down library context
I0629 12:23:02.813757 601387 driver.c:156] terminating driver service
I0629 12:23:02.814441 601381 driver.c:196] driver service terminated successfully
gemfield commented 3 years ago

It seems the new version of nvidia-container-toolkit performs extra mounts:

I0629 12:10:31.042580 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libcuda.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01
I0629 12:10:31.042688 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libnvidia-ptxjitcompiler.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01
klueska commented 3 years ago

A bug was fixed in v1.3.1 where the forward compatibility libraries were not being injected properly into a container: https://github.com/NVIDIA/libnvidia-container/commit/5ae7360cabd34eec7d7dde6369bd7a87403b839d

It seems that your setup somehow relied on this bug not injecting these libraries in order to work correctly (which is strange, I would have expected it to error out with a different error).

Now that the forward compatibility libraries are being injected, you are getting the error:

Error 804: forward compatibility was attempted on non supported HW

Because forward compatibility is not supported on the GTX 1080 Ti (only Tesla cards are supported, as mentioned here): https://docs.nvidia.com/deploy/cuda-compatibility/index.html#supported-gpus

Are you sure you were using a host with 11.2 installed and a container with 11.3 installed when running with libnvidia-container-1.3.0? That alone should have caused its own problems once you started trying to actually use CUDA (though maybe the simple torch.cuda.is_available() call would have still passed in this setup).

elezar commented 3 years ago

Thanks for the links @klueska. I missed that forward compatibility is not supported by non-Tesla devices. The images mentioned do contain the 11.3 compat libs, and their use is confirmed by the 1.5.1 logs that were provided.

gemfield commented 3 years ago

To reproduce this issue, I used two host machines (ai01, ai02) with different nvidia-container-toolkit versions. Both ai01 and ai02 have the same OS, NVIDIA driver, and GTX 1080 Ti CUDA device.

host

# ai01 (nvidia-container-toolkit==1.5.1-1)
gemfield@ai01:~$ nvidia-smi
Wed Jun 30 00:15:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 33%   34C    P8     9W / 250W |      6MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

# ai02 (nvidia-container-toolkit == 1.3.0-1)
gemfield@ai02:~$ nvidia-smi
Wed Jun 30 00:03:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 66%   83C    P2   237W / 250W |  10491MiB / 11177MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

start the container

Run the commands below on ai01 and ai02, respectively:

gemfield@ai01:~$ docker run -it --rm --gpus all gemfield/homepod:2.0-pro bash
gemfield@ai02:~$ docker run -it --rm --gpus all gemfield/homepod:2.0-pro bash

compile and run the test code

gemfield.cpp:

#include <stdio.h>
#include <cuda_runtime.h>
int main() {
    int device = 0;
    int gpuDeviceCount = 0;
    struct cudaDeviceProp properties;

    cudaError_t cudaResultCode = cudaGetDeviceCount(&gpuDeviceCount);

    if (cudaResultCode == cudaSuccess){
        cudaGetDeviceProperties(&properties, device);
        printf("%d GPU CUDA devices(s)(%d)\n", gpuDeviceCount, properties.major);
        printf("\t Product Name: %s\n"          , properties.name);
        printf("\t TotalGlobalMem: %ld MB\n"    , properties.totalGlobalMem/(1024^2));
        printf("\t GPU Count: %d\n"             , properties.multiProcessorCount);
        printf("\t Kernels found: %d\n"         , properties.concurrentKernels);
        return 0;
    }
    printf("\t gemfield error: %d\n",cudaResultCode);

Compile this code in the ai01 container and the ai02 container, respectively:

#ai01 container
root@1d0d6b4ec38d:/.gemfield_install# g++ -I/usr/local/cuda-11.3/targets/x86_64-linux/include/ gemfield.cpp -o gemfield -L/usr/local/cuda-11.3/targets/x86_64-linux/lib/ -lcudart
#ai02 container
root@9b438063576f:/.gemfield_install# g++ -I/usr/local/cuda-11.3/targets/x86_64-linux/include/ gemfield.cpp -o gemfield -L/usr/local/cuda-11.3/targets/x86_64-linux/lib/ -lcudart

Run the executable in each container:

#in ai01 container
root@1d0d6b4ec38d:/.gemfield_install# ./gemfield 
     gemfield error: 804

#in ai02 container
root@9b438063576f:/.gemfield_install# ./gemfield 
2 GPU CUDA devices(s)(6)
     Product Name: GeForce GTX 1080 Ti
     TotalGlobalMem: 11423129 MB
     GPU Count: 28
     Kernels found: 1

nvidia-smi in containers

#in ai01 container
root@1d0d6b4ec38d:/.gemfield_install# nvidia-smi
Wed Jun 30 00:35:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 33%   34C    P8     9W / 250W |      6MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
#in ai02 container
root@9b438063576f:/.gemfield_install# nvidia-smi
Wed Jun 30 00:35:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 67%   78C    P2    82W / 250W |  10491MiB / 11177MiB |      0%      Default |
|                               |                      |                  N/A |
gemfield commented 3 years ago

@klueska Are you saying the ai02 container "worked" only because nvidia-container-toolkit == 1.3.0-1 happened to contain the bug? And that by "worked" it should actually have thrown a different error? But then why does the code below work in the ai02 container?

import torch

x = torch.randn((1, 3, 5, 5), device="cuda")
# print x
tensor([[[[ 1.2785,  0.1989, -0.1807,  0.0724, -0.0832],
          [-0.8234, -0.7329, -0.1798, -0.7472, -1.0487],
          [-0.6873,  0.9794,  1.6386,  2.3355, -0.7024],
          [ 0.4769,  0.2187, -2.1455,  0.1791,  0.1882],
          [ 1.3050,  0.9821, -0.3730, -1.4119, -1.4276]],

         [[ 0.7505, -0.6974,  0.5616, -0.9638,  0.7917],
          [-0.8177,  1.3899, -1.6273, -0.0344,  0.2316],
          [ 0.1388,  0.2799,  2.2887, -1.0132, -0.0351],
          [-0.9957, -0.6478,  0.2437,  0.1609,  0.4821],
          [ 0.1865, -0.4291,  0.1917,  1.5221,  0.6073]],

         [[-2.6149, -2.3429,  0.2260,  1.6618, -0.4434],
          [ 1.3010, -0.1140,  0.8159, -0.2531, -0.5990],
          [-1.3234, -0.2465, -1.3396, -0.3108, -1.0227],
          [-0.4442,  0.7686,  0.7335, -0.7267, -1.5234],
          [-1.5750, -0.1051, -0.4798, -0.9625,  0.1764]]]], device='cuda:0')
# add ops
x += 1
# print x
tensor([[[[ 2.2785,  1.1989,  0.8193,  1.0724,  0.9168],
          [ 0.1766,  0.2671,  0.8202,  0.2528, -0.0487],
          [ 0.3127,  1.9794,  2.6386,  3.3355,  0.2976],
          [ 1.4769,  1.2187, -1.1455,  1.1791,  1.1882],
          [ 2.3050,  1.9821,  0.6270, -0.4119, -0.4276]],

         [[ 1.7505,  0.3026,  1.5616,  0.0362,  1.7917],
          [ 0.1823,  2.3899, -0.6273,  0.9656,  1.2316],
          [ 1.1388,  1.2799,  3.2887, -0.0132,  0.9649],
          [ 0.0043,  0.3522,  1.2437,  1.1609,  1.4821],
          [ 1.1865,  0.5709,  1.1917,  2.5221,  1.6073]],

         [[-1.6149, -1.3429,  1.2260,  2.6618,  0.5566],
          [ 2.3010,  0.8860,  1.8159,  0.7469,  0.4010],
          [-0.3234,  0.7535, -0.3396,  0.6892, -0.0227],
          [ 0.5558,  1.7686,  1.7335,  0.2733, -0.5234],
          [-0.5750,  0.8949,  0.5202,  0.0375,  1.1764]]]], device='cuda:0')
gemfield commented 3 years ago

After further debugging (https://zhuanlan.zhihu.com/p/361545761), I found the following:

ai01 container:

root@1d0d6b4ec38d:/.gemfield_install# ls -l /usr/lib/x86_64-linux-gnu/libcuda.so*
lrwxrwxrwx 1 root root       12 Jun 30 00:09 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Jun 30 00:09 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.465.19.01
-rw-r--r-- 1 root root 21795104 May  7 15:00 /usr/lib/x86_64-linux-gnu/libcuda.so.460.80
-rw-r--r-- 1 root root 22033824 Mar 19 16:07 /usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01

ai02 container:

root@9b438063576f:/.gemfield_install# ls -l /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root       12 Jun 30 00:05 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       17 Jun 30 00:05 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.460.80
-rw-r--r-- 1 root root 21795104 May  7 15:00 /usr/lib/x86_64-linux-gnu/libcuda.so.460.80

With both host ai01 and host ai02 having the libcuda.so.460.80 driver installed, the ai01 container now hits the CUDA forward compatibility issue, while the ai02 container falls into the "newer CUDA toolkit + older CUDA driver + older kernel-mode GPU driver" case, which is the category called "CUDA enhanced compatibility":

# quote from https://docs.nvidia.com/deploy/cuda-compatibility/index.html#faq
What about new features introduced in minor releases of CUDA? How does a developer build an application using newer CUDA Toolkits (e.g. 11.x) work on a system with a CUDA 11.0 driver (R450)?
By using new CUDA versions, users can benefit from new CUDA programming model APIs, compiler optimizations and math library features. There are some caveats:
A subset of CUDA APIs don’t need a new driver and they can all be used without any driver dependencies. For example, async copy APIs introduced in 11.1 do not need a new driver.
To use other CUDA APIs introduced in a minor release (that require a new driver), one would have to implement fallbacks or fail gracefully. This situation is not different from what is available today where developers use macros to compile out features based on CUDA versions. Users should refer to the CUDA headers and documentation for new CUDA APIs introduced in a release.

So, to eliminate the 804 error in the ai01 container, I simply relink libcuda.so.1 to libcuda.so.460.80, which turns the CUDA forward compatibility case (not supported on non-Tesla devices) into "CUDA enhanced compatibility" (supported on both Tesla and non-Tesla devices):

# on ai01 container
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so.460.80 /usr/lib/x86_64-linux-gnu/libcuda.so.1
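
After relinking, a quick sanity check inside the ai01 container could look like this (paths taken from the listings above; the expected results are assumptions, not captured output):

ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.1   # should now point at libcuda.so.460.80
./gemfield                                     # should list the GPUs instead of returning error 804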

@elezar Could you confirm whether this is the right way? Thanks.

klueska commented 3 years ago

The "CUDA enhanced compatibility" guarantee only supports running newer CUDA versions on drivers that support the same major CUDA version, but not on drivers that support older major CUDA versions. I.e. it guarantees that an application built with CUDA 11.2 will run on a driver with libcuda.so for CUDA 11.2, 11.1 or 11.0 installed, but does not guarantee it will run on a driver with libcuda for CUDA 10.x installed.

Since your use case falls under the supported scenario (driver for 11.2 and container with 11.3), I would actually expect things to work without doing anything special.

I think what is going on is that you have the forward-compatibility libraries installed on your host, which is causing them to be injected into the container, even though the GPU hardware you have installed on the machine doesn't support forward-compatibility. Ideally, libnvidia-container would be smart enough to detect that the forward compat libs should not be injected if the underlying hardware does not support them, but (if I remember correctly) it is not.

Can you check if you have a package installed on your host called cuda-compat-11-3 (or similar) and if so, remove it. That should remove the forward compat libs from your host, preventing them from being injected into the container at runtime.
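
Something along these lines on the host should show whether the package is present (the exact package name is an assumption; adjust to whatever dpkg actually reports):

dpkg -l | grep -i cuda-compat
sudo apt-get purge cuda-compat-11-3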

gemfield commented 3 years ago

@klueska There are no cuda-compat packages on the ai01 host OS (Ubuntu 20.04):

gemfield@ai01:~$ dpkg -l | grep -i cuda | grep -i compat
gemfield@ai01:~$ dpkg -l | grep -i nvidia | grep -i compat
gemfield@ai01:~$ find /usr -name libcuda.so.465*

From the debug log:

I0629 12:10:31.042580 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libcuda.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01
I0629 12:10:31.042688 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libnvidia-ptxjitcompiler.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01

it seems that the cuda-compat libraries are injected into /usr/lib/x86_64-linux-gnu/ from another directory of the image itself (/usr/local/cuda-11.3/compat/), rather than from the ai01 host.

elezar commented 3 years ago

With regards to:

I think what is going on is that you have the forward-compatibility libraries installed on your host, which is causing them to be injected into the container, even though the GPU hardware you have installed on the machine doesn't support forward-compatibility. Ideally, libnvidia-container would be smart enough to detect that the forward compat libs should not be injected if the underlying hardware does not support them, but (if I remember correctly) it is not.

From the log messages it seems as if these forward-compatibility libraries are present in the container image being used (hence they are being mounted from the docker FS root to the docker root):

I0629 12:10:31.042580 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libcuda.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01
I0629 12:10:31.042688 930305 nvc_mount.c:112] mounting /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/local/cuda-11.3/compat/libnvidia-ptxjitcompiler.so.465.19.01 at /var/lib/docker/overlay2/dd8d1c44a88df34c3257d7d6cc323c206a57a70abb108ebc389456002466b76b/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.465.19.01

Running a container based on the image and checking that path confirms that these are present there (on my Mac with no CUDA GPU):

~ docker run --rm -ti nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 ls -alt /usr/local/cuda-11.3/compat/
total 32056
drwxr-xr-x 1 root root     4096 Jul  2 04:00 ..
drwxr-xr-x 2 root root     4096 Jul  2 03:55 .
lrwxrwxrwx 1 root root       12 Mar 19 12:05 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Mar 19 12:05 libcuda.so.1 -> libcuda.so.465.19.01
lrwxrwxrwx 1 root root       37 Mar 19 12:05 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.465.19.01
-rw-r--r-- 1 root root 22033824 Mar 19 08:07 libcuda.so.465.19.01
-rw-r--r-- 1 root root 10779128 Mar 19 07:56 libnvidia-ptxjitcompiler.so.465.19.01

@klueska thinking about this now, is this expected behaviour? Do we expect an image vendor to ship the forward-compatibility libraries in the container, or should we ALWAYS be mounting them from the host (if present)?

klueska commented 3 years ago

@elezar I think it would be OK (possibly even preferable) to bundle them in the container image so long as libnvidia-container was smart enough to only re-mount them on hardware that was compatible with them.

klueska commented 3 years ago

So ultimately this does seem like a bug (or at least a limitation) of libnvidia-container.

klueska commented 3 years ago

For this specific use-case (since you are building your own docker image to wrap nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04), I would suggest removing the compat package from the container image.
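
A minimal sketch of what that could look like in a derived image (assuming the compat files come from the cuda-compat-11-3 package under /usr/local/cuda-11.3/compat; this is not the exact change made in the MLab Dockerfile):

FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
# drop the forward-compatibility libraries so there is nothing for the runtime to inject
RUN apt-get purge -y cuda-compat-11-3 || rm -rf /usr/local/cuda-11.3/compat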

gemfield commented 3 years ago

@klueska Thanks, I will give it a try. Meanwhile, will this be fixed/enhanced in the next libnvidia-container release?

guillaumekln commented 3 years ago

Hi,

Is this the same issue as https://github.com/NVIDIA/libnvidia-container/issues/138? I can confirm that removing the cuda-compat-X package is a way to work around the issue.

gemfield commented 3 years ago

@guillaumekln I think so. And I have already used the same workaround as the solution in the MLab HomePod project: https://github.com/DeepVAC/MLab/blob/6479b74dcb9fe3d598658f41f6f1c6dec7fd71a4/docker/homepod/Dockerfile.pro#L9