NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs

Nvidia/CUDA Docker containers nested inside LXD container fail if running the LXD container as unprivileged (OK, if privileged) #290

waldekkot opened this issue 3 years ago

waldekkot commented 3 years ago

### 1. Issue or feature description

I am trying to run an NVIDIA/CUDA Docker container from within an LXD container (so, a nested scenario). It seems the only way to get such an NVIDIA Docker container working is to make the LXD container a privileged one. Inside the privileged LXD container, the following works perfectly fine:

docker run --rm --gpus all --ipc=host nvidia/cuda:11.4.1-base-ubuntu20.04 nvidia-smi

If I run the very same LXD container as unprivileged, the nested CUDA Docker container fails with the error below. Other (non-NVIDIA/CUDA) Docker containers work fine.

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: write error: /sys/fs/cgroup/devices/docker/098ad8bf1fdcf4ab72091864933fbc8b67a8f0b30746681ba6ef4082c23245b9/devices.allow: operation not permitted: unknown.

On the LXD discussion group, it was suggested to make the error "non-fatal" in the case of nested containers: https://discuss.linuxcontainers.org/t/nvidia-and-docker-in-lxd/12136

### 2. Steps to reproduce the issue

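- create the unprivileged LXD container itself (the exact commands weren't included in the report; a minimal sketch, assuming LXD's standard `nvidia.runtime` GPU passthrough and `security.nesting` enabled so Docker can run inside — container name and image release are placeholders taken from elsewhere in this thread):

```
# launch an Ubuntu container (unprivileged is the LXD default)
lxc launch ubuntu:21.04 demo3
# expose the host NVIDIA driver and a GPU to the container
lxc config set demo3 nvidia.runtime true
lxc config device add demo3 gpu0 gpu
# allow nested containers so Docker can run inside
lxc config set demo3 security.nesting true
lxc restart demo3
lxc exec demo3 -- bash
```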

- install Docker inside the LXD container:

apt install apt-transport-https ca-certificates curl gnupg lsb-release -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io

- "typical" Docker containers work perfectly fine:

docker run --rm hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:

  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)
  3. The Docker daemon created a new container from that image which runs the executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID: https://hub.docker.com/

For more examples and ideas, visit: https://docs.docker.com/get-started/

- install nvidia-docker2:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=ubuntu20.04 && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt update
apt install nvidia-docker2 -y

- restart the Docker daemon and verify Docker is working fine:

systemctl restart docker

docker version

Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:53:57 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:06 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.1-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 20.10.8
 Storage Driver: btrfs
  Build Version: Btrfs v5.10.1
  Library Version: 102
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e25210fe30a0a703442421b0f60afac609f950a3
 runc version: v1.0.1-0-g4144b63
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.11.0-34-generic
 Operating System: Ubuntu 21.04
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 62.75GiB
 Name: demo4
 ID: PTI6:4Q7T:PMWD:XC2L:5W2X:PMLV:7QRG:S3ZW:KMII:GCAY:PC7L:5P3X
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

- Nvidia/CUDA Docker containers fail in an unprivileged LXD container, e.g.:

docker run --rm --gpus all --ipc=host nvidia/cuda:11.4.1-base-ubuntu20.04 nvidia-smi

Unable to find image 'nvidia/cuda:11.4.1-base-ubuntu20.04' locally
11.4.1-base-ubuntu20.04: Pulling from nvidia/cuda
16ec32c2132b: Pull complete
d795373d028a: Pull complete
aa1a4de63ca7: Pull complete
99fe2b653f7a: Pull complete
151e201e5dbc: Pull complete
Digest: sha256:79b4fdc93e6e98fbb1770893b497d6528ab19cf056d15e366787135ca18b7565
Status: Downloaded newer image for nvidia/cuda:11.4.1-base-ubuntu20.04
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: write error: /sys/fs/cgroup/devices/docker/333969e7089a6ca8b93c493b34741c8e17d8d6fb5acaa16031c4a8fb54814286/devices.allow: operation not permitted: unknown.
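The write that fails is to the cgroup v1 `devices` controller, which an unprivileged LXD container is not allowed to modify. A hedged sketch of seeing this directly from inside the LXD container (the docker cgroup path hash will differ on each run; major number 195 is the NVIDIA character device, as in the "whitelisting device node 195:255" line in the toolkit log below):

```
# confirm the v1 devices controller is in use ("Cgroup Version: 1" per docker info)
grep devices /proc/self/cgroup
# the GPU device nodes are already present and usable...
ls -l /dev/nvidia*
# ...but writing a device rule is denied from an unprivileged container,
# which is exactly the write the nvidia hook attempts
echo 'c 195:* rwm' > /sys/fs/cgroup/devices/devices.allow
# -> bash: /sys/fs/cgroup/devices/devices.allow: Operation not permitted
```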

- making the LXD container privileged:

exit
lxc stop demo3
lxc config set demo3 security.privileged=true
lxc start demo3

- now the very same Nvidia Docker container runs fine within the privileged LXD container:

lxc exec demo3 -- bash
docker run --rm --gpus all --ipc=host nvidia/cuda:11.4.1-base-ubuntu20.04 nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 16%   28C    P8    16W / 250W |    178MiB / 11175MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

### 3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

 - [X] Some nvidia-container information: `nvidia-container-cli -k -d /dev/tty info`

nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0913 20:19:32.954928 591 nvc.c:372] initializing library context (version=1.5.0, build=4699c1b8b4991b6d869ea403e109291653bb040b) I0913 20:19:32.955339 591 nvc.c:346] using root / I0913 20:19:32.955386 591 nvc.c:347] using ldcache /etc/ld.so.cache I0913 20:19:32.955422 591 nvc.c:348] using unprivileged user 65534:65534 I0913 20:19:32.955509 591 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0913 20:19:32.956299 591 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment W0913 20:19:32.956416 591 nvc.c:249] skipping kernel modules load due to user namespace I0913 20:19:32.956870 592 driver.c:101] starting driver service I0913 20:19:32.958866 591 nvc_info.c:750] requesting driver information with '' I0913 20:19:32.959672 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.63.01 I0913 20:19:32.959733 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.63.01 I0913 20:19:32.959775 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.63.01 I0913 20:19:32.959809 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.63.01 I0913 20:19:32.959857 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.63.01 I0913 20:19:32.959900 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.63.01 I0913 20:19:32.959931 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.63.01 I0913 20:19:32.959965 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.63.01 I0913 20:19:32.960007 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.63.01 I0913 20:19:32.960053 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.63.01 I0913 20:19:32.960081 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.63.01 I0913 20:19:32.960115 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.63.01 I0913 20:19:32.960144 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.63.01 I0913 20:19:32.960189 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.63.01 I0913 20:19:32.960231 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.63.01 I0913 20:19:32.960260 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.63.01 I0913 20:19:32.960293 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.63.01 I0913 20:19:32.960335 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.63.01 I0913 20:19:32.960368 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.63.01 I0913 20:19:32.960414 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.63.01 I0913 20:19:32.960497 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.63.01 I0913 20:19:32.960557 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.63.01 I0913 20:19:32.960590 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.63.01 I0913 20:19:32.960617 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.63.01 I0913 20:19:32.960645 591 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.63.01 W0913 20:19:32.960660 591 nvc_info.c:392] missing library libnvidia-nscq.so W0913 
20:19:32.960665 591 nvc_info.c:392] missing library libnvidia-fatbinaryloader.so W0913 20:19:32.960669 591 nvc_info.c:392] missing library libvdpau_nvidia.so W0913 20:19:32.960674 591 nvc_info.c:396] missing compat32 library libnvidia-ml.so W0913 20:19:32.960678 591 nvc_info.c:396] missing compat32 library libnvidia-cfg.so W0913 20:19:32.960682 591 nvc_info.c:396] missing compat32 library libnvidia-nscq.so W0913 20:19:32.960687 591 nvc_info.c:396] missing compat32 library libcuda.so W0913 20:19:32.960691 591 nvc_info.c:396] missing compat32 library libnvidia-opencl.so W0913 20:19:32.960695 591 nvc_info.c:396] missing compat32 library libnvidia-ptxjitcompiler.so W0913 20:19:32.960700 591 nvc_info.c:396] missing compat32 library libnvidia-fatbinaryloader.so W0913 20:19:32.960704 591 nvc_info.c:396] missing compat32 library libnvidia-allocator.so W0913 20:19:32.960708 591 nvc_info.c:396] missing compat32 library libnvidia-compiler.so W0913 20:19:32.960714 591 nvc_info.c:396] missing compat32 library libnvidia-ngx.so W0913 20:19:32.960719 591 nvc_info.c:396] missing compat32 library libvdpau_nvidia.so W0913 20:19:32.960724 591 nvc_info.c:396] missing compat32 library libnvidia-encode.so W0913 20:19:32.960727 591 nvc_info.c:396] missing compat32 library libnvidia-opticalflow.so W0913 20:19:32.960731 591 nvc_info.c:396] missing compat32 library libnvcuvid.so W0913 20:19:32.960735 591 nvc_info.c:396] missing compat32 library libnvidia-eglcore.so W0913 20:19:32.960739 591 nvc_info.c:396] missing compat32 library libnvidia-glcore.so W0913 20:19:32.960744 591 nvc_info.c:396] missing compat32 library libnvidia-tls.so W0913 20:19:32.960749 591 nvc_info.c:396] missing compat32 library libnvidia-glsi.so W0913 20:19:32.960753 591 nvc_info.c:396] missing compat32 library libnvidia-fbc.so W0913 20:19:32.960757 591 nvc_info.c:396] missing compat32 library libnvidia-ifr.so W0913 20:19:32.960762 591 nvc_info.c:396] missing compat32 library libnvidia-rtcore.so W0913 20:19:32.960765 591 nvc_info.c:396] missing compat32 library libnvoptix.so W0913 20:19:32.960769 591 nvc_info.c:396] missing compat32 library libGLX_nvidia.so W0913 20:19:32.960773 591 nvc_info.c:396] missing compat32 library libEGL_nvidia.so W0913 20:19:32.960778 591 nvc_info.c:396] missing compat32 library libGLESv2_nvidia.so W0913 20:19:32.960783 591 nvc_info.c:396] missing compat32 library libGLESv1_CM_nvidia.so W0913 20:19:32.960788 591 nvc_info.c:396] missing compat32 library libnvidia-glvkspirv.so W0913 20:19:32.960792 591 nvc_info.c:396] missing compat32 library libnvidia-cbl.so I0913 20:19:32.961010 591 nvc_info.c:297] selecting /usr/bin/nvidia-smi I0913 20:19:32.961030 591 nvc_info.c:297] selecting /usr/bin/nvidia-debugdump I0913 20:19:32.961046 591 nvc_info.c:297] selecting /usr/bin/nvidia-persistenced I0913 20:19:32.961072 591 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-control I0913 20:19:32.961092 591 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-server W0913 20:19:32.961139 591 nvc_info.c:418] missing binary nv-fabricmanager I0913 20:19:32.961177 591 nvc_info.c:512] listing device /dev/nvidiactl I0913 20:19:32.961184 591 nvc_info.c:512] listing device /dev/nvidia-uvm I0913 20:19:32.961191 591 nvc_info.c:512] listing device /dev/nvidia-uvm-tools I0913 20:19:32.961196 591 nvc_info.c:512] listing device /dev/nvidia-modeset W0913 20:19:32.961223 591 nvc_info.c:342] missing ipc /var/run/nvidia-persistenced/socket W0913 20:19:32.961247 591 nvc_info.c:342] missing ipc /var/run/nvidia-fabricmanager/socket W0913 20:19:32.961264 
591 nvc_info.c:342] missing ipc /tmp/nvidia-mps I0913 20:19:32.961270 591 nvc_info.c:805] requesting device information with '' I0913 20:19:32.966964 591 nvc_info.c:697] listing device /dev/nvidia0 (GPU-06986d8e-47c3-467c-c6bc-0a30ae3fbd30 at 00000000:01:00.0) NVRM version: 470.63.01 CUDA version: 11.4

Device Index: 0 Device Minor: 0 Model: NVIDIA GeForce GTX 1080 Ti Brand: GeForce GPU UUID: GPU-06986d8e-47c3-467c-c6bc-0a30ae3fbd30 Bus Location: 00000000:01:00.0 Architecture: 6.1 I0913 20:19:32.966997 591 nvc.c:423] shutting down library context I0913 20:19:32.967215 592 driver.c:163] terminating driver service I0913 20:19:32.967500 591 driver.c:203] driver service terminated successfully

 - [X] Kernel version from `uname -a`

Linux demo4 5.11.0-34-generic #36-Ubuntu SMP Thu Aug 26 19:22:09 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

 - [X] Any relevant kernel output lines from `dmesg`

[ 1517.263463] docker0: port 1(vethce5fe56) entered blocking state
[ 1517.263475] docker0: port 1(vethce5fe56) entered disabled state
[ 1517.263652] device vethce5fe56 entered promiscuous mode
[ 1517.622529] docker0: port 1(vethce5fe56) entered disabled state
[ 1517.628590] device vethce5fe56 left promiscuous mode
[ 1517.628603] docker0: port 1(vethce5fe56) entered disabled state

 - [X] Driver information from `nvidia-smi -a`

from the host:

==============NVSMI LOG==============

Timestamp : Mon Sep 13 22:23:14 2021
Driver Version : 470.63.01
CUDA Version : 11.4

Attached GPUs : 1 GPU 00000000:01:00.0 Product Name : NVIDIA GeForce GTX 1080 Ti Product Brand : GeForce Display Mode : Enabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-06986d8e-47c3-467c-c6bc-0a30ae3fbd30 Minor Number : 0 VBIOS Version : 86.02.39.00.FF MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.01.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x1B0610DE Bus Id : 00000000:01:00.0 Sub System Id : 0x376A1458 GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 16 % Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11175 MiB Used : 178 MiB Free : 10997 MiB BAR1 Memory Usage Total : 256 MiB Used : 5 MiB Free : 251 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 28 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 93 C GPU Max Operating Temp : N/A GPU Target Temperature : 84 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 16.48 W Power Limit : 250.00 W Default Power Limit : 250.00 W Enforced Power Limit : 250.00 W Min Power Limit : 125.00 W Max Power Limit : 375.00 W Clocks Graphics : 139 MHz SM : 139 MHz Memory : 405 MHz Video : 544 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2037 MHz SM : 2037 MHz Memory : 5616 MHz Video : 1620 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes GPU instance ID : N/A Compute instance ID : N/A Process ID : 5026 Type : G Name : /usr/lib/xorg/Xorg Used GPU Memory : 167 MiB GPU instance ID : N/A Compute instance ID : N/A 
Process ID : 5324 Type : G Name : /usr/bin/gnome-shell Used GPU Memory : 8 MiB

from the LXD container:

==============NVSMI LOG==============

Timestamp : Mon Sep 13 20:23:59 2021
Driver Version : 470.63.01
CUDA Version : 11.4

Attached GPUs : 1 GPU 00000000:01:00.0 Product Name : NVIDIA GeForce GTX 1080 Ti Product Brand : GeForce Display Mode : Enabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-06986d8e-47c3-467c-c6bc-0a30ae3fbd30 Minor Number : 0 VBIOS Version : 86.02.39.00.FF MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.01.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x1B0610DE Bus Id : 00000000:01:00.0 Sub System Id : 0x376A1458 GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 16 % Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11175 MiB Used : 178 MiB Free : 10997 MiB BAR1 Memory Usage Total : 256 MiB Used : 5 MiB Free : 251 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 28 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 93 C GPU Max Operating Temp : N/A GPU Target Temperature : 84 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 17.30 W Power Limit : 250.00 W Default Power Limit : 250.00 W Enforced Power Limit : 250.00 W Min Power Limit : 125.00 W Max Power Limit : 375.00 W Clocks Graphics : 139 MHz SM : 139 MHz Memory : 405 MHz Video : 544 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2037 MHz SM : 2037 MHz Memory : 5616 MHz Video : 1620 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes : None

 - [X] Docker version from `docker version`

Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:53:57 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:06 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [X] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

from the host:

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-=============================-==========================-============-========================================================= un libgldispatch0-nvidia (no description available) ii libnvidia-cfg1-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) un libnvidia-common (no description available) ii libnvidia-common-470 470.63.01-0ubuntu0.21.04.2 all Shared files used by the NVIDIA libraries un libnvidia-compute (no description available) rc libnvidia-compute-460:amd64 460.73.01-0ubuntu1 amd64 NVIDIA libcompute package rc libnvidia-compute-465:amd64 465.19.01-0ubuntu1 amd64 NVIDIA libcompute package ii libnvidia-compute-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA libcompute package ii libnvidia-container-tools 1.3.3-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.3.3-1 amd64 NVIDIA container runtime library un libnvidia-decode (no description available) ii libnvidia-decode-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA Video Decoding runtime libraries un libnvidia-encode (no description available) ii libnvidia-encode-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVENC Video Encoding runtime library un libnvidia-extra (no description available) ii libnvidia-extra-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 Extra libraries for the NVIDIA driver un libnvidia-fbc1 (no description available) ii libnvidia-fbc1-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library un libnvidia-gl (no description available) ii libnvidia-gl-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD un libnvidia-ifr1 (no description available) ii libnvidia-ifr1-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library un libnvidia-ml1 (no description available) un nvidia-384 (no description available) un nvidia-390 (no description available) un nvidia-common (no description available) un nvidia-compute-utils (no description available) rc nvidia-compute-utils-460 460.73.01-0ubuntu1 amd64 NVIDIA compute utilities rc nvidia-compute-utils-465 465.19.01-0ubuntu1 amd64 NVIDIA compute utilities ii nvidia-compute-utils-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA compute utilities ii nvidia-container-runtime 3.4.2-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.4.2-1 amd64 NVIDIA container runtime hook rc nvidia-dkms-460 460.73.01-0ubuntu1 amd64 NVIDIA DKMS package rc nvidia-dkms-465 465.19.01-0ubuntu1 amd64 NVIDIA DKMS package ii nvidia-dkms-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA DKMS package un nvidia-dkms-kernel (no description available) un nvidia-docker (no description available) ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper ii nvidia-driver-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA driver metapackage un nvidia-driver-binary (no description available) un nvidia-kernel-common (no description available) rc nvidia-kernel-common-460 460.73.01-0ubuntu1 amd64 Shared files used with the kernel module rc nvidia-kernel-common-465 465.19.01-0ubuntu1 amd64 Shared files used with the kernel module ii nvidia-kernel-common-470 470.63.01-0ubuntu0.21.04.2 amd64 
Shared files used with the kernel module un nvidia-kernel-source (no description available) un nvidia-kernel-source-460 (no description available) un nvidia-kernel-source-465 (no description available) ii nvidia-kernel-source-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA kernel source package un nvidia-libopencl1-dev (no description available) ii nvidia-modprobe 470.57.02-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files un nvidia-opencl-icd (no description available) un nvidia-persistenced (no description available) ii nvidia-prime 0.8.16.1 all Tools to enable NVIDIA's Prime ii nvidia-settings 470.57.02-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver un nvidia-settings-binary (no description available) un nvidia-smi (no description available) un nvidia-utils (no description available) ii nvidia-utils-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA driver support binaries ii xserver-xorg-video-nvidia-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA binary Xorg driver

from within the LXD container:

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-=============================-==========================-============-========================================================= un libgldispatch0-nvidia (no description available) ii libnvidia-cfg1-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) un libnvidia-common (no description available) ii libnvidia-common-470 470.63.01-0ubuntu0.21.04.2 all Shared files used by the NVIDIA libraries un libnvidia-compute (no description available) ii libnvidia-compute-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA libcompute package ii libnvidia-container-tools 1.5.0-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.5.0-1 amd64 NVIDIA container runtime library un libnvidia-decode (no description available) ii libnvidia-decode-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA Video Decoding runtime libraries un libnvidia-encode (no description available) ii libnvidia-encode-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVENC Video Encoding runtime library un libnvidia-extra (no description available) ii libnvidia-extra-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 Extra libraries for the NVIDIA driver un libnvidia-fbc1 (no description available) ii libnvidia-fbc1-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library un libnvidia-gl (no description available) ii libnvidia-gl-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD un libnvidia-ifr1 (no description available) ii libnvidia-ifr1-470:amd64 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library un libnvidia-ml1 (no description available) un nvidia-384 (no description available) un nvidia-390 (no description available) un nvidia-compute-utils (no description available) ii nvidia-compute-utils-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA compute utilities ii nvidia-container-runtime 3.5.0-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.5.1-1 amd64 NVIDIA container runtime hook ii nvidia-dkms-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA DKMS package un nvidia-dkms-kernel (no description available) un nvidia-docker (no description available) ii nvidia-docker2 2.6.0-1 all nvidia-docker CLI wrapper ii nvidia-driver-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA driver metapackage un nvidia-driver-binary (no description available) un nvidia-kernel-common (no description available) ii nvidia-kernel-common-470 470.63.01-0ubuntu0.21.04.2 amd64 Shared files used with the kernel module un nvidia-kernel-source (no description available) ii nvidia-kernel-source-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA kernel source package un nvidia-opencl-icd (no description available) un nvidia-persistenced (no description available) un nvidia-prime (no description available) un nvidia-settings (no description available) un nvidia-smi (no description available) un nvidia-utils (no description available) ii nvidia-utils-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA driver support binaries ii xserver-xorg-video-nvidia-470 470.63.01-0ubuntu0.21.04.2 amd64 NVIDIA binary Xorg driver

 - [X] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.5.0
build date: 2021-09-02T08:39+00:00
build revision: 4699c1b8b4991b6d869ea403e109291653bb040b
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [X] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))

cat /var/log/nvidia-container-toolkit.log

-- WARNING, the following logs are for debugging purposes only --

I0913 20:33:29.853991 1004 nvc.c:372] initializing library context (version=1.5.0, build=4699c1b8b4991b6d869ea403e109291653bb040b) I0913 20:33:29.854198 1004 nvc.c:346] using root / I0913 20:33:29.854244 1004 nvc.c:347] using ldcache /etc/ld.so.cache I0913 20:33:29.854281 1004 nvc.c:348] using unprivileged user 65534:65534 I0913 20:33:29.854338 1004 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0913 20:33:29.854676 1004 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment W0913 20:33:29.854752 1004 nvc.c:249] skipping kernel modules load due to user namespace I0913 20:33:29.854976 1010 driver.c:101] starting driver service I0913 20:33:29.863469 1004 nvc_container.c:388] configuring container with 'compute utility supervised' I0913 20:33:29.863950 1004 nvc_container.c:236] selecting /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/local/cuda-11.4/compat/libcuda.so.470.57.02 I0913 20:33:29.864131 1004 nvc_container.c:236] selecting /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.57.02 I0913 20:33:29.864518 1004 nvc_container.c:408] setting pid to 998 I0913 20:33:29.864566 1004 nvc_container.c:409] setting rootfs to /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0 I0913 20:33:29.864619 1004 nvc_container.c:410] setting owner to 0:0 I0913 20:33:29.864656 1004 nvc_container.c:411] setting bins directory to /usr/bin I0913 20:33:29.864693 1004 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu I0913 20:33:29.864728 1004 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu I0913 20:33:29.864764 1004 nvc_container.c:414] setting cudart directory to /usr/local/cuda I0913 20:33:29.864800 1004 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig.real (host relative) I0913 20:33:29.864847 1004 nvc_container.c:416] setting mount namespace to /proc/998/ns/mnt I0913 20:33:29.864883 1004 nvc_container.c:418] setting devices cgroup to /sys/fs/cgroup/devices/docker/dd7f4ee43c878e6ce63ccaba0c9b9a10d2834add60afb23ae14db0d2f90fb694 I0913 20:33:29.864928 1004 nvc_info.c:750] requesting driver information with '' I0913 20:33:29.866900 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.63.01 I0913 20:33:29.867044 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.63.01 I0913 20:33:29.867174 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.63.01 I0913 20:33:29.867286 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.63.01 I0913 20:33:29.867431 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.63.01 I0913 20:33:29.867572 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.63.01 I0913 20:33:29.867698 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.63.01 I0913 20:33:29.867809 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.63.01 I0913 20:33:29.867957 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.63.01 I0913 20:33:29.868097 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.63.01 I0913 20:33:29.868202 1004 nvc_info.c:171] selecting 
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.63.01 I0913 20:33:29.868324 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.63.01 I0913 20:33:29.868435 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.63.01 I0913 20:33:29.868578 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.63.01 I0913 20:33:29.868720 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.63.01 I0913 20:33:29.868826 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.63.01 I0913 20:33:29.868953 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.63.01 I0913 20:33:29.869096 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.63.01 I0913 20:33:29.869204 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.63.01 I0913 20:33:29.869348 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.63.01 I0913 20:33:29.869595 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.63.01 I0913 20:33:29.869810 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.63.01 I0913 20:33:29.869925 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.63.01 I0913 20:33:29.870033 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.63.01 I0913 20:33:29.870168 1004 nvc_info.c:171] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.63.01 W0913 20:33:29.870258 1004 nvc_info.c:392] missing library libnvidia-nscq.so W0913 20:33:29.870301 1004 nvc_info.c:392] missing library libnvidia-fatbinaryloader.so W0913 20:33:29.870337 1004 nvc_info.c:392] missing library libvdpau_nvidia.so W0913 20:33:29.870373 1004 nvc_info.c:396] missing compat32 library libnvidia-ml.so W0913 20:33:29.870409 1004 nvc_info.c:396] missing compat32 library libnvidia-cfg.so W0913 20:33:29.870444 1004 nvc_info.c:396] missing compat32 library libnvidia-nscq.so W0913 20:33:29.870494 1004 nvc_info.c:396] missing compat32 library libcuda.so W0913 20:33:29.870531 1004 nvc_info.c:396] missing compat32 library libnvidia-opencl.so W0913 20:33:29.870567 1004 nvc_info.c:396] missing compat32 library libnvidia-ptxjitcompiler.so W0913 20:33:29.870602 1004 nvc_info.c:396] missing compat32 library libnvidia-fatbinaryloader.so W0913 20:33:29.870638 1004 nvc_info.c:396] missing compat32 library libnvidia-allocator.so W0913 20:33:29.870673 1004 nvc_info.c:396] missing compat32 library libnvidia-compiler.so W0913 20:33:29.870722 1004 nvc_info.c:396] missing compat32 library libnvidia-ngx.so W0913 20:33:29.870758 1004 nvc_info.c:396] missing compat32 library libvdpau_nvidia.so W0913 20:33:29.870795 1004 nvc_info.c:396] missing compat32 library libnvidia-encode.so W0913 20:33:29.870830 1004 nvc_info.c:396] missing compat32 library libnvidia-opticalflow.so W0913 20:33:29.870866 1004 nvc_info.c:396] missing compat32 library libnvcuvid.so W0913 20:33:29.870902 1004 nvc_info.c:396] missing compat32 library libnvidia-eglcore.so W0913 20:33:29.870949 1004 nvc_info.c:396] missing compat32 library libnvidia-glcore.so W0913 20:33:29.870985 1004 nvc_info.c:396] missing compat32 library libnvidia-tls.so W0913 20:33:29.871021 1004 nvc_info.c:396] missing compat32 library libnvidia-glsi.so W0913 20:33:29.871057 1004 nvc_info.c:396] missing compat32 library libnvidia-fbc.so W0913 20:33:29.871092 1004 nvc_info.c:396] missing compat32 library libnvidia-ifr.so 
W0913 20:33:29.871128 1004 nvc_info.c:396] missing compat32 library libnvidia-rtcore.so W0913 20:33:29.871176 1004 nvc_info.c:396] missing compat32 library libnvoptix.so W0913 20:33:29.871213 1004 nvc_info.c:396] missing compat32 library libGLX_nvidia.so W0913 20:33:29.871248 1004 nvc_info.c:396] missing compat32 library libEGL_nvidia.so W0913 20:33:29.871283 1004 nvc_info.c:396] missing compat32 library libGLESv2_nvidia.so W0913 20:33:29.871319 1004 nvc_info.c:396] missing compat32 library libGLESv1_CM_nvidia.so W0913 20:33:29.871354 1004 nvc_info.c:396] missing compat32 library libnvidia-glvkspirv.so W0913 20:33:29.871402 1004 nvc_info.c:396] missing compat32 library libnvidia-cbl.so I0913 20:33:29.871985 1004 nvc_info.c:297] selecting /usr/bin/nvidia-smi I0913 20:33:29.872089 1004 nvc_info.c:297] selecting /usr/bin/nvidia-debugdump I0913 20:33:29.872170 1004 nvc_info.c:297] selecting /usr/bin/nvidia-persistenced I0913 20:33:29.872269 1004 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-control I0913 20:33:29.872342 1004 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-server W0913 20:33:29.872652 1004 nvc_info.c:418] missing binary nv-fabricmanager I0913 20:33:29.872750 1004 nvc_info.c:512] listing device /dev/nvidiactl I0913 20:33:29.872793 1004 nvc_info.c:512] listing device /dev/nvidia-uvm I0913 20:33:29.872829 1004 nvc_info.c:512] listing device /dev/nvidia-uvm-tools I0913 20:33:29.872865 1004 nvc_info.c:512] listing device /dev/nvidia-modeset W0913 20:33:29.872945 1004 nvc_info.c:342] missing ipc /var/run/nvidia-persistenced/socket W0913 20:33:29.873024 1004 nvc_info.c:342] missing ipc /var/run/nvidia-fabricmanager/socket W0913 20:33:29.873107 1004 nvc_info.c:342] missing ipc /tmp/nvidia-mps I0913 20:33:29.873149 1004 nvc_info.c:805] requesting device information with '' I0913 20:33:29.880072 1004 nvc_info.c:697] listing device /dev/nvidia0 (GPU-06986d8e-47c3-467c-c6bc-0a30ae3fbd30 at 00000000:01:00.0) I0913 20:33:29.880343 1004 nvc_mount.c:344] mounting tmpfs at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/proc/driver/nvidia I0913 20:33:29.881942 1004 nvc_mount.c:112] mounting /usr/bin/nvidia-smi at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/bin/nvidia-smi I0913 20:33:29.882284 1004 nvc_mount.c:112] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/bin/nvidia-debugdump I0913 20:33:29.882570 1004 nvc_mount.c:112] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/bin/nvidia-persistenced I0913 20:33:29.882896 1004 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/bin/nvidia-cuda-mps-control I0913 20:33:29.883229 1004 nvc_mount.c:112] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/bin/nvidia-cuda-mps-server I0913 20:33:29.883827 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.63.01 I0913 20:33:29.884117 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.63.01 at 
/var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.63.01 I0913 20:33:29.884449 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libcuda.so.470.63.01 I0913 20:33:29.884795 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.63.01 I0913 20:33:29.885117 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.63.01 I0913 20:33:29.885396 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.63.01 I0913 20:33:29.885710 1004 nvc_mount.c:112] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.63.01 I0913 20:33:29.885866 1004 nvc_mount.c:524] creating symlink /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1 I0913 20:33:29.886609 1004 nvc_mount.c:63] mounting /lib/firmware/nvidia/470.63.01 at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/usr/lib/firmware/nvidia/470.63.01 I0913 20:33:29.886913 1004 nvc_mount.c:208] mounting /dev/nvidiactl at /var/lib/docker/btrfs/subvolumes/be11006c908fb293162fe6b4ded3bdacc0858a9f4f82a98372c000d5e769f6e0/dev/nvidiactl I0913 20:33:29.887090 1004 nvc_mount.c:499] whitelisting device node 195:255 I0913 20:33:29.889227 1004 nvc.c:423] shutting down library context I0913 20:33:29.890167 1010 driver.c:163] terminating driver service I0913 20:33:29.891254 1004 driver.c:203] driver service terminated successfully

 - [X] Docker command, image and tag used

Mon Sep 13 20:41:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 16%   27C    P8    16W / 250W |    178MiB / 11175MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: write error: /sys/fs/cgroup/devices/docker/9e199f6f3e7e69766ce196d617b7e623f506c186b371fd732250ef8d1f1f0631/devices.allow: operation not permitted: unknown.

iegorval commented 2 years ago

@waldekkot did you manage to solve it somehow? Thanks!

sfc-gh-wkot commented 2 years ago

@iegorval unfortunately, there have been no changes to how it works...

klueska commented 2 years ago

We are currently in the process of re-architecting the nvidia-docker stack, and I'd be curious to know if this issue is resolved by the new stack.

Can you try replacing your current nvidia-container-runtime binary with the "experimental" one from here:

docker cp $(docker create --rm nvcr.io/nvidia/k8s/container-toolkit:v1.8.0-rc.2-ubuntu18.04):/work/nvidia-container-runtime.experimental .

And then invoke docker using the NVIDIA_VISIBLE_DEVICES environment variable rather than the --gpus flag.
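A hedged sketch of putting that extracted binary in place and invoking it (the install path assumes the packaged runtime's usual location, /usr/bin/nvidia-container-runtime; adjust to your setup):

```
# back up the packaged runtime, then swap in the experimental build
cp /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-runtime.orig
cp nvidia-container-runtime.experimental /usr/bin/nvidia-container-runtime
chmod +x /usr/bin/nvidia-container-runtime
# select the nvidia runtime explicitly and use the env var instead of --gpus
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:11.4.1-base-ubuntu20.04 nvidia-smi
```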

vsltimkay commented 2 years ago

A quick update to this in case @waldekkot has moved on:

Container toolkit version v1.8.0-rc.2-ubuntu18.04 as above is now the standard install via apt if you've configured the experimental packages repo. Using that version (or the file pulled from the container above), the problem still exists as detailed. There's (slightly) more info in the error message, in that it now states "failed to add device rules":

docker -D run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:11.0-base nvidia-smi

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: write /sys/fs/cgroup/devices/docker/ca6bba3e85ac368ca5310907cbcd9b2fd404c83077323cd84b49a3b541019785/devices.allow: operation not permitted: unknown.

The error is identical whether you use the environment variable or --gpus as arguments.

klueska commented 2 years ago

This should have been fixed in v1.8.1.

vsltimkay commented 2 years ago

Hmm, assuming the 1.9.0-1 release is ahead of that, it's still broken there. I confess I'm struggling to debug this; obviously happy to diagnose further if anyone can point me in the correct direction:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:11.0-base nvidia-smi

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: write /sys/fs/cgroup/devices/docker/f9352b6b081710baa40d6ba036102e79c228c063afc16ca88ef21212f02f0ad5/devices.allow: operation not permitted: unknown.
ERRO[0002] error waiting for container: context canceled


```
ii  libnvidia-container-tools      1.9.0-1                             amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64     1.9.0-1                             amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit       1.9.0-1                             amd64        NVIDIA container runtime hook
ii  nvidia-docker2                 2.10.0-1                            all          nvidia-docker CLI wrapper
```

klueska commented 2 years ago

Hmm. So this works for you if you downgrade to, say, libnvidia-container v1.7.0? But it's broken on the latest?

klueska commented 2 years ago

Looking more closely at the linked issue, it seems that this is failing "by design" at the moment (and would also fail on older versions of libnvidia-container, not just the newest one).

That error should really be non-fatal in the case of nested containers. It may be worth filing an issue against libnvidia-container to have them relax error handling on this particular case.

Unprivileged containers aren't allowed to modify devices.allow/devices.deny, but that doesn't mean the device in question isn't already allowed (as it is in this case).

I think what you want to do is probably just uncomment `no-cgroups = true` in your `/etc/nvidia-container-runtime/config.toml` file.
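A minimal sketch of that change (assuming the stock config.toml, where the line ships commented out), plus a daemon restart to pick it up:

```
# enable no-cgroups in the toolkit config (run inside the LXD container)
sed -i 's/^#\?\s*no-cgroups\s*=.*/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
grep no-cgroups /etc/nvidia-container-runtime/config.toml   # expect: no-cgroups = true
systemctl restart docker
```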

vsltimkay commented 2 years ago

Excellent! Thank you @klueska, that fixed the issue there.

For reference, nvidia-docker is still not working in unprivileged mode as above without some more work. It's necessary to set `raw.apparmor` values within LXC to allow access to /proc/driver/nvidia/gpus/<bus_id> (e.g. 0000:01:00.0), as otherwise nvidia-container-cli fails to mount. That's very much an LXC thing rather than an nvidia-docker issue, though.
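The exact `raw.apparmor` rules aren't shown in this thread; purely as an illustration of the mechanism (the mount rule below is a hypothetical example, not the tested fix — check dmesg on the host for the actual AppArmor denial and write the rule to match it):

```
# on the LXD host: append an AppArmor rule to the container's profile
# (illustrative rule only; tailor it to the denial reported in dmesg)
lxc config set demo3 raw.apparmor 'mount fstype=tmpfs -> /proc/driver/nvidia/**,'
lxc restart demo3
```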

Thanks again.

HarryCoops commented 1 year ago

Hey @vsltimkay, I appreciate this is a fairly old thread, but could you possibly share how you set the apparmor values to allow the container access to the /proc/driver/nvidia/gpus directory? I've followed the other steps in the thread, and currently all my container can see in /proc/driver/nvidia is `params`, `registry`, and `version`.

Thanks!

the729 commented 1 year ago

> Looking more closely at the linked issue, it seems that this is failing "by design" at the moment (and would also fail on older versions of libnvidia-container, not just the newest one).
>
> That error should really be non-fatal in the case of nested containers. It may be worth filing an issue against libnvidia-container to have them relax error handling on this particular case. Unprivileged containers aren't allowed to modify devices.allow/devices.deny, but that doesn't mean the device in question isn't already allowed (as it is in this case).
>
> I think what you want to do is probably just uncomment `no-cgroups = true` in your `/etc/nvidia-container-runtime/config.toml` file.

The workaround of setting `no-cgroups = true` does not work with NVIDIA Container Toolkit v1.14.0. It works with v1.13.x.

elezar commented 1 year ago

> Looking more closely at the linked issue, it seems that this is failing "by design" at the moment (and would also fail on older versions of libnvidia-container, not just the newest one).
>
> That error should really be non-fatal in the case of nested containers. It may be worth filing an issue against libnvidia-container to have them relax error handling on this particular case. Unprivileged containers aren't allowed to modify devices.allow/devices.deny, but that doesn't mean the device in question isn't already allowed (as it is in this case).
>
> I think what you want to do is probably just uncomment `no-cgroups = true` in your `/etc/nvidia-container-runtime/config.toml` file.
>
> The workaround of setting `no-cgroups = true` does not work with NVIDIA Container Toolkit v1.14.0. It works with v1.13.x.

@the729 there is a known issue in the 1.14.0 release related to applying config options from the config file. This has been resolved in the 1.14.1 release. Is that available to you?
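For anyone on 1.14.0, a minimal upgrade sketch (assuming the NVIDIA apt repository is already configured):

```
apt-get update
apt-get install --only-upgrade nvidia-container-toolkit
nvidia-container-cli -V   # confirm the library reports >= 1.14.1
```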

the729 commented 1 year ago

> @the729 there is a known issue in the 1.14.0 release related to applying config options from the config file. This has been resolved in the 1.14.1 release. Is that available to you?

It works. Thank you.