NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

cgroup issue with nvidia container runtime on Debian testing #1447

Closed: super-cooper closed this issue 3 years ago

super-cooper commented 3 years ago

1. Issue or feature description

Whenever I try to build or run an NVIDIA container, Docker fails with the error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

2. Steps to reproduce the issue

$ docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

Device Index: 0
Device Minor: 0
Model: GeForce GTX 980 Ti
Brand: GeForce
GPU UUID: GPU-6518be5e-14ff-e277-21aa-73b482890bee
Bus Location: 00000000:07:00.0
Architecture: 5.2
I0107 20:43:11.947903 36435 nvc.c:337] shutting down library context
I0107 20:43:11.948696 36437 driver.c:156] terminating driver service
I0107 20:43:11.949026 36435 driver.c:196] driver service terminated successfully

 - [x] Kernel version from `uname -a`

Linux lambda 5.8.0-3-amd64 #1 SMP Debian 5.8.14-1 (2020-10-10) x86_64 GNU/Linux

 - [ ] Any relevant kernel output lines from `dmesg`
 - [x] Driver information from `nvidia-smi -a`

Thu Jan 7 15:45:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  On   | 00000000:07:00.0  On |                  N/A |
|  0%   45C    P5    29W / 250W |    403MiB /  6083MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3023      G   /usr/lib/xorg/Xorg                177MiB |
|    0   N/A  N/A      4833      G   /usr/bin/gnome-shell              166MiB |
|    0   N/A  N/A      7609      G   ...AAAAAAAAA= --shared-files       54MiB |
+-----------------------------------------------------------------------------+

 - [x] Docker version from `docker version`

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 nvidia:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-======================================-==============-============-================================================================= un bumblebee-nvidia (no description available) ii glx-alternative-nvidia 1.2.0 amd64 allows the selection of NVIDIA as GLX provider un libegl-nvidia-legacy-390xx0 (no description available) un libegl-nvidia-tesla-418-0 (no description available) un libegl-nvidia-tesla-440-0 (no description available) un libegl-nvidia-tesla-450-0 (no description available) ii libegl-nvidia0:amd64 450.80.02-2 amd64 NVIDIA binary EGL library ii libegl-nvidia0:i386 450.80.02-2 i386 NVIDIA binary EGL library un libegl1-glvnd-nvidia (no description available) un libegl1-nvidia (no description available) un libgl1-glvnd-nvidia-glx (no description available) ii libgl1-nvidia-glvnd-glx:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant) ii libgl1-nvidia-glvnd-glx:i386 450.80.02-2 i386 NVIDIA binary OpenGL/GLX library (GLVND variant) un libgl1-nvidia-glx (no description available) un libgl1-nvidia-glx-any (no description available) un libgl1-nvidia-glx-i386 (no description available) un libgl1-nvidia-legacy-390xx-glx (no description available) un libgl1-nvidia-tesla-418-glx (no description available) un libgldispatch0-nvidia (no description available) ii libgles-nvidia1:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL|ES 1.x library ii libgles-nvidia1:i386 450.80.02-2 i386 NVIDIA binary OpenGL|ES 1.x library ii libgles-nvidia2:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL|ES 2.x library ii libgles-nvidia2:i386 450.80.02-2 i386 NVIDIA binary OpenGL|ES 2.x library un libgles1-glvnd-nvidia (no description available) un libgles2-glvnd-nvidia (no description available) un libglvnd0-nvidia (no description available) ii libglx-nvidia0:amd64 450.80.02-2 amd64 NVIDIA binary GLX library ii libglx-nvidia0:i386 450.80.02-2 i386 NVIDIA binary GLX library un libglx0-glvnd-nvidia (no description available) un libnvidia-cbl (no description available) un libnvidia-cfg.so.1 (no description available) ii libnvidia-cfg1:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) ii libnvidia-container-tools 1.3.1-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.3.1-1 amd64 NVIDIA container runtime library ii libnvidia-eglcore:amd64 450.80.02-2 amd64 NVIDIA binary EGL core libraries ii libnvidia-eglcore:i386 450.80.02-2 i386 NVIDIA binary EGL core libraries un libnvidia-eglcore-450.80.02 (no description available) ii libnvidia-encode1:amd64 450.80.02-2 amd64 NVENC Video Encoding runtime library ii libnvidia-glcore:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL/GLX core libraries ii libnvidia-glcore:i386 450.80.02-2 i386 NVIDIA binary OpenGL/GLX core libraries un libnvidia-glcore-450.80.02 (no description available) ii libnvidia-glvkspirv:amd64 450.80.02-2 amd64 NVIDIA binary Vulkan Spir-V compiler library ii libnvidia-glvkspirv:i386 450.80.02-2 i386 NVIDIA binary Vulkan Spir-V compiler library un libnvidia-glvkspirv-450.80.02 (no description available) un libnvidia-legacy-340xx-cfg1 (no description available) un libnvidia-legacy-390xx-cfg1 (no description available) ii libnvidia-ml-dev:amd64 11.1.1-3 amd64 NVIDIA Management Library (NVML) development files un 
libnvidia-ml.so.1 (no description available) ii libnvidia-ml1:amd64 450.80.02-2 amd64 NVIDIA Management Library (NVML) runtime library ii libnvidia-ptxjitcompiler1:amd64 450.80.02-2 amd64 NVIDIA PTX JIT Compiler ii libnvidia-rtcore:amd64 450.80.02-2 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library un libnvidia-rtcore-450.80.02 (no description available) un libnvidia-tesla-418-cfg1 (no description available) un libnvidia-tesla-440-cfg1 (no description available) un libnvidia-tesla-450-cfg1 (no description available) un libnvidia-tesla-450-cuda1 (no description available) un libnvidia-tesla-450-ml1 (no description available) un libopengl0-glvnd-nvidia (no description available) ii nvidia-alternative 450.80.02-2 amd64 allows the selection of NVIDIA as GLX provider un nvidia-alternative--kmod-alias (no description available) un nvidia-alternative-legacy-173xx (no description available) un nvidia-alternative-legacy-71xx (no description available) un nvidia-alternative-legacy-96xx (no description available) ii nvidia-container-runtime 3.4.0-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.4.0-1 amd64 NVIDIA container runtime hook ii nvidia-cuda-dev:amd64 11.1.1-3 amd64 NVIDIA CUDA development files un nvidia-cuda-doc (no description available) ii nvidia-cuda-gdb 11.1.1-3 amd64 NVIDIA CUDA Debugger (GDB) un nvidia-cuda-mps (no description available) ii nvidia-cuda-toolkit 11.1.1-3 amd64 NVIDIA CUDA development toolkit ii nvidia-cuda-toolkit-doc 11.1.1-3 all NVIDIA CUDA and OpenCL documentation un nvidia-current (no description available) un nvidia-current-updates (no description available) un nvidia-docker (no description available) ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper ii nvidia-driver 450.80.02-2 amd64 NVIDIA metapackage un nvidia-driver-any (no description available) ii nvidia-driver-bin 450.80.02-2 amd64 NVIDIA driver support binaries un nvidia-driver-bin-450.80.02 (no description available) un nvidia-driver-binary (no description available) ii nvidia-driver-libs:amd64 450.80.02-2 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) ii nvidia-driver-libs:i386 450.80.02-2 i386 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) un nvidia-driver-libs-any (no description available) un nvidia-driver-libs-nonglvnd (no description available) ii nvidia-egl-common 450.80.02-2 amd64 NVIDIA binary EGL driver - common files ii nvidia-egl-icd:amd64 450.80.02-2 amd64 NVIDIA EGL installable client driver (ICD) ii nvidia-egl-icd:i386 450.80.02-2 i386 NVIDIA EGL installable client driver (ICD) un nvidia-glx-any (no description available) ii nvidia-installer-cleanup 20151021+12 amd64 cleanup after driver installation with the nvidia-installer un nvidia-kernel-450.80.02 (no description available) ii nvidia-kernel-common 20151021+12 amd64 NVIDIA binary kernel module support files ii nvidia-kernel-dkms 450.80.02-2 amd64 NVIDIA binary kernel module DKMS source un nvidia-kernel-source (no description available) ii nvidia-kernel-support 450.80.02-2 amd64 NVIDIA binary kernel module support files un nvidia-kernel-support--v1 (no description available) un nvidia-kernel-support-any (no description available) un nvidia-legacy-304xx-alternative (no description available) un nvidia-legacy-304xx-driver (no description available) un nvidia-legacy-340xx-alternative (no description available) un nvidia-legacy-340xx-vdpau-driver (no description available) un nvidia-legacy-390xx-vdpau-driver (no description available) un 
nvidia-legacy-390xx-vulkan-icd (no description available) ii nvidia-legacy-check 450.80.02-2 amd64 check for NVIDIA GPUs requiring a legacy driver un nvidia-libopencl1 (no description available) un nvidia-libopencl1-dev (no description available) ii nvidia-modprobe 460.27.04-1 amd64 utility to load NVIDIA kernel modules and create device nodes un nvidia-nonglvnd-vulkan-common (no description available) un nvidia-nonglvnd-vulkan-icd (no description available) un nvidia-opencl-dev (no description available) un nvidia-opencl-icd (no description available) un nvidia-openjdk-8-jre (no description available) ii nvidia-persistenced 450.57-1 amd64 daemon to maintain persistent software state in the NVIDIA driver ii nvidia-profiler 11.1.1-3 amd64 NVIDIA Profiler for CUDA and OpenCL ii nvidia-settings 450.80.02-1+b1 amd64 tool for configuring the NVIDIA graphics driver un nvidia-settings-gtk-450.80.02 (no description available) ii nvidia-smi 450.80.02-2 amd64 NVIDIA System Management Interface ii nvidia-support 20151021+12 amd64 NVIDIA binary graphics driver support files un nvidia-tesla-418-vdpau-driver (no description available) un nvidia-tesla-418-vulkan-icd (no description available) un nvidia-tesla-440-vdpau-driver (no description available) un nvidia-tesla-440-vulkan-icd (no description available) un nvidia-tesla-450-driver (no description available) un nvidia-tesla-450-vulkan-icd (no description available) un nvidia-tesla-alternative (no description available) ii nvidia-vdpau-driver:amd64 450.80.02-2 amd64 Video Decode and Presentation API for Unix - NVIDIA driver ii nvidia-visual-profiler 11.1.1-3 amd64 NVIDIA Visual Profiler for CUDA and OpenCL ii nvidia-vulkan-common 450.80.02-2 amd64 NVIDIA Vulkan driver - common files ii nvidia-vulkan-icd:amd64 450.80.02-2 amd64 NVIDIA Vulkan installable client driver (ICD) ii nvidia-vulkan-icd:i386 450.80.02-2 i386 NVIDIA Vulkan installable client driver (ICD) un nvidia-vulkan-icd-any (no description available) ii xserver-xorg-video-nvidia 450.80.02-2 amd64 NVIDIA binary Xorg driver un xserver-xorg-video-nvidia-any (no description available) un xserver-xorg-video-nvidia-legacy-304xx (no description available)

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.3.1
build date: 2020-12-14T14:18+00:00
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
 - [x] Docker command, image and tag used

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

DanielCeregatti commented 3 years ago

Hi,

I'm experiencing the same issue. For now I've worked around it:

In /etc/nvidia-container-runtime/config.toml I've set no-cgroups = true, and now the container starts, but the NVIDIA devices are not added to the container. Once the devices are added, the container works again.
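
For reference, that setting lives in the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml, and the change described above would look something like this:

    [nvidia-container-cli]
    no-cgroups = true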

Here are the relevant lines from my docker-compose.yml:

    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

This is equivalent to passing --device flags to docker run, but I'm not sure of the exact syntax (see the sketch below).
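
For reference, the docker run equivalent would look something like this (the device mappings simply mirror the compose entries above):

    docker run --rm --gpus all \
      --device /dev/nvidia0:/dev/nvidia0 \
      --device /dev/nvidiactl:/dev/nvidiactl \
      --device /dev/nvidia-modeset:/dev/nvidia-modeset \
      --device /dev/nvidia-uvm:/dev/nvidia-uvm \
      --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools \
      nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi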

Hope this helps.

lissyx commented 3 years ago

This seems to be related to systemd upgrade to 247.2-2 which was uploaded to sid three weeks ago and made its way to testing now. This commit highlights the change of cgroup hierarchy: https://salsa.debian.org/systemd-team/systemd/-/commit/170fb124a32884bd9975ee4ea9e1ffbbc2ee26b4

Indeed, the default setup no longer exposes /sys/fs/cgroup/devices, which libnvidia-container uses, per https://github.com/NVIDIA/libnvidia-container/blob/ac02636a318fe7dcc71eaeb3cc55d0c8541c1072/src/nvc_container.c#L379-L382

Using the documented systemd.unified_cgroup_hierarchy=false kernel command-line parameter brings back the /sys/fs/cgroup/devices entry, and libnvidia-container is happy again.
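
(Not from the original comment: a quick way to check which hierarchy is active is to query the filesystem type of /sys/fs/cgroup; cgroup2fs indicates the unified v2 hierarchy, tmpfs the legacy v1 layout.)

    stat -fc %T /sys/fs/cgroup/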

klueska commented 3 years ago

@lissyx Thank you for pointing out the crux of the issue. We are in the process of rearchitecting the NVIDIA container stack in such a way that issues like this should not exist in the future (because we will rely on runc (or whatever the configured container runtime is) to do all cgroup setup instead of doing it ourselves).

That said, this rearchitecting effort will take at least another 9 months to complete. In the meantime, I'm curious what the impact is, and how difficult it would be to add cgroup v2 support to libnvidia-container to prevent issues like this until the rearchitecting is complete.

seemethere commented 3 years ago

Wanted to also chime in to say that I'm also experiencing this on Fedora 33

mathstuf commented 3 years ago

Could the title be updated to indicate that it is systemd cgroup layout related?

klueska commented 3 years ago

I was under the impression this issue was related to adding cgroup v2 support.

The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49

And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2

If these resolve this issue, please comment and close. Thanks.

super-cooper commented 3 years ago

> I was under the impression this issue was related to adding cgroup v2 support.
>
> The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49
>
> And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2
>
> If these resolve this issue, please comment and close. Thanks.

Issue resolved by the latest release. Thank you everyone <3

regzon commented 3 years ago

> > I was under the impression this issue was related to adding cgroup v2 support. The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49 And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2 If these resolve this issue, please comment and close. Thanks.
>
> Issue resolved by the latest release. Thank you everyone <3

Did you set the following parameter: systemd.unified_cgroup_hierarchy=false?

Or did you just upgrade all the packages?

super-cooper commented 3 years ago

> > I was under the impression this issue was related to adding cgroup v2 support. The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49 And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2 If these resolve this issue, please comment and close. Thanks.
> >
> > Issue resolved by the latest release. Thank you everyone <3
>
> Did you set the following parameter: systemd.unified_cgroup_hierarchy=false?
>
> Or did you just upgrade all the packages?

For me it was solved by upgrading the package.

regzon commented 3 years ago

Thank you, @super-cooper, for the reply.

I am having exactly the same issue on Debian Testing even after an upgrade.

1. Issue or feature description

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

2. Steps to reproduce the issue

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

Device Index: 0
Device Minor: 0
Model: GeForce GTX 960M
Brand: GeForce
GPU UUID: GPU-6064a007-a943-7f11-1ad7-12ac87046652
Bus Location: 00000000:01:00.0
Architecture: 5.0
I0130 05:23:50.516775 4486 nvc.c:337] shutting down library context
I0130 05:23:50.517704 4488 driver.c:156] terminating driver service
I0130 05:23:50.518087 4486 driver.c:196] driver service terminated successfully

 - [x] Kernel version from `uname -a`

Linux stas 5.10.0-2-amd64 #1 SMP Debian 5.10.9-1 (2021-01-20) x86_64 GNU/Linux

 - [x] Any relevant kernel output lines from `dmesg`

[ 487.597570] docker0: port 1(vethb7a49e6) entered blocking state
[ 487.597573] docker0: port 1(vethb7a49e6) entered disabled state
[ 487.597786] device vethb7a49e6 entered promiscuous mode
[ 487.773120] docker0: port 1(vethb7a49e6) entered disabled state
[ 487.776548] device vethb7a49e6 left promiscuous mode
[ 487.776556] docker0: port 1(vethb7a49e6) entered disabled state

 - [x] Driver information from `nvidia-smi -a`

Timestamp       : Sat Jan 30 08:26:51 2021
Driver Version  : 460.32.03
CUDA Version    : 11.2

Attached GPUs : 1 GPU 00000000:01:00.0 Product Name : GeForce GTX 960M Product Brand : GeForce Display Mode : Disabled Display Active : Disabled Persistence Mode : Enabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-6064a007-a943-7f11-1ad7-12ac87046652 Minor Number : 0 VBIOS Version : 82.07.82.00.10 MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Inforom Version Image Version : N/A OEM Object : N/A ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x139B10DE Bus Id : 00000000:01:00.0 Sub System Id : 0x380217AA GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : N/A Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : N/A HW Power Brake Slowdown : N/A Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 4046 MiB Used : 4 MiB Free : 4042 MiB BAR1 Memory Usage Total : 256 MiB Used : 1 MiB Free : 255 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 33 C GPU Shutdown Temp : 101 C GPU Slowdown Temp : 96 C GPU Max Operating Temp : 92 C GPU Target Temperature : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : N/A Power Draw : N/A Power Limit : N/A Default Power Limit : N/A Enforced Power Limit : N/A Min Power Limit : N/A Max Power Limit : N/A Clocks Graphics : 135 MHz SM : 135 MHz Memory : 405 MHz Video : 405 MHz Applications Clocks Graphics : 1097 MHz Memory : 2505 MHz Default Applications Clocks Graphics : 1097 MHz Memory : 2505 MHz Max Clocks Graphics : 1202 MHz SM : 1202 MHz Memory : 2505 MHz Video : 1081 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes GPU instance ID : N/A Compute instance ID : N/A Process ID : 1351 Type : G Name : /usr/lib/xorg/Xorg Used GPU Memory : 2 MiB

 - [x] Docker version from `docker version`

Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:34 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-======================================-==============================-============-================================================================= un bumblebee-nvidia (no description available) ii glx-alternative-nvidia 1.2.0 amd64 allows the selection of NVIDIA as GLX provider un libegl-nvidia-legacy-390xx0 (no description available) un libegl-nvidia-tesla-418-0 (no description available) un libegl-nvidia-tesla-440-0 (no description available) un libegl-nvidia-tesla-450-0 (no description available) ii libegl-nvidia0:amd64 460.32.03-1 amd64 NVIDIA binary EGL library un libegl1-glvnd-nvidia (no description available) un libegl1-nvidia (no description available) un libgl1-glvnd-nvidia-glx (no description available) ii libgl1-nvidia-glvnd-glx:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant) un libgl1-nvidia-glx (no description available) un libgl1-nvidia-glx-any (no description available) un libgl1-nvidia-glx-i386 (no description available) un libgl1-nvidia-legacy-390xx-glx (no description available) un libgl1-nvidia-tesla-418-glx (no description available) un libgldispatch0-nvidia (no description available) ii libgles-nvidia1:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL|ES 1.x library ii libgles-nvidia2:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL|ES 2.x library un libgles1-glvnd-nvidia (no description available) un libgles2-glvnd-nvidia (no description available) un libglvnd0-nvidia (no description available) ii libglx-nvidia0:amd64 460.32.03-1 amd64 NVIDIA binary GLX library un libglx0-glvnd-nvidia (no description available) ii libnvidia-cbl:amd64 460.32.03-1 amd64 NVIDIA binary Vulkan ray tracing (cbl) library un libnvidia-cbl-460.32.03 (no description available) un libnvidia-cfg.so.1 (no description available) ii libnvidia-cfg1:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) ii libnvidia-container-tools 1.3.2-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.3.2-1 amd64 NVIDIA container runtime library ii libnvidia-eglcore:amd64 460.32.03-1 amd64 NVIDIA binary EGL core libraries un libnvidia-eglcore-460.32.03 (no description available) ii libnvidia-glcore:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL/GLX core libraries un libnvidia-glcore-460.32.03 (no description available) ii libnvidia-glvkspirv:amd64 460.32.03-1 amd64 NVIDIA binary Vulkan Spir-V compiler library un libnvidia-glvkspirv-460.32.03 (no description available) un libnvidia-legacy-340xx-cfg1 (no description available) un libnvidia-legacy-390xx-cfg1 (no description available) un libnvidia-ml.so.1 (no description available) ii libnvidia-ml1:amd64 460.32.03-1 amd64 NVIDIA Management Library (NVML) runtime library ii libnvidia-ptxjitcompiler1:amd64 460.32.03-1 amd64 NVIDIA PTX JIT Compiler ii libnvidia-rtcore:amd64 460.32.03-1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library un libnvidia-rtcore-460.32.03 (no description available) un libnvidia-tesla-418-cfg1 (no description available) un libnvidia-tesla-440-cfg1 (no description available) un libnvidia-tesla-450-cfg1 (no description available) un libopengl0-glvnd-nvidia (no description available) ii nvidia-alternative 460.32.03-1 amd64 allows the selection of NVIDIA as GLX provider un 
nvidia-alternative--kmod-alias (no description available) un nvidia-alternative-legacy-173xx (no description available) un nvidia-alternative-legacy-71xx (no description available) un nvidia-alternative-legacy-96xx (no description available) ii nvidia-container-runtime 3.4.1-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.4.1-1 amd64 NVIDIA container runtime hook un nvidia-cuda-mps (no description available) un nvidia-current (no description available) un nvidia-current-updates (no description available) ii nvidia-detect 460.32.03-1 amd64 NVIDIA GPU detection utility un nvidia-docker (no description available) ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper ii nvidia-driver 460.32.03-1 amd64 NVIDIA metapackage un nvidia-driver-any (no description available) ii nvidia-driver-bin 460.32.03-1 amd64 NVIDIA driver support binaries un nvidia-driver-bin-460.32.03 (no description available) un nvidia-driver-binary (no description available) ii nvidia-driver-libs:amd64 460.32.03-1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) un nvidia-driver-libs-any (no description available) un nvidia-driver-libs-nonglvnd (no description available) ii nvidia-egl-common 460.32.03-1 amd64 NVIDIA binary EGL driver - common files ii nvidia-egl-icd:amd64 460.32.03-1 amd64 NVIDIA EGL installable client driver (ICD) un nvidia-glx-any (no description available) ii nvidia-installer-cleanup 20151021+13 amd64 cleanup after driver installation with the nvidia-installer un nvidia-kernel-460.32.03 (no description available) ii nvidia-kernel-common 20151021+13 amd64 NVIDIA binary kernel module support files ii nvidia-kernel-dkms 460.32.03-1 amd64 NVIDIA binary kernel module DKMS source un nvidia-kernel-source (no description available) ii nvidia-kernel-support 460.32.03-1 amd64 NVIDIA binary kernel module support files un nvidia-kernel-support--v1 (no description available) un nvidia-kernel-support-any (no description available) un nvidia-legacy-304xx-alternative (no description available) un nvidia-legacy-304xx-driver (no description available) un nvidia-legacy-340xx-alternative (no description available) un nvidia-legacy-340xx-vdpau-driver (no description available) un nvidia-legacy-390xx-vdpau-driver (no description available) un nvidia-legacy-390xx-vulkan-icd (no description available) ii nvidia-legacy-check 460.32.03-1 amd64 check for NVIDIA GPUs requiring a legacy driver un nvidia-libopencl1-dev (no description available) ii nvidia-modprobe 460.32.03-1 amd64 utility to load NVIDIA kernel modules and create device nodes un nvidia-nonglvnd-vulkan-common (no description available) un nvidia-nonglvnd-vulkan-icd (no description available) un nvidia-opencl-icd (no description available) ii nvidia-openjdk-8-jre 9.+8u272-b10-0+deb9u1~11.1.1-4 amd64 Obsolete OpenJDK Java runtime, for NVIDIA applications ii nvidia-persistenced 460.32.03-1 amd64 daemon to maintain persistent software state in the NVIDIA driver un nvidia-settings (no description available) ii nvidia-smi 460.32.03-1 amd64 NVIDIA System Management Interface ii nvidia-support 20151021+13 amd64 NVIDIA binary graphics driver support files un nvidia-tesla-418-vdpau-driver (no description available) un nvidia-tesla-418-vulkan-icd (no description available) un nvidia-tesla-440-vdpau-driver (no description available) un nvidia-tesla-440-vulkan-icd (no description available) un nvidia-tesla-450-vulkan-icd (no description available) un nvidia-tesla-alternative (no description 
available) ii nvidia-vdpau-driver:amd64 460.32.03-1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver ii nvidia-vulkan-common 460.32.03-1 amd64 NVIDIA Vulkan driver - common files ii nvidia-vulkan-icd:amd64 460.32.03-1 amd64 NVIDIA Vulkan installable client driver (ICD) un nvidia-vulkan-icd-any (no description available) ii xserver-xorg-video-nvidia 460.32.03-1 amd64 NVIDIA binary Xorg driver un xserver-xorg-video-nvidia-any (no description available) un xserver-xorg-video-nvidia-legacy-304xx (no description available)

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.3.2
build date: 2021-01-25T11:07+00:00
build revision: fa9c778f687e9ac7be52b0299fa3b6ac2d9fbf93
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [x] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
`/var/log/nvidia-container-toolkit.log` is not generated.
 - [x] Docker command, image and tag used

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi



@klueska Could you please check the issue?

elezar commented 3 years ago

@regzon thanks for indicating that this is still an issue. Could you please check what your systemd cgroup configuration is? (See, for example, this other issue which shows similar behaviour: https://github.com/docker/cli/issues/2104#issuecomment-535560873)
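
(For example, on Docker 20.10+ the daemon reports both the cgroup driver and the cgroup version; an illustrative check, not part of the original comment:)

    docker info | grep -i cgroup
    #  Cgroup Driver: systemd
    #  Cgroup Version: 2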

klueska commented 3 years ago

@regzon your issue is likely related to the fact that libnvidia-container does not support cgroups v2.

You will need to follow the suggestion in the comments above for https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-760059332 to force systemd to use v1 cgroups.

In any case, we do not officially support Debian Testing or cgroups v2 (yet).

regzon commented 3 years ago

@elezar @klueska thank you for your help. When forcing systemd not to use the unified hierarchy, everything works fine. I thought the latest libnvidia-container upgrade would resolve the issue (as it did for @super-cooper), but if the upgrade is not intended to fix the issue with cgroups v2, then everything is fine.

flixr commented 3 years ago

@klueska I'm having the same "issue", i.e. missing support for cgroups v2 (which I would very much like for other reasons). Is there already an issue for this to track?

klueska commented 3 years ago

We are not planning on building support for cgroups v2 into the existing nvidia-docker stack.

Please see my comment above for more info: https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-760189260

flixr commented 3 years ago

Let me rephrase it then: I want to use nvidia-docker on a system where cgroup v2 is enabled (systemd.unified_cgroup_hierarchy=true). Right now this is not working and this bug is closed. So is there an issue that I can track to know when I can use nvidia-docker on hosts with cgroup v2 enabled?

klueska commented 3 years ago

We have it tracked in our internal JIRA with a link to this issue as the location to report once the work is complete: https://github.com/NVIDIA/libnvidia-container/issues/111

jelmd commented 3 years ago

Facebook's oomd requires cgroup v2, i.e. systemd.unified_cgroup_hierarchy=1. So users either freeze their boxes pretty often and render them unusable, or they cannot use NVIDIA containers. Both options are crap. We will probably drop the nvidia-docker nonsense.

4n0m4l0u5 commented 3 years ago

For Debian users, you can disable the unified cgroup hierarchy by editing /etc/default/grub and adding systemd.unified_cgroup_hierarchy=0 to the GRUB_CMDLINE_LINUX_DEFAULT options, for example:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"

Then run update-grub and reboot for the changes to take effect.

It's worth noting that I also had to modify /etc/nvidia-container-runtime/config.toml to remove the '@' symbol and point it at the correct location of ldconfig for my system (Debian Unstable), e.g.:

    ldconfig = "/usr/sbin/ldconfig"

This worked for me; I hope it saves someone else some time.

Zethson commented 3 years ago

Fix on Arch:

Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After restarting docker.service, everything worked as usual.

gabrielebaris commented 3 years ago

@Zethson I also use Arch, and yesterday I followed your suggestion. It seemed to work (I was able to start the containers), but running nvidia-smi I had no access to my GPU from inside Docker. Reading the other answers in this issue, I solved it by adding systemd.unified_cgroup_hierarchy=0 to the kernel boot parameters and commenting out the no-cgroups entry in /etc/nvidia-container-runtime/config.toml again.

wernight commented 3 years ago

Arch now has cgroup v2 enabled by default, so it would be useful to plan for supporting it.

adam505hq commented 3 years ago

> Fix on Arch:
>
> Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After restarting docker.service, everything worked as usual.

Awesome this works well.

biggs commented 3 years ago

Fix on NixOS (where cgroup v2 is also now default): add systemd.enableUnifiedCgroupHierarchy = false; and restart.

prismplex commented 3 years ago

This worked for me on Manjaro Linux (Arch Linux base) without deactivating cgroup v2: create the folder docker.service.d under /etc/systemd/system and add a file override.conf in this folder with the following content:

[Service]
ExecStartPre=-/usr/bin/nvidia-modprobe -c 0 -u
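
For the override to take effect, systemd must reload its unit files before Docker restarts (standard systemd steps, not part of the original comment):

    sudo systemctl daemon-reload
    sudo systemctl restart docker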

After that, you have to add the following content to your docker-compose.yml (thank you @DanielCeregatti):

    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

Background: The nvidia-uvm and nvidia-uvm-tools device nodes did not exist under /dev for me. After running nvidia-modprobe -c 0 -u they appeared, but disappeared again after a reboot. This workaround creates them before Docker starts. Unfortunately I don't know why they don't exist by default; maybe somebody can explain. I'm currently using Linux 5.12, so maybe it has to do with this kernel version.

Edit: This workaround only works if the container using the NVIDIA device is restarted afterwards. I don't know why, but if it isn't, the container starts but cannot access the created device nodes.

Update 25.06.2021: Found out why I had to restart Jellyfin: Docker started before my disks were online. If somebody has this problem too, here is the fix: https://github.com/openmediavault/openmediavault/issues/458#issuecomment-628076472

nihil21 commented 3 years ago

> After that, you have to add the following content to your docker-compose.yml (thank you @DanielCeregatti):
>
>     devices:
>       - /dev/nvidia0:/dev/nvidia0
>       - /dev/nvidiactl:/dev/nvidiactl
>       - /dev/nvidia-modeset:/dev/nvidia-modeset
>       - /dev/nvidia-uvm:/dev/nvidia-uvm
>       - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

Hi, I'm running Manjaro and facing the same issue: when I run the container using docker run (e.g. docker run -it --gpus all --privileged -v /dev:/dev --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))") it works, but I did not manage to make it work with docker-compose up.

Could you please post a complete, working docker-compose.yml file? Thank you very much!

nihil21 commented 3 years ago

Never mind, I have just managed to make it work with docker-compose. I'll post here a minimal working example:

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;print(tf.config.list_physical_devices('GPU'))"
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    deploy:
      resources:
        reservations:
          devices:
          - capabilities: [gpu]
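
Running this with docker-compose up should then print the detected GPU, something like the following (illustrative only; the exact output format varies by TensorFlow version):

    $ docker-compose up
    ...
    test_1  | [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
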
japm48 commented 3 years ago

Minimal working example on Arch with nvidia-container-toolkit (from AUR) installed:

docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl  \
  nvidia/cuda:11.0-base nvidia-smi

Without the --device flags I get this unhelpful message: Failed to initialize NVML: Unknown Error.

Edit: also make sure you have no-cgroups = true in /etc/nvidia-container-runtime/config.toml (thanks @mpizenberg)

klueska commented 3 years ago

https://github.com/NVIDIA/nvidia-docker/issues/1549#issuecomment-939943090

mpizenberg commented 3 years ago

> Minimal working example on Arch with nvidia-container-toolkit (from AUR) installed: ... Without the --device flags I get this unhelpful message: Failed to initialize NVML: Unknown Error.

@japm48 may I ask what changes you made exactly to get that command to work? Did you also do the systemd.unified_cgroup_hierarchy=false kernel parameter change and the no-cgroups = false nvidia config change?

Without doing those, I'm on Arch with kernel 5.14.14, with version 1.5.1-1 of the aur/nvidia-container-toolkit and when running the command:

docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl  \
  nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

I get

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

Same if I change the no-cgroups config to false. I haven't tried changing my kernel parameters, though; I'd like to avoid that!

EDIT: it now works with some changes

Ok I actually got it working on my system with the following setup:

docker run --rm --gpus all \
         --device /dev/nvidia0 --device /dev/nvidia-modeset  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl \
         nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

TaridaGeorge commented 2 years ago

After setting no-cgroups to true I get this error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

OS: debian 11

ninnghazad commented 2 years ago

> After setting no-cgroups to true I get this error:
>
> NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
> Please also try adding directory that contains libnvidia-ml.so to your system PATH.
>
> OS: debian 11

https://github.com/NVIDIA/nvidia-docker/issues/1163#issuecomment-824775675

ldconfig = "/sbin/ldconfig.real"

This worked for me on Debian 11.
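
(A quick sanity check before editing, not from the original comment, is to see which ldconfig paths actually exist on your system:)

    ls -l /sbin/ldconfig /sbin/ldconfig.real 2>/dev/null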

japm48 commented 2 years ago

@mpizenberg

> @japm48 may I ask what changes you made exactly to get that command to work? Did you also do the systemd.unified_cgroup_hierarchy=false kernel parameter change and the no-cgroups = false nvidia config change?

I'm really sorry I didn't see your message.

I had no-cgroups = true in /etc/nvidia-container-runtime/config.toml, but I didn't modify the file myself. This is likely because I did a fresh install and got the patched config file; I guess you had the previous (unpatched) version installed, so it wasn't overwritten on update.

chenhengqi commented 2 years ago

> @lissyx Thank you for pointing out the crux of the issue. We are in the process of rearchitecting the NVIDIA container stack in such a way that issues like this should not exist in the future (because we will rely on runc (or whatever the configured container runtime is) to do all cgroup setup instead of doing it ourselves).
>
> That said, this rearchitecting effort will take at least another 9 months to complete. In the meantime, I'm curious what the impact is, and how difficult it would be to add cgroup v2 support to libnvidia-container to prevent issues like this until the rearchitecting is complete.

@klueska It's been 11 months, any updates on this rearchitecting :)

klueska commented 2 years ago

The rearchitecture work has been slower than we hoped, but (somewhat because of this) we have now built support for cgroup v2 in libnvidia-container and it is currently under review. We hope to have an RC out before Christmas.

Here is the MR chain:

- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/113
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/114
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/115
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/116
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/117

klueska commented 2 years ago

We now have an RC of libnvidia-container out that adds support for cgroupv2.

If you would like to try it out, make sure and add the experimental repo to your apt sources and install the latest packages:

For DEBs

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y libnvidia-container-tools libnvidia-container1

For RPMs

sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum install -y libnvidia-container-tools libnvidia-container1
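
After installing, you can confirm that the updated library is in place; it should report a 1.8.0 release candidate or later (a quick check, and the exact output format may differ between versions):

    nvidia-container-cli -V | head -n1
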
euri10 commented 2 years ago

I was previously using the systemd.unified_cgroup_hierarchy=false kernel command-line parameter on Debian Bullseye and removed it after I upgraded to 1.8.0-rc.1; every container that uses my GPU seems to be working perfectly fine so far.

thanks !

The only minor difference from the instructions above is that my repo sources file is called nvidia-docker.list and not libnvidia-container.list; not sure why.

klueska commented 2 years ago

Yes, that may be true for many users and I should have pointed that out.

We used to host packages across three different repos and recently consolidated down to just 1 (i.e. libnvidia-container). The changes were mostly transparent, but I can see how instructions for enabling the experimental repo may need to be tweaked depending on which repos you actually have configured.

Basically we used to package binaries and host packages as seen in the table below:

| Binary                   | Package                   | Repo                                      |
|--------------------------|---------------------------|-------------------------------------------|
| nvidia-docker            | nvidia-docker2            | nvidia.github.io/nvidia-docker            |
| nvidia-container-runtime | nvidia-container-runtime  | nvidia.github.io/nvidia-container-runtime |
| nvidia-container-toolkit | nvidia-container-toolkit  | nvidia.github.io/nvidia-container-runtime |
| nvidia-container-cli     | libnvidia-container-tools | nvidia.github.io/libnvidia-container      |
| libnvidia-container.so.1 | libnvidia-container1      | nvidia.github.io/libnvidia-container      |

But that changed recently to:

| Binary                   | Package                   | Repo                                 |
|--------------------------|---------------------------|--------------------------------------|
| nvidia-docker            | nvidia-docker2            | nvidia.github.io/libnvidia-container |
| nvidia-container-runtime | nvidia-container-toolkit  | nvidia.github.io/libnvidia-container |
| nvidia-container-toolkit | nvidia-container-toolkit  | nvidia.github.io/libnvidia-container |
| nvidia-container-cli     | libnvidia-container-tools | nvidia.github.io/libnvidia-container |
| libnvidia-container.so.1 | libnvidia-container1      | nvidia.github.io/libnvidia-container |

So nowadays all you actually need is libnvidia-container.list to get access to all of the new packages, but if you have nvidia-docker.list that is still OK, because it also contains entries for all of the repos listed in libnvidia-container.list (it just contains entries for more -- now unnecessary -- repos as well).
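
(An illustrative way to see which of these list files a given machine actually has, not part of the original comment:)

    ls /etc/apt/sources.list.d/ | grep -i nvidia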

alexcpn commented 2 years ago

Pop OS did not have this problem until now, but with the latest update (21.1) I got this error.

Setting no-cgroups = true in /etc/nvidia-container-runtime/config.toml let the container start, but TensorFlow reported zero GPUs. I also tried with Podman:

sudo podman run  -e NVIDIA_VISIBLE_DEVICES=0   -it --network host -v /home/alex/coding:/tf/notebooks docker.io/tensorflow/tensorflow:latest-gpu-jupyter

but TensorFlow still reported zero GPUs.

But switching to cgroups v1 via the kernel parameter worked. For anyone else using Pop OS, which uses systemd-boot instead of GRUB, the commands below may help:

sudo kernelstub -a "systemd.unified_cgroup_hierarchy=0"
sudo update-initramfs -c -k all
reboot
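
(Side note, not in the original comment: kernelstub's -d flag deletes a kernel option again, so the change can be undone once cgroup v2 support lands; verify the flag with kernelstub --help.)

    sudo kernelstub -d "systemd.unified_cgroup_hierarchy=0"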

The error string before this was:

xx@pop-os:~/coding/cnn_1/cnn_py$ docker run  --gpus device=0  -it --network host -v /home/alex/coding:/tf/notebooks tensorflow/tensorflow:latest-gpu-jupyter
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled 
alex@pop-os:~/coding/cnn_1/cnn_py$ apt list  nvidia-container-toolkit 
Listing... Done
nvidia-container-toolkit/now 1.5.1-1pop1~1627998766~21.04~9847cf2 amd64 [installed,local]

After the restart it worked (screenshot in the original comment).

More details: https://medium.com/nttlabs/cgroup-v2-596d035be4d7

muhark commented 2 years ago

For those who are here after upgrading to Ubuntu 21.10 (not supported): using the experimental version of the 18.04 repo and reinstalling libnvidia-container-tools and libnvidia-container1 works. (Don't forget to restart Docker afterwards.)

Thank you for all your amazing work!

ljburtz commented 2 years ago

@muhark I have the same issue on 21.10. Which versions did you install? Which commands did you use for the reinstall/experimental version that you mention fixed it? Any help much appreciated!

muhark commented 2 years ago

@ljburtz, I used the command by @klueska above:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list

Then my /etc/apt/sources.list.d/nvidia-docker.list looks like the following:

#deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /

And finally I reinstalled (since I'd already been bungling the existing installations).

sudo apt-get update
sudo apt-get install --reinstall libnvidia-container-tools libnvidia-container1

and then tested with:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

FWIW, I am using a GTX 3070.

ljburtz commented 2 years ago

Fantastic, this works on a GTX 3060 / Ubuntu 21.10. Major thanks for replying so fast, @muhark.

Frikster commented 2 years ago

If you need cgroups active, so you cannot set no-cgroups = true, and you're on Pop OS 21.10: as per this explanation, this one command fixed the issue for me while keeping cgroups on:

sudo kernelstub -a systemd.unified_cgroup_hierarchy=0

I then had to reboot and the issue is gone.

klueska commented 2 years ago

libnvidia-container-1.8.0-rc.2 is now live with some minor updates to fix some edge cases around cgroupv2 support.

Please see https://github.com/NVIDIA/libnvidia-container/issues/111#issuecomment-989024375 for instructions on how to get access to this RC (or wait for the full release at the end of next week).

Note: This does not directly add Debian Testing support, but you can point to the debian10 repo and install from there for now.
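
One way to do that, sketched from the workaround just described (the URL follows the repo layout shown earlier in this thread, so treat it as an assumption):

    distribution=debian10
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
      | sudo tee /etc/apt/sources.list.d/libnvidia-container.list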

jbcpollak commented 2 years ago

This may be useful for Ubuntu users running into this issue:

> So nowadays all you actually need is libnvidia-container.list to get access to all of the new packages, but if you have nvidia-docker.list that is still OK, because it also contains entries for all of the repos listed in libnvidia-container.list (it just contains entries for more -- now unnecessary -- repos as well).

@klueska, I just wanted to mention that when I go to the following URLs:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list

I get a valid apt list in response.

But if I visit:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/libnvidia-container.list

I get:

    # Unsupported distribution!
    # Check https://nvidia.github.io/nvidia-docker

It appears the list has been moved back to the original filename?

klueska commented 2 years ago

These:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/libnvidia-container.list

Should be:

https://nvidia.github.io/libnvidia-container/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/libnvidia-container/ubuntu20.04/libnvidia-container.list

jbcpollak commented 2 years ago

Ah, :facepalm:, much appreciated. Thanks for making it explicit.

klueska commented 2 years ago

libnvidia-container-1.8.0 with cgroupv2 support is now GA

Release notes here: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0

klueska commented 2 years ago

Debian 11 support has now been added, so running the following should work as expected:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
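
After adding the repo, the usual install-and-test sequence applies (standard setup steps for this stack, assuming Docker itself is already installed):

    sudo apt-get update
    sudo apt-get install -y nvidia-docker2
    sudo systemctl restart docker
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi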