NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

cgroup issue with nvidia container runtime on Debian testing #1447

Closed: super-cooper closed this issue 3 years ago

super-cooper commented 3 years ago

1. Issue or feature description

Whenever I try to build or run an NVIDIA container, Docker fails with the error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

2. Steps to reproduce the issue

$ docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

Device Index: 0
Device Minor: 0
Model: GeForce GTX 980 Ti
Brand: GeForce
GPU UUID: GPU-6518be5e-14ff-e277-21aa-73b482890bee
Bus Location: 00000000:07:00.0
Architecture: 5.2
I0107 20:43:11.947903 36435 nvc.c:337] shutting down library context
I0107 20:43:11.948696 36437 driver.c:156] terminating driver service
I0107 20:43:11.949026 36435 driver.c:196] driver service terminated successfully

 - [x] Kernel version from `uname -a`

Linux lambda 5.8.0-3-amd64 #1 SMP Debian 5.8.14-1 (2020-10-10) x86_64 GNU/Linux

 - [ ] Any relevant kernel output lines from `dmesg`
 - [x] Driver information from `nvidia-smi -a`

Thu Jan 7 15:45:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  On   | 00000000:07:00.0  On |                  N/A |
|  0%   45C    P5    29W / 250W |    403MiB /  6083MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3023      G   /usr/lib/xorg/Xorg                177MiB |
|    0   N/A  N/A      4833      G   /usr/bin/gnome-shell              166MiB |
|    0   N/A  N/A      7609      G   ...AAAAAAAAA= --shared-files       54MiB |
+-----------------------------------------------------------------------------+

 - [x] Docker version from `docker version`

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 nvidia:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-======================================-==============-============-================================================================= un bumblebee-nvidia (no description available) ii glx-alternative-nvidia 1.2.0 amd64 allows the selection of NVIDIA as GLX provider un libegl-nvidia-legacy-390xx0 (no description available) un libegl-nvidia-tesla-418-0 (no description available) un libegl-nvidia-tesla-440-0 (no description available) un libegl-nvidia-tesla-450-0 (no description available) ii libegl-nvidia0:amd64 450.80.02-2 amd64 NVIDIA binary EGL library ii libegl-nvidia0:i386 450.80.02-2 i386 NVIDIA binary EGL library un libegl1-glvnd-nvidia (no description available) un libegl1-nvidia (no description available) un libgl1-glvnd-nvidia-glx (no description available) ii libgl1-nvidia-glvnd-glx:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant) ii libgl1-nvidia-glvnd-glx:i386 450.80.02-2 i386 NVIDIA binary OpenGL/GLX library (GLVND variant) un libgl1-nvidia-glx (no description available) un libgl1-nvidia-glx-any (no description available) un libgl1-nvidia-glx-i386 (no description available) un libgl1-nvidia-legacy-390xx-glx (no description available) un libgl1-nvidia-tesla-418-glx (no description available) un libgldispatch0-nvidia (no description available) ii libgles-nvidia1:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL|ES 1.x library ii libgles-nvidia1:i386 450.80.02-2 i386 NVIDIA binary OpenGL|ES 1.x library ii libgles-nvidia2:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL|ES 2.x library ii libgles-nvidia2:i386 450.80.02-2 i386 NVIDIA binary OpenGL|ES 2.x library un libgles1-glvnd-nvidia (no description available) un libgles2-glvnd-nvidia (no description available) un libglvnd0-nvidia (no description available) ii libglx-nvidia0:amd64 450.80.02-2 amd64 NVIDIA binary GLX library ii libglx-nvidia0:i386 450.80.02-2 i386 NVIDIA binary GLX library un libglx0-glvnd-nvidia (no description available) un libnvidia-cbl (no description available) un libnvidia-cfg.so.1 (no description available) ii libnvidia-cfg1:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) ii libnvidia-container-tools 1.3.1-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.3.1-1 amd64 NVIDIA container runtime library ii libnvidia-eglcore:amd64 450.80.02-2 amd64 NVIDIA binary EGL core libraries ii libnvidia-eglcore:i386 450.80.02-2 i386 NVIDIA binary EGL core libraries un libnvidia-eglcore-450.80.02 (no description available) ii libnvidia-encode1:amd64 450.80.02-2 amd64 NVENC Video Encoding runtime library ii libnvidia-glcore:amd64 450.80.02-2 amd64 NVIDIA binary OpenGL/GLX core libraries ii libnvidia-glcore:i386 450.80.02-2 i386 NVIDIA binary OpenGL/GLX core libraries un libnvidia-glcore-450.80.02 (no description available) ii libnvidia-glvkspirv:amd64 450.80.02-2 amd64 NVIDIA binary Vulkan Spir-V compiler library ii libnvidia-glvkspirv:i386 450.80.02-2 i386 NVIDIA binary Vulkan Spir-V compiler library un libnvidia-glvkspirv-450.80.02 (no description available) un libnvidia-legacy-340xx-cfg1 (no description available) un libnvidia-legacy-390xx-cfg1 (no description available) ii libnvidia-ml-dev:amd64 11.1.1-3 amd64 NVIDIA Management Library (NVML) development files un 
libnvidia-ml.so.1 (no description available) ii libnvidia-ml1:amd64 450.80.02-2 amd64 NVIDIA Management Library (NVML) runtime library ii libnvidia-ptxjitcompiler1:amd64 450.80.02-2 amd64 NVIDIA PTX JIT Compiler ii libnvidia-rtcore:amd64 450.80.02-2 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library un libnvidia-rtcore-450.80.02 (no description available) un libnvidia-tesla-418-cfg1 (no description available) un libnvidia-tesla-440-cfg1 (no description available) un libnvidia-tesla-450-cfg1 (no description available) un libnvidia-tesla-450-cuda1 (no description available) un libnvidia-tesla-450-ml1 (no description available) un libopengl0-glvnd-nvidia (no description available) ii nvidia-alternative 450.80.02-2 amd64 allows the selection of NVIDIA as GLX provider un nvidia-alternative--kmod-alias (no description available) un nvidia-alternative-legacy-173xx (no description available) un nvidia-alternative-legacy-71xx (no description available) un nvidia-alternative-legacy-96xx (no description available) ii nvidia-container-runtime 3.4.0-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.4.0-1 amd64 NVIDIA container runtime hook ii nvidia-cuda-dev:amd64 11.1.1-3 amd64 NVIDIA CUDA development files un nvidia-cuda-doc (no description available) ii nvidia-cuda-gdb 11.1.1-3 amd64 NVIDIA CUDA Debugger (GDB) un nvidia-cuda-mps (no description available) ii nvidia-cuda-toolkit 11.1.1-3 amd64 NVIDIA CUDA development toolkit ii nvidia-cuda-toolkit-doc 11.1.1-3 all NVIDIA CUDA and OpenCL documentation un nvidia-current (no description available) un nvidia-current-updates (no description available) un nvidia-docker (no description available) ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper ii nvidia-driver 450.80.02-2 amd64 NVIDIA metapackage un nvidia-driver-any (no description available) ii nvidia-driver-bin 450.80.02-2 amd64 NVIDIA driver support binaries un nvidia-driver-bin-450.80.02 (no description available) un nvidia-driver-binary (no description available) ii nvidia-driver-libs:amd64 450.80.02-2 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) ii nvidia-driver-libs:i386 450.80.02-2 i386 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) un nvidia-driver-libs-any (no description available) un nvidia-driver-libs-nonglvnd (no description available) ii nvidia-egl-common 450.80.02-2 amd64 NVIDIA binary EGL driver - common files ii nvidia-egl-icd:amd64 450.80.02-2 amd64 NVIDIA EGL installable client driver (ICD) ii nvidia-egl-icd:i386 450.80.02-2 i386 NVIDIA EGL installable client driver (ICD) un nvidia-glx-any (no description available) ii nvidia-installer-cleanup 20151021+12 amd64 cleanup after driver installation with the nvidia-installer un nvidia-kernel-450.80.02 (no description available) ii nvidia-kernel-common 20151021+12 amd64 NVIDIA binary kernel module support files ii nvidia-kernel-dkms 450.80.02-2 amd64 NVIDIA binary kernel module DKMS source un nvidia-kernel-source (no description available) ii nvidia-kernel-support 450.80.02-2 amd64 NVIDIA binary kernel module support files un nvidia-kernel-support--v1 (no description available) un nvidia-kernel-support-any (no description available) un nvidia-legacy-304xx-alternative (no description available) un nvidia-legacy-304xx-driver (no description available) un nvidia-legacy-340xx-alternative (no description available) un nvidia-legacy-340xx-vdpau-driver (no description available) un nvidia-legacy-390xx-vdpau-driver (no description available) un 
nvidia-legacy-390xx-vulkan-icd (no description available) ii nvidia-legacy-check 450.80.02-2 amd64 check for NVIDIA GPUs requiring a legacy driver un nvidia-libopencl1 (no description available) un nvidia-libopencl1-dev (no description available) ii nvidia-modprobe 460.27.04-1 amd64 utility to load NVIDIA kernel modules and create device nodes un nvidia-nonglvnd-vulkan-common (no description available) un nvidia-nonglvnd-vulkan-icd (no description available) un nvidia-opencl-dev (no description available) un nvidia-opencl-icd (no description available) un nvidia-openjdk-8-jre (no description available) ii nvidia-persistenced 450.57-1 amd64 daemon to maintain persistent software state in the NVIDIA driver ii nvidia-profiler 11.1.1-3 amd64 NVIDIA Profiler for CUDA and OpenCL ii nvidia-settings 450.80.02-1+b1 amd64 tool for configuring the NVIDIA graphics driver un nvidia-settings-gtk-450.80.02 (no description available) ii nvidia-smi 450.80.02-2 amd64 NVIDIA System Management Interface ii nvidia-support 20151021+12 amd64 NVIDIA binary graphics driver support files un nvidia-tesla-418-vdpau-driver (no description available) un nvidia-tesla-418-vulkan-icd (no description available) un nvidia-tesla-440-vdpau-driver (no description available) un nvidia-tesla-440-vulkan-icd (no description available) un nvidia-tesla-450-driver (no description available) un nvidia-tesla-450-vulkan-icd (no description available) un nvidia-tesla-alternative (no description available) ii nvidia-vdpau-driver:amd64 450.80.02-2 amd64 Video Decode and Presentation API for Unix - NVIDIA driver ii nvidia-visual-profiler 11.1.1-3 amd64 NVIDIA Visual Profiler for CUDA and OpenCL ii nvidia-vulkan-common 450.80.02-2 amd64 NVIDIA Vulkan driver - common files ii nvidia-vulkan-icd:amd64 450.80.02-2 amd64 NVIDIA Vulkan installable client driver (ICD) ii nvidia-vulkan-icd:i386 450.80.02-2 i386 NVIDIA Vulkan installable client driver (ICD) un nvidia-vulkan-icd-any (no description available) ii xserver-xorg-video-nvidia 450.80.02-2 amd64 NVIDIA binary Xorg driver un xserver-xorg-video-nvidia-any (no description available) un xserver-xorg-video-nvidia-legacy-304xx (no description available)

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.3.1
build date: 2020-12-14T14:18+00:00
build revision: ac02636a318fe7dcc71eaeb3cc55d0c8541c1072
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
 - [x] Docker command, image and tag used

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

DanielCeregatti commented 3 years ago

Hi,

I'm experiencing the same issue. For now I've worked around it:

In /etc/nvidia-container-runtime/config.toml I've set no-cgroups = true, and now the container starts, but the NVIDIA devices are not added to the container. Once the devices are added, the container works again.
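
For reference, that setting lives in the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml, and the change described above would look something like this:

    [nvidia-container-cli]
    no-cgroups = true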

Here are the relevant lines from my docker-compose.yml:

    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

This is equivalent to passing --device flags to docker run, but I'm not sure of the exact syntax (see the sketch below).
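
For reference, the docker run equivalent would look something like this (the device mappings simply mirror the compose entries above):

    docker run --rm --gpus all \
      --device /dev/nvidia0:/dev/nvidia0 \
      --device /dev/nvidiactl:/dev/nvidiactl \
      --device /dev/nvidia-modeset:/dev/nvidia-modeset \
      --device /dev/nvidia-uvm:/dev/nvidia-uvm \
      --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools \
      nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi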

Hope this helps.

lissyx commented 3 years ago

This seems to be related to systemd upgrade to 247.2-2 which was uploaded to sid three weeks ago and made its way to testing now. This commit highlights the change of cgroup hierarchy: https://salsa.debian.org/systemd-team/systemd/-/commit/170fb124a32884bd9975ee4ea9e1ffbbc2ee26b4

Indeed, the default setup no longer exposes /sys/fs/cgroup/devices, which libnvidia-container uses, per https://github.com/NVIDIA/libnvidia-container/blob/ac02636a318fe7dcc71eaeb3cc55d0c8541c1072/src/nvc_container.c#L379-L382

Using the documented systemd.unified_cgroup_hierarchy=false kernel command-line parameter brings back the /sys/fs/cgroup/devices entry, and libnvidia-container is happy again.
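
(Not from the original comment: a quick way to check which hierarchy is active is to query the filesystem type of /sys/fs/cgroup; cgroup2fs indicates the unified v2 hierarchy, tmpfs the legacy v1 layout.)

    stat -fc %T /sys/fs/cgroup/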

klueska commented 3 years ago

@lissyx Thank you for pointing out the crux of the issue. We are in the process of rearchitecting the NVIDIA container stack in such a way that issues like this should not exist in the future (because we will rely on runc (or whatever the configured container runtime is) to do all cgroup setup instead of doing it ourselves).

That said, this rearchitecting effort will take at least another 9 months to complete. In the meantime, I'm curious what the impact is, and how difficult it would be to add cgroup v2 support to libnvidia-container to prevent issues like this until the rearchitecting is complete.

seemethere commented 3 years ago

Wanted to also chime in to say that I'm also experiencing this on Fedora 33

mathstuf commented 3 years ago

Could the title be updated to indicate that it is systemd cgroup layout related?

klueska commented 3 years ago

I was under the impression this issue was related to adding cgroup v2 support.

The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49

And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2

If these resolve this issue, please comment and close. Thanks.

super-cooper commented 3 years ago

> I was under the impression this issue was related to adding cgroup v2 support.
>
> The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49
>
> And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2
>
> If these resolve this issue, please comment and close. Thanks.

Issue resolved by the latest release. Thank you everyone <3

regzon commented 3 years ago

> > I was under the impression this issue was related to adding cgroup v2 support. The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49 And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2 If these resolve this issue, please comment and close. Thanks.
>
> Issue resolved by the latest release. Thank you everyone <3

Did you set the following parameter: systemd.unified_cgroup_hierarchy=false?

Or did you just upgrade all the packages?

super-cooper commented 3 years ago

> > I was under the impression this issue was related to adding cgroup v2 support. The systemd cgroup layout issue was resolved in: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/49 And released today as part of libnvidia-container v1.3.2: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.3.2 If these resolve this issue, please comment and close. Thanks.
> >
> > Issue resolved by the latest release. Thank you everyone <3
>
> Did you set the following parameter: systemd.unified_cgroup_hierarchy=false?
>
> Or did you just upgrade all the packages?

For me it was solved by upgrading the package.

regzon commented 3 years ago

Thank you, @super-cooper, for the reply.

I am having exactly the same issue on Debian Testing even after an upgrade.

1. Issue or feature description

docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

2. Steps to reproduce the issue

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi

3. Information to attach (optional if deemed irrelevant)

Device Index: 0
Device Minor: 0
Model: GeForce GTX 960M
Brand: GeForce
GPU UUID: GPU-6064a007-a943-7f11-1ad7-12ac87046652
Bus Location: 00000000:01:00.0
Architecture: 5.0
I0130 05:23:50.516775 4486 nvc.c:337] shutting down library context
I0130 05:23:50.517704 4488 driver.c:156] terminating driver service
I0130 05:23:50.518087 4486 driver.c:196] driver service terminated successfully

 - [x] Kernel version from `uname -a`

Linux stas 5.10.0-2-amd64 #1 SMP Debian 5.10.9-1 (2021-01-20) x86_64 GNU/Linux

 - [x] Any relevant kernel output lines from `dmesg`

[ 487.597570] docker0: port 1(vethb7a49e6) entered blocking state
[ 487.597573] docker0: port 1(vethb7a49e6) entered disabled state
[ 487.597786] device vethb7a49e6 entered promiscuous mode
[ 487.773120] docker0: port 1(vethb7a49e6) entered disabled state
[ 487.776548] device vethb7a49e6 left promiscuous mode
[ 487.776556] docker0: port 1(vethb7a49e6) entered disabled state

 - [x] Driver information from `nvidia-smi -a`

Timestamp       : Sat Jan 30 08:26:51 2021
Driver Version  : 460.32.03
CUDA Version    : 11.2

Attached GPUs : 1 GPU 00000000:01:00.0 Product Name : GeForce GTX 960M Product Brand : GeForce Display Mode : Disabled Display Active : Disabled Persistence Mode : Enabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-6064a007-a943-7f11-1ad7-12ac87046652 Minor Number : 0 VBIOS Version : 82.07.82.00.10 MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Inforom Version Image Version : N/A OEM Object : N/A ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x139B10DE Bus Id : 00000000:01:00.0 Sub System Id : 0x380217AA GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : N/A Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : N/A HW Power Brake Slowdown : N/A Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 4046 MiB Used : 4 MiB Free : 4042 MiB BAR1 Memory Usage Total : 256 MiB Used : 1 MiB Free : 255 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 33 C GPU Shutdown Temp : 101 C GPU Slowdown Temp : 96 C GPU Max Operating Temp : 92 C GPU Target Temperature : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : N/A Power Draw : N/A Power Limit : N/A Default Power Limit : N/A Enforced Power Limit : N/A Min Power Limit : N/A Max Power Limit : N/A Clocks Graphics : 135 MHz SM : 135 MHz Memory : 405 MHz Video : 405 MHz Applications Clocks Graphics : 1097 MHz Memory : 2505 MHz Default Applications Clocks Graphics : 1097 MHz Memory : 2505 MHz Max Clocks Graphics : 1202 MHz SM : 1202 MHz Memory : 2505 MHz Video : 1081 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes GPU instance ID : N/A Compute instance ID : N/A Process ID : 1351 Type : G Name : /usr/lib/xorg/Xorg Used GPU Memory : 2 MiB

 - [x] Docker version from `docker version`

Client: Docker Engine - Community
 Version:           20.10.2
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        2291f61
 Built:             Mon Dec 28 16:17:34 2020
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.2
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8891c58
  Built:            Mon Dec 28 16:15:28 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-======================================-==============================-============-================================================================= un bumblebee-nvidia (no description available) ii glx-alternative-nvidia 1.2.0 amd64 allows the selection of NVIDIA as GLX provider un libegl-nvidia-legacy-390xx0 (no description available) un libegl-nvidia-tesla-418-0 (no description available) un libegl-nvidia-tesla-440-0 (no description available) un libegl-nvidia-tesla-450-0 (no description available) ii libegl-nvidia0:amd64 460.32.03-1 amd64 NVIDIA binary EGL library un libegl1-glvnd-nvidia (no description available) un libegl1-nvidia (no description available) un libgl1-glvnd-nvidia-glx (no description available) ii libgl1-nvidia-glvnd-glx:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant) un libgl1-nvidia-glx (no description available) un libgl1-nvidia-glx-any (no description available) un libgl1-nvidia-glx-i386 (no description available) un libgl1-nvidia-legacy-390xx-glx (no description available) un libgl1-nvidia-tesla-418-glx (no description available) un libgldispatch0-nvidia (no description available) ii libgles-nvidia1:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL|ES 1.x library ii libgles-nvidia2:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL|ES 2.x library un libgles1-glvnd-nvidia (no description available) un libgles2-glvnd-nvidia (no description available) un libglvnd0-nvidia (no description available) ii libglx-nvidia0:amd64 460.32.03-1 amd64 NVIDIA binary GLX library un libglx0-glvnd-nvidia (no description available) ii libnvidia-cbl:amd64 460.32.03-1 amd64 NVIDIA binary Vulkan ray tracing (cbl) library un libnvidia-cbl-460.32.03 (no description available) un libnvidia-cfg.so.1 (no description available) ii libnvidia-cfg1:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL/GLX configuration library un libnvidia-cfg1-any (no description available) ii libnvidia-container-tools 1.3.2-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.3.2-1 amd64 NVIDIA container runtime library ii libnvidia-eglcore:amd64 460.32.03-1 amd64 NVIDIA binary EGL core libraries un libnvidia-eglcore-460.32.03 (no description available) ii libnvidia-glcore:amd64 460.32.03-1 amd64 NVIDIA binary OpenGL/GLX core libraries un libnvidia-glcore-460.32.03 (no description available) ii libnvidia-glvkspirv:amd64 460.32.03-1 amd64 NVIDIA binary Vulkan Spir-V compiler library un libnvidia-glvkspirv-460.32.03 (no description available) un libnvidia-legacy-340xx-cfg1 (no description available) un libnvidia-legacy-390xx-cfg1 (no description available) un libnvidia-ml.so.1 (no description available) ii libnvidia-ml1:amd64 460.32.03-1 amd64 NVIDIA Management Library (NVML) runtime library ii libnvidia-ptxjitcompiler1:amd64 460.32.03-1 amd64 NVIDIA PTX JIT Compiler ii libnvidia-rtcore:amd64 460.32.03-1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library un libnvidia-rtcore-460.32.03 (no description available) un libnvidia-tesla-418-cfg1 (no description available) un libnvidia-tesla-440-cfg1 (no description available) un libnvidia-tesla-450-cfg1 (no description available) un libopengl0-glvnd-nvidia (no description available) ii nvidia-alternative 460.32.03-1 amd64 allows the selection of NVIDIA as GLX provider un 
nvidia-alternative--kmod-alias (no description available) un nvidia-alternative-legacy-173xx (no description available) un nvidia-alternative-legacy-71xx (no description available) un nvidia-alternative-legacy-96xx (no description available) ii nvidia-container-runtime 3.4.1-1 amd64 NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.4.1-1 amd64 NVIDIA container runtime hook un nvidia-cuda-mps (no description available) un nvidia-current (no description available) un nvidia-current-updates (no description available) ii nvidia-detect 460.32.03-1 amd64 NVIDIA GPU detection utility un nvidia-docker (no description available) ii nvidia-docker2 2.5.0-1 all nvidia-docker CLI wrapper ii nvidia-driver 460.32.03-1 amd64 NVIDIA metapackage un nvidia-driver-any (no description available) ii nvidia-driver-bin 460.32.03-1 amd64 NVIDIA driver support binaries un nvidia-driver-bin-460.32.03 (no description available) un nvidia-driver-binary (no description available) ii nvidia-driver-libs:amd64 460.32.03-1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries) un nvidia-driver-libs-any (no description available) un nvidia-driver-libs-nonglvnd (no description available) ii nvidia-egl-common 460.32.03-1 amd64 NVIDIA binary EGL driver - common files ii nvidia-egl-icd:amd64 460.32.03-1 amd64 NVIDIA EGL installable client driver (ICD) un nvidia-glx-any (no description available) ii nvidia-installer-cleanup 20151021+13 amd64 cleanup after driver installation with the nvidia-installer un nvidia-kernel-460.32.03 (no description available) ii nvidia-kernel-common 20151021+13 amd64 NVIDIA binary kernel module support files ii nvidia-kernel-dkms 460.32.03-1 amd64 NVIDIA binary kernel module DKMS source un nvidia-kernel-source (no description available) ii nvidia-kernel-support 460.32.03-1 amd64 NVIDIA binary kernel module support files un nvidia-kernel-support--v1 (no description available) un nvidia-kernel-support-any (no description available) un nvidia-legacy-304xx-alternative (no description available) un nvidia-legacy-304xx-driver (no description available) un nvidia-legacy-340xx-alternative (no description available) un nvidia-legacy-340xx-vdpau-driver (no description available) un nvidia-legacy-390xx-vdpau-driver (no description available) un nvidia-legacy-390xx-vulkan-icd (no description available) ii nvidia-legacy-check 460.32.03-1 amd64 check for NVIDIA GPUs requiring a legacy driver un nvidia-libopencl1-dev (no description available) ii nvidia-modprobe 460.32.03-1 amd64 utility to load NVIDIA kernel modules and create device nodes un nvidia-nonglvnd-vulkan-common (no description available) un nvidia-nonglvnd-vulkan-icd (no description available) un nvidia-opencl-icd (no description available) ii nvidia-openjdk-8-jre 9.+8u272-b10-0+deb9u1~11.1.1-4 amd64 Obsolete OpenJDK Java runtime, for NVIDIA applications ii nvidia-persistenced 460.32.03-1 amd64 daemon to maintain persistent software state in the NVIDIA driver un nvidia-settings (no description available) ii nvidia-smi 460.32.03-1 amd64 NVIDIA System Management Interface ii nvidia-support 20151021+13 amd64 NVIDIA binary graphics driver support files un nvidia-tesla-418-vdpau-driver (no description available) un nvidia-tesla-418-vulkan-icd (no description available) un nvidia-tesla-440-vdpau-driver (no description available) un nvidia-tesla-440-vulkan-icd (no description available) un nvidia-tesla-450-vulkan-icd (no description available) un nvidia-tesla-alternative (no description 
available) ii nvidia-vdpau-driver:amd64 460.32.03-1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver ii nvidia-vulkan-common 460.32.03-1 amd64 NVIDIA Vulkan driver - common files ii nvidia-vulkan-icd:amd64 460.32.03-1 amd64 NVIDIA Vulkan installable client driver (ICD) un nvidia-vulkan-icd-any (no description available) ii xserver-xorg-video-nvidia 460.32.03-1 amd64 NVIDIA binary Xorg driver un xserver-xorg-video-nvidia-any (no description available) un xserver-xorg-video-nvidia-legacy-304xx (no description available)

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

version: 1.3.2
build date: 2021-01-25T11:07+00:00
build revision: fa9c778f687e9ac7be52b0299fa3b6ac2d9fbf93
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [x] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
`/var/log/nvidia-container-toolkit.log` is not generated.
 - [x] Docker command, image and tag used

docker run --rm --gpus all nvidia/cuda:11.0-base-ubuntu20.04 nvidia-smi



@klueska Could you please check the issue?

elezar commented 3 years ago

@regzon thanks for indicating that this is still an issue. Could you please check what your systemd cgroup configuration is? (See, for example, this other issue which shows similar behaviour: https://github.com/docker/cli/issues/2104#issuecomment-535560873)
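
(For example, on Docker 20.10+ the daemon reports both the cgroup driver and the cgroup version; an illustrative check, not part of the original comment:)

    docker info | grep -i cgroup
    #  Cgroup Driver: systemd
    #  Cgroup Version: 2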

klueska commented 3 years ago

@regzon your issue is likely related to the fact that libnvidia-container does not support cgroups v2.

You will need to follow the suggestion in the comments above for https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-760059332 to force systemd to use v1 cgroups.

In any case, we do not officially support Debian Testing or cgroups v2 (yet).

regzon commented 3 years ago

@elezar @klueska thank you for your help. When forcing systemd not to use the unified hierarchy, everything works fine. I thought the latest libnvidia-container upgrade would resolve the issue (as it did for @super-cooper), but if the upgrade is not intended to fix the issue with cgroups v2, then everything is fine.

flixr commented 3 years ago

@klueska I'm having the same "issue", i.e. missing support for cgroups v2 (which I would very much like for other reasons). Is there already an issue for this to track?

klueska commented 3 years ago

We are not planning on building support for cgroups v2 into the existing nvidia-docker stack.

Please see my comment above for more info: https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-760189260

flixr commented 3 years ago

Let me rephrase it then: I want to use nvidia-docker on a system where cgroup v2 is enabled (systemd.unified_cgroup_hierarchy=true). Right now this is not working and this bug is closed. So is there an issue that I can track to know when I can use nvidia-docker on hosts with cgroup v2 enabled?

klueska commented 3 years ago

We have it tracked in our internal JIRA with a link to this issue as the location to report once the work is complete: https://github.com/NVIDIA/libnvidia-container/issues/111

jelmd commented 3 years ago

Facebook's oomd requires cgroup v2, i.e. systemd.unified_cgroup_hierarchy=1. So users either freeze their boxes pretty often and render them unusable, or they cannot use NVIDIA containers. Both options are crap. We will probably drop the nvidia-docker nonsense.

4n0m4l0u5 commented 3 years ago

For Debian users, you can disable the unified cgroup hierarchy by editing /etc/default/grub and adding systemd.unified_cgroup_hierarchy=0 to the GRUB_CMDLINE_LINUX_DEFAULT options, for example:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"

Then run update-grub and reboot for the changes to take effect.

It's worth noting that I also had to modify /etc/nvidia-container-runtime/config.toml to remove the '@' symbol and point it at the correct location of ldconfig for my system (Debian Unstable), e.g.:

    ldconfig = "/usr/sbin/ldconfig"

This worked for me; I hope it saves someone else some time.

Zethson commented 3 years ago

Fix on Arch:

Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After restarting docker.service, everything worked as usual.

gabrielebaris commented 3 years ago

@Zethson I also use Arch, and yesterday I followed your suggestion. It seemed to work (I was able to start the containers), but running nvidia-smi I had no access to my GPU from inside Docker. Reading the other answers in this issue, I solved it by adding systemd.unified_cgroup_hierarchy=0 to the kernel boot parameters and commenting out the no-cgroups entry in /etc/nvidia-container-runtime/config.toml again.

wernight commented 3 years ago

Arch now has cgroup v2 enabled by default, so it would be useful to plan for supporting it.

adam505hq commented 3 years ago

> Fix on Arch:
>
> Edit /etc/nvidia-container-runtime/config.toml and change #no-cgroups=false to no-cgroups=true. After restarting docker.service, everything worked as usual.

Awesome this works well.

biggs commented 3 years ago

Fix on NixOS (where cgroup v2 is also now default): add systemd.enableUnifiedCgroupHierarchy = false; and restart.

prismplex commented 3 years ago

This worked for me on Manjaro Linux (Arch Linux base) without deactivating cgroup v2: create the folder docker.service.d under /etc/systemd/system and add a file override.conf in this folder with the following content:

[Service]
ExecStartPre=-/usr/bin/nvidia-modprobe -c 0 -u
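
For the override to take effect, systemd must reload its unit files before Docker restarts (standard systemd steps, not part of the original comment):

    sudo systemctl daemon-reload
    sudo systemctl restart docker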

After that, you have to add the following content to your docker-compose.yml (thank you @DanielCeregatti):

    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

Background: The nvidia-uvm and nvidia-uvm-tools device nodes did not exist under /dev for me. After running nvidia-modprobe -c 0 -u they appeared, but disappeared again after a reboot. This workaround creates them before Docker starts. Unfortunately I don't know why they don't exist by default; maybe somebody can explain. I'm currently using Linux 5.12, so maybe it has to do with this kernel version.

Edit: This workaround only works if the container using the NVIDIA device is restarted afterwards. I don't know why, but if it isn't, the container starts but cannot access the created device nodes.

Update 25.06.2021: Found out why I had to restart Jellyfin: Docker started before my disks were online. If somebody has this problem too, here is the fix: https://github.com/openmediavault/openmediavault/issues/458#issuecomment-628076472

nihil21 commented 3 years ago

> After that, you have to add the following content to your docker-compose.yml (thank you @DanielCeregatti):
>
>     devices:
>       - /dev/nvidia0:/dev/nvidia0
>       - /dev/nvidiactl:/dev/nvidiactl
>       - /dev/nvidia-modeset:/dev/nvidia-modeset
>       - /dev/nvidia-uvm:/dev/nvidia-uvm
>       - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

Hi, I'm running Manjaro and facing the same issue: when I run the container using docker run (e.g. docker run -it --gpus all --privileged -v /dev:/dev --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))") it works, but I did not manage to make it work with docker-compose up.

Could you please post a complete, working docker-compose.yml file? Thank you very much!

nihil21 commented 3 years ago

Never mind, I have just managed to make it work with docker-compose. I'll post here a minimal working example:

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;print(tf.config.list_physical_devices('GPU'))"
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    deploy:
      resources:
        reservations:
          devices:
          - capabilities: [gpu]
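
Running this with docker-compose up should then print the detected GPU, something like the following (illustrative only; the exact output format varies by TensorFlow version):

    $ docker-compose up
    ...
    test_1  | [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
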
japm48 commented 3 years ago

Minimal working example on Arch with nvidia-container-toolkit (from AUR) installed:

docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl  \
  nvidia/cuda:11.0-base nvidia-smi

Without the --device flags I get this unhelpful message: Failed to initialize NVML: Unknown Error.

Edit: also make sure you have no-cgroups = true in /etc/nvidia-container-runtime/config.toml (thanks @mpizenberg)

klueska commented 3 years ago

https://github.com/NVIDIA/nvidia-docker/issues/1549#issuecomment-939943090

mpizenberg commented 3 years ago

> Minimal working example on Arch with nvidia-container-toolkit (from AUR) installed: ... Without the --device flags I get this unhelpful message: Failed to initialize NVML: Unknown Error.

@japm48 may I ask what changes you made exactly to get that command to work? Did you also do the systemd.unified_cgroup_hierarchy=false kernel parameter change and the no-cgroups = false nvidia config change?

Without doing those, I'm on Arch with kernel 5.14.14, with version 1.5.1-1 of the aur/nvidia-container-toolkit and when running the command:

docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl  \
  nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

I get

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

Same if I change the no-cgroups config to false. I haven't tried changing my kernel parameters, though; I'd like to avoid that!

EDIT: it now works with some changes

Ok I actually got it working on my system with the following setup:

docker run --rm --gpus all \
         --device /dev/nvidia0 --device /dev/nvidia-modeset  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidiactl \
         nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

TaridaGeorge commented 2 years ago

After setting no-cgroups to true I get this error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

OS: debian 11

ninnghazad commented 2 years ago

> After setting no-cgroups to true I get this error:
>
> NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
> Please also try adding directory that contains libnvidia-ml.so to your system PATH.
>
> OS: debian 11

https://github.com/NVIDIA/nvidia-docker/issues/1163#issuecomment-824775675

ldconfig = "/sbin/ldconfig.real"

This worked for me on Debian 11.
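
(A quick sanity check before editing, not from the original comment, is to see which ldconfig paths actually exist on your system:)

    ls -l /sbin/ldconfig /sbin/ldconfig.real 2>/dev/null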

japm48 commented 2 years ago

@mpizenberg

> @japm48 may I ask what changes you made exactly to get that command to work? Did you also do the systemd.unified_cgroup_hierarchy=false kernel parameter change and the no-cgroups = false nvidia config change?

I'm really sorry I didn't see your message.

I had no-cgroups = true in /etc/nvidia-container-runtime/config.toml, but I didn't modify the file myself. This is likely because I did a fresh install and got the patched config file; I guess you had the previous (unpatched) version installed, so it wasn't overwritten on update.

chenhengqi commented 2 years ago

> @lissyx Thank you for pointing out the crux of the issue. We are in the process of rearchitecting the NVIDIA container stack in such a way that issues like this should not exist in the future (because we will rely on runc (or whatever the configured container runtime is) to do all cgroup setup instead of doing it ourselves).
>
> That said, this rearchitecting effort will take at least another 9 months to complete. In the meantime, I'm curious what the impact is, and how difficult it would be to add cgroup v2 support to libnvidia-container to prevent issues like this until the rearchitecting is complete.

@klueska It's been 11 months, any updates on this rearchitecting :)

klueska commented 2 years ago

The rearchitecture work has been slower than we hoped, but (somewhat because of this) we have now built support for cgroup v2 in libnvidia-container and it is currently under review. We hope to have an RC out before Christmas.

Here is the MR chain:

- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/113
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/114
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/115
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/116
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/117

klueska commented 2 years ago

We now have an RC of libnvidia-container out that adds support for cgroupv2.

If you would like to try it out, make sure and add the experimental repo to your apt sources and install the latest packages:

For DEBs

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y libnvidia-container-tools libnvidia-container1

For RPMs

sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum install -y libnvidia-container-tools libnvidia-container1
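
After installing, you can confirm that the updated library is in place; it should report a 1.8.0 release candidate or later (a quick check, and the exact output format may differ between versions):

    nvidia-container-cli -V | head -n1
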
euri10 commented 2 years ago

I was previously using the systemd.unified_cgroup_hierarchy=false kernel command-line parameter on Debian Bullseye and removed it after I upgraded to 1.8.0-rc.1; every container that uses my GPU seems to be working perfectly fine so far.

thanks !

The only minor difference from the instructions above is that my repo sources file is called nvidia-docker.list and not libnvidia-container.list; not sure why.

klueska commented 2 years ago

Yes, that may be true for many users and I should have pointed that out.

We used to host packages across three different repos and recently consolidated down to just 1 (i.e. libnvidia-container). The changes were mostly transparent, but I can see how instructions for enabling the experimental repo may need to be tweaked depending on which repos you actually have configured.

Basically we used to package binaries and host packages as seen in the table below:

| Binary                   | Package                   | Repo                                      |
|--------------------------|---------------------------|-------------------------------------------|
| nvidia-docker            | nvidia-docker2            | nvidia.github.io/nvidia-docker            |
| nvidia-container-runtime | nvidia-container-runtime  | nvidia.github.io/nvidia-container-runtime |
| nvidia-container-toolkit | nvidia-container-toolkit  | nvidia.github.io/nvidia-container-runtime |
| nvidia-container-cli     | libnvidia-container-tools | nvidia.github.io/libnvidia-container      |
| libnvidia-container.so.1 | libnvidia-container1      | nvidia.github.io/libnvidia-container      |

But that changed recently to:

| Binary                   | Package                   | Repo                                 |
|--------------------------|---------------------------|--------------------------------------|
| nvidia-docker            | nvidia-docker2            | nvidia.github.io/libnvidia-container |
| nvidia-container-runtime | nvidia-container-toolkit  | nvidia.github.io/libnvidia-container |
| nvidia-container-toolkit | nvidia-container-toolkit  | nvidia.github.io/libnvidia-container |
| nvidia-container-cli     | libnvidia-container-tools | nvidia.github.io/libnvidia-container |
| libnvidia-container.so.1 | libnvidia-container1      | nvidia.github.io/libnvidia-container |

So nowadays all you actually need is libnvidia-container.list to get access to all of the new packages, but if you have nvidia-docker.list that is still OK, because it also contains entries for all of the repos listed in libnvidia-container.list (it just contains entries for more -- now unnecessary -- repos as well).
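
(An illustrative way to see which of these list files a given machine actually has, not part of the original comment:)

    ls /etc/apt/sources.list.d/ | grep -i nvidia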

alexcpn commented 2 years ago

Pop OS did not have this problem until now, but with the latest update (21.1) I got this error.

Setting no-cgroups = true in /etc/nvidia-container-runtime/config.toml let the container start, but TensorFlow reported zero GPUs. I also tried with Podman:

sudo podman run  -e NVIDIA_VISIBLE_DEVICES=0   -it --network host -v /home/alex/coding:/tf/notebooks docker.io/tensorflow/tensorflow:latest-gpu-jupyter

but TensorFlow still reported zero GPUs.

But switching to cgroups v1 via the kernel parameter worked. For anyone else using Pop OS, which uses systemd-boot instead of GRUB, the commands below may help:

sudo kernelstub -a "systemd.unified_cgroup_hierarchy=0"
sudo update-initramfs -c -k all
reboot
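
(Side note, not in the original comment: kernelstub's -d flag deletes a kernel option again, so the change can be undone once cgroup v2 support lands; verify the flag with kernelstub --help.)

    sudo kernelstub -d "systemd.unified_cgroup_hierarchy=0"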

The error string before this was:

xx@pop-os:~/coding/cnn_1/cnn_py$ docker run  --gpus device=0  -it --network host -v /home/alex/coding:/tf/notebooks tensorflow/tensorflow:latest-gpu-jupyter
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled 
alex@pop-os:~/coding/cnn_1/cnn_py$ apt list  nvidia-container-toolkit 
Listing... Done
nvidia-container-toolkit/now 1.5.1-1pop1~1627998766~21.04~9847cf2 amd64 [installed,local]

After the restart it worked (screenshot in the original comment).

More details: https://medium.com/nttlabs/cgroup-v2-596d035be4d7

muhark commented 2 years ago

For those who are here after upgrading to Ubuntu 21.10 (not supported): using the experimental version of the 18.04 repo and reinstalling libnvidia-container-tools and libnvidia-container1 works. (Don't forget to restart Docker afterwards.)

Thank you for all your amazing work!

ljburtz commented 2 years ago

@muhark I have the same issue on 21.10. Which versions did you install? Which commands did you use for the reinstall/experimental version that you mention fixed it? Any help much appreciated!

muhark commented 2 years ago

@ljburtz, I used the command by @klueska above:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list

Then my /etc/apt/sources.list.d/nvidia-docker.list looks like the following:

#deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /

And finally I reinstalled (since I'd already been bungling the existing installations).

sudo apt-get update
sudo apt-get install --reinstall libnvidia-container-tools libnvidia-container1

and then tested with:

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

FWIW, I am using a GTX 3070.

ljburtz commented 2 years ago

Fantastic, this works on a GTX 3060 / Ubuntu 21.10. Major thanks for replying so fast, @muhark.

Frikster commented 2 years ago

If you need cgroups active, so you cannot set no-cgroups = true, and you're on Pop OS 21.10: as per this explanation, this one command fixed the issue for me while keeping cgroups on:

sudo kernelstub -a systemd.unified_cgroup_hierarchy=0

I then had to reboot and the issue is gone.

klueska commented 2 years ago

libnvidia-container-1.8.0-rc.2 is now live with some minor updates to fix some edge cases around cgroupv2 support.

Please see https://github.com/NVIDIA/libnvidia-container/issues/111#issuecomment-989024375 for instructions on how to get access to this RC (or wait for the full release at the end of next week).

Note: This does not directly add Debian Testing support, but you can point to the debian10 repo and install from there for now.
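
One way to do that, sketched from the workaround just described (the URL follows the repo layout shown earlier in this thread, so treat it as an assumption):

    distribution=debian10
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
      | sudo tee /etc/apt/sources.list.d/libnvidia-container.list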

jbcpollak commented 2 years ago

This may be useful for Ubuntu users running into this issue:

> So nowadays all you actually need is libnvidia-container.list to get access to all of the new packages, but if you have nvidia-docker.list that is still OK, because it also contains entries for all of the repos listed in libnvidia-container.list (it just contains entries for more -- now unnecessary -- repos as well).

@klueska, I just wanted to mention that when I go to the following URLs:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list

I get a valid apt list in response.

But if I visit:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/libnvidia-container.list

I get:

    # Unsupported distribution!
    # Check https://nvidia.github.io/nvidia-docker

It appears the list has been moved back to the original filename?

klueska commented 2 years ago

These:

https://nvidia.github.io/nvidia-docker/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/nvidia-docker/ubuntu20.04/libnvidia-container.list

Should be:

https://nvidia.github.io/libnvidia-container/ubuntu18.04/libnvidia-container.list
https://nvidia.github.io/libnvidia-container/ubuntu20.04/libnvidia-container.list

jbcpollak commented 2 years ago

Ah, :facepalm:, much appreciated. Thanks for making it explicit.

klueska commented 2 years ago

libnvidia-container-1.8.0 with cgroupv2 support is now GA

Release notes here: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0

klueska commented 2 years ago

Debian 11 support has now been added, so running the following should work as expected:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
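
After adding the repo, the usual install-and-test sequence applies (standard setup steps for this stack, assuming Docker itself is already installed):

    sudo apt-get update
    sudo apt-get install -y nvidia-docker2
    sudo systemctl restart docker
    docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi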