NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: container error: cgroup subsystem devices not found: unknown #1660

Closed dixson3 closed 1 year ago

dixson3 commented 2 years ago

I recently installed Docker and the NVIDIA CUDA tools on a PopOS 22.04 (Ubuntu 22.04) system, and I am trying to enable GPU access in Docker.

❯ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "ldconfig; nvidia-smi"
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

I have already tried a clean install of Docker (following the instructions at https://docs.docker.com/engine/install/ubuntu/) and of nvidia-docker2 (following the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#install-guide).

Am I missing a step? How can I resolve this?


Here are my particulars:

> lsb_release -a
No LSB modules are available.
Distributor ID: Pop
Description:    Pop!_OS 22.04 LTS
Release:    22.04
Codename:   jammy
> nvidia-smi
Thu Aug  4 12:16:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0  On |                  N/A |
|  0%   35C    P8    11W / 310W |    449MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4239      G   /usr/lib/xorg/Xorg                203MiB |
|    0   N/A  N/A      5148      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      6664      G   alacritty                          10MiB |
|    0   N/A  N/A      8064      G   firefox                           168MiB |
+-----------------------------------------------------------------------------+
> apt list | rg installed | rg docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker-ce-cli/jammy,now 5:20.10.17~3-0~ubuntu-jammy amd64 [installed]
docker-ce-rootless-extras/jammy,now 5:20.10.17~3-0~ubuntu-jammy amd64 [installed,automatic]
docker-ce/jammy,now 5:20.10.17~3-0~ubuntu-jammy amd64 [installed]
docker-compose-plugin/jammy,now 2.6.0~ubuntu-jammy amd64 [installed]
docker-scan-plugin/jammy,now 0.17.0~ubuntu-jammy amd64 [installed,automatic]
nvidia-docker2/jammy,jammy,now 2.9.0-1~1644261147~22.04~c7639fe all [installed]
> apt list | rg installed | rg nvidia

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnvidia-cfg1-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-common-515/jammy,jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed all [installed,automatic]
libnvidia-compute-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-compute-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed i386 [installed,automatic]
libnvidia-container-tools/jammy,now 1.8.0-1~1644255740~22.04~76ed4b4 amd64 [installed,automatic]
libnvidia-container1/jammy,now 1.8.0-1~1644255740~22.04~76ed4b4 amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-decode-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed i386 [installed,automatic]
libnvidia-egl-wayland1/jammy,now 1:1.1.9-1.1 amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-encode-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed i386 [installed,automatic]
libnvidia-extra-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-fbc1-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed i386 [installed,automatic]
libnvidia-gl-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
libnvidia-gl-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed i386 [installed,automatic]
nvidia-compute-utils-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
nvidia-container-toolkit/jammy,now 1.8.0-1pop1~1644260705~22.04~60691e5 amd64 [installed,automatic]
nvidia-dkms-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
nvidia-docker2/jammy,jammy,now 2.9.0-1~1644261147~22.04~c7639fe all [installed]
nvidia-driver-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
nvidia-kernel-common-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
nvidia-kernel-source-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
nvidia-settings/jammy,now 510.47.03-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]
system76-driver-nvidia/jammy,jammy,now 20.04.60~1659452571~22.04~9ef923b all [installed]
xserver-xorg-video-nvidia-515/jammy,now 515.48.07-1pop0~1657640780~22.04~e863eed amd64 [installed,automatic]

klueska commented 2 years ago

https://github.com/NVIDIA/nvidia-docker/issues/1643#issuecomment-1152957965

woook commented 2 years ago

I appear to be having the same issue even with a later version of the toolkit:


lib-version: 1.11.0
build date: 2022-09-18T23:16+00:00
build revision: 
build compiler: x86_64-linux-gnu-gcc-11 11.2.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -fPIC -Wdate-time -D_FORTIFY_SOURCE=2 -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -fPIC -g -O2 -ffile-prefix-map=/build/libnvidia-container-CeXONE/libnvidia-container-1.11.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/tirpc -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro

EnkrateiaLucca commented 1 year ago

I'm having the same issue!

SebasGarcia08 commented 1 year ago

I'm having the same issue, any updates?

klueska commented 1 year ago

From what I've gathered while responding to other tickets with this same issue, PopOS appears to compile and distribute its own version of libnvidia-container with WITH_NVCGO=no set at compile time. Unless this is set to yes (which it is by default), the library has no support for cgroupv2, which can result in the error you see here.

Since PopOS is building this library themselves, even recent versions of the library will appear to exhibit this issue, even if the same version of the official library does not.
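
To check whether a host is affected, it can help to confirm that it is actually running the cgroup v2 hierarchy. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes the usual convention that `stat -fc %T /sys/fs/cgroup` prints `cgroup2fs` on a unified (v2) host and `tmpfs` on a v1 or hybrid host.

```shell
# Map a filesystem type (as printed by `stat -fc %T /sys/fs/cgroup`)
# to the cgroup hierarchy it indicates.
# Hypothetical helper, named here for illustration only.
cgroup_version_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy or hybrid hierarchy
    *)         echo "unknown" ;;  # path missing or unexpected filesystem
  esac
}

# Report which hierarchy this host mounts at /sys/fs/cgroup.
host_cgroup_version() {
  cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
}

host_cgroup_version
```

If this reports `v2` and the installed libnvidia-container was built without cgroupv2 support, the "cgroup subsystem devices not found" error above is the likely outcome.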

Please make sure to override the PopOS repos and pull from the official NVIDIA repos instead.

A community-provided solution for doing so can be found here: https://gist.github.com/kuang-da/2796a792ced96deaf466fdfb7651aa2e
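
One common way to prefer one repository over another is apt pinning. The sketch below only prints a preferences stanza. It is an illustration under stated assumptions and does not reproduce the linked gist: the function name is hypothetical, and the origin `nvidia.github.io` is an assumption about where the official packages are served from, so adjust it to match your sources list.

```shell
# Print an apt preferences stanza that ranks packages from one origin
# above all others. A priority above 1000 lets apt pick that origin's
# version even when doing so is a downgrade.
# ASSUMPTION: the official NVIDIA repository origin is nvidia.github.io.
nvidia_pin_stanza() {
  origin="${1:-nvidia.github.io}"
  printf 'Package: *\nPin: origin %s\nPin-Priority: 1001\n' "$origin"
}

nvidia_pin_stanza
```

The output could be saved as a file under /etc/apt/preferences.d/ (the file name is up to you), followed by `sudo apt-get update` and a reinstall of the container packages. Running `apt-cache policy libnvidia-container1` afterwards shows which repository the installed version came from.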