NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

NVIDIA Docker failed to start container due to cgroup v2 #127146

Open Abdillah opened 3 years ago

Abdillah commented 3 years ago

Describe the bug NVIDIA Docker (virtualisation.docker.enableNvidia) cannot be used with the default NixOS options because libnvidia-container does not support cgroup v2 (the error, root cause). The container refuses to spawn because of this runtime error.

$ nvidia-docker run -it -p 3000:3000 mycroft/mimic2:gpu

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0003] error waiting for container: context canceled 

There are two potential solutions, as described in https://github.com/NVIDIA/libnvidia-container/issues/111#issuecomment-782332657:

  1. Switch systemd to hybrid cgroup mode when NVIDIA is enabled (systemd.enableUnifiedCgroupHierarchy = false;)
  2. Switch off cgroup v2 support in nvidia-container-runtime, per https://github.com/NVIDIA/nvidia-container-runtime/issues/47#issuecomment-463495931:

    I encountered this issue exactly because I'm running rootless docker with nvidia runtime using your usernetes. Everything works, except have to set no-cgroups = true in /etc/nvidia-container-runtime/config.toml
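
For reference, the first workaround might look roughly like this in configuration.nix. This is only a sketch, assuming Docker is managed through the NixOS module and that the release in use still provides systemd.enableUnifiedCgroupHierarchy; virtualisation.docker.enable is my assumption, the other two options are the ones named in this issue.

```nix
{ config, pkgs, ... }:

{
  # Assumption: Docker is enabled through the NixOS module.
  virtualisation.docker.enable = true;

  # The option that triggers the failure described in this issue.
  virtualisation.docker.enableNvidia = true;

  # Workaround 1: fall back to the hybrid cgroup hierarchy so that
  # libnvidia-container can find the cgroup v1 "devices" subsystem.
  systemd.enableUnifiedCgroupHierarchy = false;
}
```

Changing the cgroup hierarchy most likely needs a reboot after nixos-rebuild before it takes effect.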

To Reproduce Steps to reproduce the behavior:

  1. Clone the mycroft/mimic2 repository and enter the directory. Any repository or Docker image with GPU requirements should work.
  2. Execute the build command docker build -t mycroft/mimic2:gpu -f gpu.Dockerfile .
  3. Execute the run command nvidia-docker run -it -p 3000:3000 mycroft/mimic2:gpu.

Expected behavior Run happily ever after.

Metadata Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.40, NixOS, 21.11pre293089.1c2986bbb80 (Porcupine)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.4pre20210503_6d2553a`
 - channels(root): `"nixos-21.11pre293089.1c2986bbb80, nixos-hardware, nixos-unstable-21.05pre283367.0a5f5bab0e0"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
- systemd.enableUnifiedCgroupHierarchy
- virtualisation.docker.enableNvidia

# a list of nixos modules affected by the problem
module:
- systemd
- nvidia-docker

Abdillah commented 2 years ago

cgroup v2 is now supported by https://github.com/NVIDIA/libnvidia-container as of the v1.8.0 release.