NVIDIA / nvidia-container-runtime


rootless, subuid-less GPU support with podman #145

Closed qhaas closed 11 months ago

qhaas commented 3 years ago

Issue #85 is diverging in different directions and becoming a catch-all for all things podman, so I thought I'd break the issue described in this comment out into its own issue. In certain situations (e.g. podman issue 8580), it is not practical to set up subuid / subgid for each user, so we'd like to get GPU acceleration working without them, which Singularity can already do.

Test System (using the container-tools:3.0 appstream):

$ cat /etc/redhat-release 
CentOS Linux release 8.4.2105
$ uname -r
4.18.0-305.7.1.el8_4.x86_64
$ nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB
$ nvidia-smi | grep Version | awk '{print $3}'
470.42.01
$ nvidia-container-cli --version | head -1
version: 1.4.0
$ crun --version | grep version
crun version 0.18
$ runc --version | grep version
runc version spec: 1.0.2-dev
$ podman --version
podman version 3.0.2-dev

nvidia-container-runtime config (note that no-cgroups is now true and debug files go to /tmp, per Issue #85):

$ cat /etc/nvidia-container-runtime/config.toml 
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/tmp/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/tmp/nvidia-container-runtime.log"
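Since the result depends on the `[nvidia-container-cli]` settings, it can help to confirm the effective values before testing. The following is a minimal sketch, not part of the toolkit: `cli_setting` is a hypothetical helper that prints the uncommented value of a key in that table, assuming the simple `key = value` layout shown above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the NVIDIA tooling): print the value
# of an uncommented key inside the [nvidia-container-cli] table.
# Assumes one "key = value" pair per line, as in the listing above.
cli_setting() {
    key="$1"
    file="${2:-/etc/nvidia-container-runtime/config.toml}"
    awk -v key="$key" '
        # Track whether the current table header is [nvidia-container-cli].
        /^[ \t]*\[/ { in_cli = ($0 ~ /^[ \t]*\[nvidia-container-cli\]/); next }
        # Inside that table, skip commented lines and compare the key name.
        in_cli && $0 !~ /^[ \t]*#/ {
            split($0, kv, "=")
            k = kv[1]
            gsub(/[ \t]/, "", k)
            if (k == key) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)
                print v
            }
        }
    ' "$file"
}
```

For example, `cli_setting no-cgroups` should print `true` for the config above, and `cli_setting user` should print nothing because that key is commented out.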

podman storage config (per Issue #85 and rootless podman guide):

$ cat ~/.config/containers/storage.conf
[storage]
driver = "overlay"
graphroot = "/tmp/${USER}-containers-peak"
rootless_storage_path = "${HOME}/.local/share/containers/storage"

[storage.options]
additionalimagestores = [
]

[storage.options.overlay]
ignore_chown_errors = "true"
mount_program = "/usr/bin/fuse-overlayfs"
mountopt = "nodev,metacopy=on"

[storage.options.thinpool]

With subuid / subgid set, things work fine. Logs are posted as nct_works_log.txt:

$ grep ${USER}: /etc/subuid | wc -l
1
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-0a55d110-f8ea-4209-baa7-0e5675c7e832)

Without subuid / subgid set, GPU-accelerated containers fail, but containers that do not use the GPU work. Logs are posted as nct_fails_log.txt:

$ grep ${USER}: /etc/subuid | wc -l
0
$ podman run --rm docker.io/centos:8 cat /etc/redhat-release
CentOS Linux release 8.3.2011
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
Error: OCI runtime error: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request

Per suggestions online, I added the account without subuid / subgid to the video group, but that did not help. I'm also not clear on the implications of adding a user to the video group, so I asked over on the NVIDIA forums.
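The `grep ${USER}: ... | wc -l` check used above also matches any longer user name that ends in the same string. A slightly stricter variant is sketched below; `subid_count` is a hypothetical helper name, and it assumes the usual `name:start:count` layout of /etc/subuid and /etc/subgid.

```shell
#!/bin/sh
# Hypothetical helper: count the subordinate ID ranges assigned to a user.
# Compares the whole first field, so "bob" does not match "bobby".
# Prints 0 if the file is missing or unreadable.
subid_count() {
    user="$1"
    file="$2"
    [ -r "$file" ] || { echo 0; return 0; }
    awk -F: -v u="$user" '$1 == u { n++ } END { print n + 0 }' "$file"
}
```

Calling `subid_count "$USER" /etc/subuid` and `subid_count "$USER" /etc/subgid` before each test run makes it clear which of the two cases in this report is being exercised.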

qhaas commented 3 years ago

The above used runc. I retried with crun: containers without GPU acceleration work fine, but GPU-accelerated containers still fail when subuid is not set. Logs are attached as nct_fails_crun_log.txt:

$ grep 'runtime =' /usr/share/containers/containers.conf
runtime = "crun"
#runtime = "runc"
$ podman run --rm --security-opt=label=disable --hooks-dir=/usr/share/containers/oci/hooks.d/ docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
Error: OCI runtime error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1)
$ podman run --rm docker.io/centos:8 cat /etc/redhat-release
CentOS Linux release 8.3.2011

qhaas commented 3 years ago

As an alternative, I created this issue over on the podman GitHub to see if Singularity's approach to GPU acceleration is applicable to podman.

elezar commented 11 months ago

We have recently reworked our podman support and now suggest using CDI to request devices. Please see the updated documentation and feel free to open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if problems persist.
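For readers landing here, a rough sketch of what the CDI route looks like is given below. It assumes `nvidia-ctk` from the NVIDIA Container Toolkit and a CDI-aware podman are installed; the function name `cdi_smoke_test` is made up for illustration, the image is simply the one used earlier in this thread, and nothing in this thread confirms that CDI resolves the subuid-less case.

```shell
#!/bin/sh
# Sketch only: generate a CDI spec, list the devices it defines, and request
# all GPUs by CDI name from a rootless container.
cdi_smoke_test() {
    if ! command -v nvidia-ctk >/dev/null 2>&1; then
        echo "nvidia-ctk not found; install the NVIDIA Container Toolkit first" >&2
        return 1
    fi
    # Writing to /etc/cdi needs root; this is the output path used in NVIDIA's docs.
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml || return 1
    # Show the device names that can be passed to --device.
    nvidia-ctk cdi list || return 1
    # No OCI hook directory is passed here; the device is requested via CDI.
    podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable \
        docker.io/nvidia/cuda:10.2-base-centos8 nvidia-smi -L
}
```

The spec is generated once per driver or device change rather than on every container start, so the `generate` step is normally run by an administrator and only the `podman run` line by the unprivileged user.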