[Open] LukasIAO opened this issue 5 months ago
The error:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all: unknown.
This indicates that rootless Docker cannot find the CDI specifications that were generated. As far as I am aware, rootless Docker remaps the path used for /etc (and other paths), and this could be what is causing issues for the runtime here.
Since you're using a Docker version that supports CDI (as an opt-in feature, I believe), could you try the native CDI injection here?
Running:
nvidia-ctk runtime configure --runtime=docker --cdi.enabled
and restarting the docker daemon should enable this feature. (Note that the command may need to be adjusted for rootless mode to specify the config file path explicitly as per https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#rootless-mode).
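For a rootless setup, a minimal sketch of that adjustment might look like the following (assuming the rootless daemon.json lives at the default ~/.config/docker/daemon.json and the daemon runs as a systemd user service):
```
# Point nvidia-ctk at the rootless daemon's config file (path is an assumption)
nvidia-ctk runtime configure --runtime=docker --cdi.enabled \
    --config=$HOME/.config/docker/daemon.json

# Restart the rootless daemon (user service) rather than the system one
systemctl --user restart docker
```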
Then, with the CDI feature enabled in Docker, you should be able to run:
$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
and have the devices injected without using the nvidia runtime.
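If no CDI spec has been generated yet, a sketch of creating and sanity-checking one (assuming the default /etc/cdi location) would be:
```
# Generate a CDI specification describing the installed GPUs and driver files
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the toolkit can resolve from the generated spec(s)
nvidia-ctk cdi list
```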
Hey @elezar, thank you for taking the time!
CDI injection seems to be a mainline feature in Docker 26.0.0. Though it is still experimental, it no longer requires the user to set DOCKER_CLI_EXPERIMENTAL, as was the case in 25.x.
The native injection worked on the rootful instance after configuring the daemon as suggested, though the rootless Docker still runs into issues, as listed below.
Before applying the suggested configurations I tested the following on rootless:
$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].
ERRO[0000] error waiting for container: context canceled
$ docker run --rm -ti --runtime=nvidia --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: could not select device driver "cdi" with capabilities: [].
$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all: unknown.
After applying the configuration with nvidia-ctk runtime configure --runtime=docker --cdi.enabled --config=$HOME/.config/docker/daemon.json, the daemon.json looks like this:
{
  "features": {
    "cdi": true
  },
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
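To make sure both daemons pick up the change, the restart step might look like this (assuming the rootless daemon runs as a systemd user service):
```
# Rootful daemon
sudo systemctl restart docker

# Rootless daemon (runs under the user's systemd instance)
systemctl --user restart docker
```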
Restarting Docker and testing the CDI injections again leads to the following, regardless of the cgroup setting:
$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.
$ docker run --rm -ti --runtime=nvidia --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.
$ docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=all: unknown.
I checked the configured CDI spec locations for both Docker daemons.
Both point to:
CDI spec directories:
/etc/cdi
/var/run/cdi
However, it looks like nothing was created under /var/run/cdi. Permissions for nvidia.yaml:
/etc/cdi$ ls -la
total 32
drwxr-xr-x 2 root root 4096 ožu 29 23:22 .
drwxr-xr-x 167 root root 12288 ožu 29 23:22 ..
-rw-r--r-- 1 root root 13203 ožu 29 23:22 nvidia.yaml
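A quick way to double-check that the spec itself resolves to devices, independently of Docker, would be the toolkit's own listing command:
```
# Should print entries such as nvidia.com/gpu=0 ... nvidia.com/gpu=all
nvidia-ctk cdi list
```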
The Docker docs for enabling CDI devices suggest manually setting the spec location, but it does not seem to make a difference in this case.
{
  "features": {
    "cdi": true
  },
  "cdi-spec-dirs": ["/etc/cdi/", "/var/run/cdi"],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
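To confirm the daemon actually picked up those directories after a restart, something like the following check should work (the client must be pointed at the rootless daemon for the rootless case):
```
# Prints the "CDI spec directories" section of docker info, if CDI is enabled
docker info | grep -i -A 3 "CDI spec"
```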
Could you try generating (or copying) a CDI spec to /var/run/cdi in addition to /etc/cdi and see if this fixes the rootless case?
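A sketch of what that could look like, assuming the spec already exists under /etc/cdi:
```
# Either copy the existing spec ...
sudo mkdir -p /var/run/cdi
sudo cp /etc/cdi/nvidia.yaml /var/run/cdi/

# ... or generate a fresh one directly into /var/run/cdi
sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
```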
I copied the yaml to /var/run/cdi, restarted both Docker daemons, and tested again. Unfortunately, there was no change in behavior.
/var/run/cdi$ ls -la
total 16
drwxr-xr-x 2 root root 60 tra 3 10:02 .
drwxr-xr-x 51 root root 1580 tra 3 10:02 ..
-rw-r--r-- 1 root root 13203 tra 3 10:02 nvidia.yaml
I think the key is the following: https://github.com/moby/moby/blob/8599f2a3fb884afcbbf1471ec793fbcbc327cd35/cmd/dockerd/docker.go#L65C1-L72C1
I would assume that, for the Docker daemon running with RootlessKit, the path where it is trying to resolve the CDI device specifications is not /var/run/cdi or /etc/cdi. It may be good to create an issue (or transfer this one) to https://github.com/moby/moby so that we can get input from the developers there as to where these paths map to.
It may be sufficient to copy the spec file to a location that is readable by the daemon to confirm.
Note that plugins are also handled differently for rootless mode: https://github.com/moby/moby/blob/8599f2a3fb884afcbbf1471ec793fbcbc327cd35/pkg/plugins/discovery_unix.go#L11
I wonder if this implies that the "correct" location for rootless is $HOME/.docker/cdi or $HOME/.docker/run/cdi?
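A minimal way to test that hypothesis, assuming the spec generated under /etc/cdi is simply reused:
```
# Create the candidate rootless locations and copy the existing spec into them
mkdir -p $HOME/.docker/cdi $HOME/.docker/run/cdi
cp /etc/cdi/nvidia.yaml $HOME/.docker/cdi/
cp /etc/cdi/nvidia.yaml $HOME/.docker/run/cdi/
```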
I just tested @klueska's idea by copying the yaml to $HOME/.docker/cdi and $HOME/.docker/run/cdi respectively, and specifying the custom locations in the daemon.json:
{
  "features": {
    "cdi": true
  },
  "cdi-spec-dirs": ["/home/username/.docker/cdi/", "/home/username/.docker/run/cdi/"],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
CDI spec directories:
/home/username/.docker/cdi/
/home/username/.docker/run/cdi/
With this change, the native CDI injection does indeed run on rootless.
/.config/docker$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b6022b4d-71db-8f15-15de-26a719f6b3e1)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-22420f7d-6edb-e44a-c322-4ce539cade19)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-5e3444e2-8577-0e99-c6ee-72f6eb2bd28c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-dd1f811d-a280-7e2e-bf7e-b84f7a977cc1)
It's good to know there is a path to making this work. I'd be interested to know if these are the "default" locations if you remove cdi-spec-dirs entirely.
> It's good to know there is a path to making this work. I'd be interested to know if these are the "default" locations if you remove cdi-spec-dirs entirely.

I would be surprised if this is the case since, IIRC, we explicitly set /etc/cdi and /var/run/cdi in the daemon.
You can see the docker info output of the rootless client in my original reply to @elezar. Before specifying it explicitly, I wanted to check where the client was looking for the config. Once CDI is enabled, both rootless and rootful seem to default to:
CDI spec directories:
/etc/cdi
/var/run/cdi
The choice of .docker/cdi seemed fitting, however.
That seems like a bug that should be filed against moby/docker then.
It might also be worth noting in the CDI documentation that a rootless Docker daemon requires the yaml to be generated in (or moved to) a location the daemon has access to, wherever that may end up being.
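Pulling the thread together, a rough end-to-end workaround for rootless Docker based on what worked above (the paths and the manual cdi-spec-dirs edit are assumptions; adjust to your setup):
```
# 1. Generate the CDI spec (requires root / driver access)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# 2. Copy it to a directory the rootless daemon can read
mkdir -p $HOME/.docker/cdi
cp /etc/cdi/nvidia.yaml $HOME/.docker/cdi/

# 3. Enable the CDI feature in the rootless daemon.json ...
nvidia-ctk runtime configure --runtime=docker --cdi.enabled \
    --config=$HOME/.config/docker/daemon.json
# ... and add "cdi-spec-dirs": ["/home/<user>/.docker/cdi/"] to that file by hand
#     (daemon.json does not expand $HOME, so use the absolute path)

# 4. Restart the rootless daemon and test
systemctl --user restart docker
docker run --rm --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
```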
I had a similar bug and kept reading and trying things here, because I couldn't find more info.
I'm on Manjaro and this bug was very weird: yesterday my Docker was working well with my GPU, but after an update something broke it, and when I try to use the GPU in an ollama Docker container it shows this: failed to stat CDI host device "/dev/nvidia-modeset"
Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-modeset": no such file or directory
My solution may look stupid or strange, and you might think it wouldn't work, but I reinstalled nvidia-container-toolkit using pacman and, to my surprise, it worked. I never thought something so silly would work. My silly solution for my strange case:
sudo pacman -S nvidia-container-toolkit
PS: I use podman.
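For anyone hitting the same /dev/nvidia-modeset error, a hedged way to check whether a reinstall fixed things is to regenerate the spec and run a CDI-addressed container (device name assumed to match the generated spec):
```
# Regenerate the CDI spec so it matches the currently installed driver
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Podman resolves CDI device names natively
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```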
Anyone figured this one out on GCP COS VM?
Hello everyone,
we have recently set up a rootless docker instance alongside our existing docker on one of our servers, but ran into issues mounting host GPUs into the rootless containers. A workaround was presented in issue #85 (toggling no-cgroups to switch between rootful and rootless) with a mention of a better solution in the form of Nvidia CDI coming as an experimental feature in Docker 25.
After updating to the newest Docker releases and setting up CDI, our regular Docker instance behaved as we expected based on the documentation, but the rootless instance still runs into issues.
Setup to reproduce:
config.toml:
```
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
```
nvidia.yaml: (generated CDI spec; contents omitted)
The issue: When no-cgroups = false, CDI injection works fine for the regular Docker instance, but produces the following errors for the rootless version:
Running docker run --rm --gpus all ubuntu nvidia-smi results in the same error as without OCI. This seems to be consistent across all variations listed on the Specialized Configurations for Docker page.
Interestingly, setting no-cgroups = true disables the regular use of GPUs with rootful Docker, but still allows for CDI injections:
With control groups disabled, the rootless daemon is able to use exposed GPUs as outlined in the Docker docs:
TLDR: Disabling cgroups allows the rootless containers to use exposed GPUs via the regular docker run --gpus flag. This in turn disables the rootful containers' GPU access. Leaving control groups enabled reverses the effect, as outlined in #85.
With cgroups disabled and using NVIDIA CDI, the rootful Docker can still use GPU injection, even though regular GPU access is barred, while the rootless container uses the exposed GPUs. CDI injection for rootless fails in both cases, however.
This seems like a definite improvement, but I'm not sure it's intended behavior. The CDI injection failing with rootless regardless of control group setting leads me to believe this is unintended, unless rootless is not yet supported by Nvidia CDI.
Any insights would be greatly appreciated!