@forestofrain this is something that we're aware of, but haven't really found the correct solution for. On most systems it's limited to DRM device nodes that have root:video ownership. As you point out, however, this is dependent on the driver parameters.
One thing we're considering is https://github.com/cncf-tags/container-device-interface/issues/175, where the spec would include the additional GIDs that are required in the container to access the device nodes when these are created with 0660 permissions instead of 0666.
In our testing, it was also not quite clear if setting the GID in the CDI specification would have the same effect.
Would you be able to confirm this on your end? That is to say: generate a CDI specification with nvidia-ctk cdi generate, then add a gid field to the device nodes that require it (/dev/nvidia-modeset, /dev/nvidia0, /dev/nvidiactl). For example:
deviceNodes:
- path: /dev/nvidia-modeset
  gid: 27
- path: /dev/nvidiactl
  gid: 27
Then repeat your experiments.
If this works as expected, then the spec extension is not required and we can work on updating our generated spec to include the required GID information.
If this still fails, then we would have to confirm that running podman with --group-add=27 (which updates the AdditionalGIDs field in the OCI runtime spec) works as desired.
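As a quick sanity check (a hypothetical invocation, assuming video is GID 27 on your host), you can first confirm that the additional GID actually shows up for the container process:

$ podman run --rm --group-add=27 ubuntu id -G
0 27

Note that for rootless podman this only confirms that the OCI spec edit is applied; which host GID the container's 27 corresponds to depends on the user namespace mappings discussed further down.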
Same permission error. Any other information you need?
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
Failed to initialize NVML: Insufficient Permissions
podman run --rm --group-add=27 --device nvidia.com/gpu=all ubuntu nvidia-smi -L
Failed to initialize NVML: Insufficient Permissions
My /etc/cdi/nvidia.yaml with your suggested changes is below. Note that my 3080 is only used for compute; my graphics card is an Arc 770.
---
cdiVersion: 0.5.0
containerEdits:
  deviceNodes:
  - path: /dev/nvidia-modeset
    gid: 27
  - path: /dev/nvidia-uvm
  - path: /dev/nvidia-uvm-tools
  - path: /dev/nvidiactl
    gid: 27
  hooks:
  - args:
    - nvidia-ctk
    - hook
    - update-ldcache
    - --folder
    - /usr/lib64
    hookName: createContainer
    path: /usr/sbin/nvidia-ctk
  mounts:
  - containerPath: /opt/bin/nvidia-cuda-mps-control
    hostPath: /opt/bin/nvidia-cuda-mps-control
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /opt/bin/nvidia-cuda-mps-server
    hostPath: /opt/bin/nvidia-cuda-mps-server
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /opt/bin/nvidia-debugdump
    hostPath: /opt/bin/nvidia-debugdump
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /run/nvidia-persistenced/socket
    hostPath: /run/nvidia-persistenced/socket
    options:
    - ro
    - nosuid
    - nodev
    - bind
    - noexec
  - containerPath: /usr/lib64/libEGL_nvidia.so.545.29.06
    hostPath: /usr/lib64/libEGL_nvidia.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libGLESv1_CM_nvidia.so.545.29.06
    hostPath: /usr/lib64/libGLESv1_CM_nvidia.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libGLESv2_nvidia.so.545.29.06
    hostPath: /usr/lib64/libGLESv2_nvidia.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libGLX_nvidia.so.545.29.06
    hostPath: /usr/lib64/libGLX_nvidia.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libcuda.so.545.29.06
    hostPath: /usr/lib64/libcuda.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libcudadebugger.so.545.29.06
    hostPath: /usr/lib64/libcudadebugger.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvcuvid.so.545.29.06
    hostPath: /usr/lib64/libnvcuvid.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-allocator.so.545.29.06
    hostPath: /usr/lib64/libnvidia-allocator.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-cfg.so.545.29.06
    hostPath: /usr/lib64/libnvidia-cfg.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-egl-gbm.so.1.1.1
    hostPath: /usr/lib64/libnvidia-egl-gbm.so.1.1.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-eglcore.so.545.29.06
    hostPath: /usr/lib64/libnvidia-eglcore.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-encode.so.545.29.06
    hostPath: /usr/lib64/libnvidia-encode.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-fbc.so.545.29.06
    hostPath: /usr/lib64/libnvidia-fbc.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-glcore.so.545.29.06
    hostPath: /usr/lib64/libnvidia-glcore.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-glsi.so.545.29.06
    hostPath: /usr/lib64/libnvidia-glsi.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-glvkspirv.so.545.29.06
    hostPath: /usr/lib64/libnvidia-glvkspirv.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-gpucomp.so.545.29.06
    hostPath: /usr/lib64/libnvidia-gpucomp.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-gtk3.so.545.29.06
    hostPath: /usr/lib64/libnvidia-gtk3.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-ml.so.545.29.06
    hostPath: /usr/lib64/libnvidia-ml.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-ngx.so.545.29.06
    hostPath: /usr/lib64/libnvidia-ngx.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-nvvm.so.545.29.06
    hostPath: /usr/lib64/libnvidia-nvvm.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-opencl.so.545.29.06
    hostPath: /usr/lib64/libnvidia-opencl.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-opticalflow.so.545.29.06
    hostPath: /usr/lib64/libnvidia-opticalflow.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-pkcs11-openssl3.so.545.29.06
    hostPath: /usr/lib64/libnvidia-pkcs11-openssl3.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-ptxjitcompiler.so.545.29.06
    hostPath: /usr/lib64/libnvidia-ptxjitcompiler.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-rtcore.so.545.29.06
    hostPath: /usr/lib64/libnvidia-rtcore.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-tls.so.545.29.06
    hostPath: /usr/lib64/libnvidia-tls.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvidia-wayland-client.so.545.29.06
    hostPath: /usr/lib64/libnvidia-wayland-client.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib64/libnvoptix.so.545.29.06
    hostPath: /usr/lib64/libnvoptix.so.545.29.06
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/sbin/nvidia-persistenced
    hostPath: /usr/sbin/nvidia-persistenced
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/sbin/nvidia-smi
    hostPath: /usr/sbin/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/firmware/nvidia/545.29.06/gsp_ga10x.bin
    hostPath: /lib/firmware/nvidia/545.29.06/gsp_ga10x.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /lib/firmware/nvidia/545.29.06/gsp_tu10x.bin
    hostPath: /lib/firmware/nvidia/545.29.06/gsp_tu10x.bin
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
    hostPath: /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
    hostPath: /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    hostPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/vulkan/icd.d/nvidia_icd.json
    hostPath: /usr/share/vulkan/icd.d/nvidia_icd.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
    hostPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
      gid: 27
    - path: /dev/dri/card1
    - path: /dev/dri/renderD129
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/sbin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/sbin/nvidia-ctk
  name: "0"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
      gid: 27
    - path: /dev/dri/card1
    - path: /dev/dri/renderD129
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card1::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/sbin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/sbin/nvidia-ctk
  name: all
kind: nvidia.com/gpu
Just to confirm, 27 in the above example is the numeric ID of the video group?
Correct, 27 is the video group on Gentoo.
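For anyone following along, the numeric GID of the video group on a given host can be looked up with getent (sample output; the GID and member list will differ per system):

$ getent group video
video:x:27:

The gid values in the CDI spec above should match whatever this reports.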
@forestofrain I was trying to set up an instance to test this locally, but was running into some issues with the driver installation to get this going. Do you have a link to some docs on getting a working Gentoo GPU-based system? (This would be terminal only).
I have been able to dig a bit further on an openSUSE system with a similar device node configuration (the GID is different, but that should not affect the findings).
One thing to note when running rootless podman is that the root:video user-group combination on the host is mapped to nobody:nogroup in the container, meaning that the device nodes show up as:
$ ls -al /dev/nvi*
crw-rw-rw- 1 nobody nogroup 236, 0 Jan 24 15:34 /dev/nvidia-uvm
crw-rw-rw- 1 nobody nogroup 236, 1 Jan 24 15:34 /dev/nvidia-uvm-tools
crw-rw---- 1 nobody nogroup 195, 0 Jan 24 15:34 /dev/nvidia0
crw-rw---- 1 nobody nogroup 195, 255 Jan 24 15:34 /dev/nvidiactl
Also note that when the container is created in a user namespace, the low-level runtime (runc) does not mknod the devices with the properties from the OCI Runtime Specification, but instead bind-mounts them into the container. The mode bitmask is not modified in this operation, so the same 0660 permissions from the host apply inside the container, while the owning user and group are mapped to nobody:nogroup. I am not familiar enough with podman's UID and GID mappings to provide a solution off the top of my head.
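The mapping can be inspected directly (a diagnostic sketch; the exact ranges come from /etc/subuid and /etc/subgid and will differ per system):

$ podman unshare cat /proc/self/gid_map
         0       1000          1
         1     100000      65536

Each line maps a range of container GIDs to host GIDs (start-in-container, start-on-host, count). A host GID such as video's 27 that falls outside every mapped range has no representation in the namespace, which is why the device nodes show up as nogroup.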
Another update.
Looking at the following entry in the troubleshooting guide: https://github.com/containers/podman/blob/main/troubleshooting.md#20-passed-in-devices-or-files-cant-be-accessed-in-rootless-container
I confirmed that in my setup, running the following with crun as the runtime gives the desired output:
podman run --rm -ti --device nvidia.com/gpu=all --group-add keep-groups --runtime=crun ubuntu nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-cdd5cfb4-69a9-a04b-4c87-070d09c51772)
Whereas with runc it still fails:
podman run --rm -ti --device nvidia.com/gpu=all --group-add keep-groups --runtime=runc ubuntu nvidia-smi -L
Failed to initialize NVML: Insufficient Permissions
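If it's unclear which runtime a given podman installation defaults to, it can be queried directly (the Go template path below is assumed from the structure of podman info's output):

$ podman info --format '{{.Host.OCIRuntime.Name}}'
crun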
Note that there is also an entry that describes using uid and gid maps to achieve similar results: https://github.com/containers/podman/blob/main/troubleshooting.md#35-passed-in-devices-or-files-cant-be-accessed-in-rootless-container-uidgid-mapping-problem
@elezar thanks for the quick solution! Running your last commands, I get the same results.
Your solution also led me to a Red Hat article that provided a nice config snippet that works:
[containers]
annotations=["run.oci.keep_original_groups=1",]
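For anyone reproducing this: the snippet belongs in a containers configuration file. For rootless use, the conventional per-user location is shown below (paths may vary by distro):

# ~/.config/containers/containers.conf
[containers]
annotations=["run.oci.keep_original_groups=1",]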
Now I can run this older TensorFlow container with fewer options on the command line :)
podman run --userns keep-id --rm -it --device nvidia.com/gpu=all tensorflow/tensorflow:2.11.0-gpu nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 (UUID: GPU-...)
Thanks for all the help.
Note that this solution may also work for https://github.com/NVIDIA/nvidia-container-runtime/issues/145
cc @qhaas
Failed to initialize NVML: Insufficient Permissions. I only had to change one setting in config.toml for my system.
I compared the relevant files to a fresh Ubuntu 23 install, where rootless worked. The only difference was the permissions on /dev/nvidia*. My distro, Gentoo, installs a config that changes the defaults for the device file parameters; this was introduced with a commit on 2021-07-21. The NVIDIA driver FAQ provides an example under "How and when are the NVIDIA device files created?". This looks reasonable to me.
Is this a bug in the container toolkit, or is it expected? I would assume that with ModifyDeviceFiles = 1 I should not have to change my distro config. Possibly relevant information is below.
Rootless Error
Rootless Success
After adding a config override with file mode 0666, podman rootless works as expected.
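For context, the kind of override described above is a modprobe sketch along these lines (NVreg_DeviceFileMode is documented in the NVIDIA driver README; the file name is just a convention):

# /etc/modprobe.d/nvidia-device-files.conf
# Have the driver create /dev/nvidia* world read/write (0666)
# instead of the distro's 0660 default.
options nvidia NVreg_DeviceFileMode=0666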