NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs

Rootless podman 'Failed to initialize NVML: Insufficient Permissions' on OpenSUSE Tumbleweed #268

Open RlndVt opened 2 years ago

RlndVt commented 2 years ago

1. Issue

$ podman run --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
Failed to initialize NVML: Insufficient Permissions

This is on OpenSUSE Tumbleweed, for what it's worth.

2. Steps to reproduce the issue

Both $ nvidia-smi and $ sudo nvidia-smi work on the host.

$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#no-cgroups = true
#user = "root:video"
user = "root:root"
ldconfig = "@/sbin/ldconfig"
#ldconfig = "/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
#debug = "/tmp/nvidia-container-runtime.log"
$ sudo podman run --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
Tue Feb 15 15:54:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P400         Off  | 00000000:01:00.0 Off |                  N/A |
| 34%   23C    P8    N/A /  N/A |      2MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After toggling no-cgroups = false to no-cgroups = true:

$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-0418f928fa7a07a3556432a296aa4ad39c33a716309117f20367f130c7a34b48.scope 
INFO[0000] Got Conmon PID as 12406  
$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-41c656a076287283f96001ffe442d4bb077993a46553167120c07d7b8c532861.scope 
INFO[0000] Got Conmon PID as 12581                      
Failed to initialize NVML: Insufficient Permissions

3. Information to attach (optional if deemed irrelevant)

-- WARNING, the following logs are for debugging purposes only --

I0215 15:58:47.072670 12695 nvc.c:376] initializing library context (version=1.8.0, build=05959222fe4ce312c121f30c9334157ecaaee260)
I0215 15:58:47.072790 12695 nvc.c:350] using root /
I0215 15:58:47.072823 12695 nvc.c:351] using ldcache /etc/ld.so.cache
I0215 15:58:47.072839 12695 nvc.c:352] using unprivileged user 1000:1000
I0215 15:58:47.072902 12695 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0215 15:58:47.073124 12695 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0215 15:58:47.074559 12696 nvc.c:273] failed to set inheritable capabilities
W0215 15:58:47.074655 12696 nvc.c:274] skipping kernel modules load due to failure
I0215 15:58:47.075261 12697 rpc.c:71] starting driver rpc service
I0215 15:58:47.081207 12699 rpc.c:71] starting nvcgo rpc service
I0215 15:58:47.081744 12695 nvc_info.c:759] requesting driver information with ''
I0215 15:58:47.082662 12695 nvc_info.c:172] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.470.103.01
I0215 15:58:47.082756 12695 nvc_info.c:172] selecting /usr/lib64/libnvoptix.so.470.103.01
I0215 15:58:47.082795 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-tls.so.470.103.01
I0215 15:58:47.082816 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-rtcore.so.470.103.01
I0215 15:58:47.082860 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.470.103.01
I0215 15:58:47.082879 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-opticalflow.so.470.103.01
I0215 15:58:47.082904 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-opencl.so.470.103.01
I0215 15:58:47.082925 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ngx.so.470.103.01
I0215 15:58:47.082946 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ml.so.470.103.01
I0215 15:58:47.082974 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-ifr.so.470.103.01
I0215 15:58:47.082992 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glvkspirv.so.470.103.01
I0215 15:58:47.083011 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glsi.so.470.103.01
I0215 15:58:47.083031 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-glcore.so.470.103.01
I0215 15:58:47.083052 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-fbc.so.470.103.01
I0215 15:58:47.083072 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-encode.so.470.103.01
I0215 15:58:47.083090 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-eglcore.so.470.103.01
I0215 15:58:47.083107 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-compiler.so.470.103.01
I0215 15:58:47.083125 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-cfg.so.470.103.01
I0215 15:58:47.083143 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-cbl.so.470.103.01
I0215 15:58:47.083161 12695 nvc_info.c:172] selecting /usr/lib64/libnvidia-allocator.so.470.103.01
I0215 15:58:47.083182 12695 nvc_info.c:172] selecting /usr/lib64/libnvcuvid.so.470.103.01
I0215 15:58:47.083260 12695 nvc_info.c:172] selecting /usr/lib64/libcuda.so.470.103.01
I0215 15:58:47.083306 12695 nvc_info.c:172] selecting /usr/lib64/libGLX_nvidia.so.470.103.01
I0215 15:58:47.083325 12695 nvc_info.c:172] selecting /usr/lib64/libGLESv2_nvidia.so.470.103.01
I0215 15:58:47.083344 12695 nvc_info.c:172] selecting /usr/lib64/libGLESv1_CM_nvidia.so.470.103.01
I0215 15:58:47.083361 12695 nvc_info.c:172] selecting /usr/lib64/libEGL_nvidia.so.470.103.01
I0215 15:58:47.083384 12695 nvc_info.c:172] selecting /usr/lib/vdpau/libvdpau_nvidia.so.470.103.01
I0215 15:58:47.083406 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-tls.so.470.103.01
I0215 15:58:47.083424 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ptxjitcompiler.so.470.103.01
I0215 15:58:47.083442 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-opticalflow.so.470.103.01
I0215 15:58:47.083459 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-opencl.so.470.103.01
I0215 15:58:47.083476 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ml.so.470.103.01
I0215 15:58:47.083495 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-ifr.so.470.103.01
I0215 15:58:47.083513 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glvkspirv.so.470.103.01
I0215 15:58:47.083530 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glsi.so.470.103.01
I0215 15:58:47.083547 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-glcore.so.470.103.01
I0215 15:58:47.083565 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-fbc.so.470.103.01
I0215 15:58:47.083582 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-encode.so.470.103.01
I0215 15:58:47.083599 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-eglcore.so.470.103.01
I0215 15:58:47.083617 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-compiler.so.470.103.01
I0215 15:58:47.083636 12695 nvc_info.c:172] selecting /usr/lib/libnvidia-allocator.so.470.103.01
I0215 15:58:47.083655 12695 nvc_info.c:172] selecting /usr/lib/libnvcuvid.so.470.103.01
I0215 15:58:47.083680 12695 nvc_info.c:172] selecting /usr/lib/libcuda.so.470.103.01
I0215 15:58:47.083707 12695 nvc_info.c:172] selecting /usr/lib/libGLX_nvidia.so.470.103.01
I0215 15:58:47.083726 12695 nvc_info.c:172] selecting /usr/lib/libGLESv2_nvidia.so.470.103.01
I0215 15:58:47.083744 12695 nvc_info.c:172] selecting /usr/lib/libGLESv1_CM_nvidia.so.470.103.01
I0215 15:58:47.083763 12695 nvc_info.c:172] selecting /usr/lib/libEGL_nvidia.so.470.103.01
W0215 15:58:47.083773 12695 nvc_info.c:398] missing library libnvidia-nscq.so
W0215 15:58:47.083777 12695 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0215 15:58:47.083781 12695 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0215 15:58:47.083785 12695 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0215 15:58:47.083789 12695 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0215 15:58:47.083793 12695 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0215 15:58:47.083796 12695 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0215 15:58:47.083799 12695 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0215 15:58:47.083802 12695 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0215 15:58:47.083805 12695 nvc_info.c:402] missing compat32 library libnvoptix.so
W0215 15:58:47.083808 12695 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0215 15:58:47.083959 12695 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0215 15:58:47.083971 12695 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0215 15:58:47.083982 12695 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0215 15:58:47.083996 12695 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0215 15:58:47.084006 12695 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0215 15:58:47.084016 12695 nvc_info.c:424] missing binary nv-fabricmanager
I0215 15:58:47.084032 12695 nvc_info.c:342] listing firmware path /usr/lib/firmware/nvidia/470.103.01/gsp.bin
I0215 15:58:47.084045 12695 nvc_info.c:522] listing device /dev/nvidiactl
I0215 15:58:47.084048 12695 nvc_info.c:522] listing device /dev/nvidia-uvm
I0215 15:58:47.084052 12695 nvc_info.c:522] listing device /dev/nvidia-uvm-tools
I0215 15:58:47.084055 12695 nvc_info.c:522] listing device /dev/nvidia-modeset
W0215 15:58:47.084068 12695 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0215 15:58:47.084080 12695 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0215 15:58:47.084090 12695 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0215 15:58:47.084093 12695 nvc_info.c:815] requesting device information with ''
I0215 15:58:47.089567 12695 nvc_info.c:706] listing device /dev/nvidia0 (GPU-08283365-4b53-3311-bff5-d5c37f82021d at 00000000:01:00.0)
NVRM version: 470.103.01
CUDA version: 11.4

Device Index: 0
Device Minor: 0
Model: Quadro P400
Brand: Quadro
GPU UUID: GPU-08283365-4b53-3311-bff5-d5c37f82021d
Bus Location: 00000000:01:00.0
Architecture: 6.1
I0215 15:58:47.089598 12695 nvc.c:430] shutting down library context
I0215 15:58:47.089625 12699 rpc.c:95] terminating nvcgo rpc service
I0215 15:58:47.089927 12695 rpc.c:135] nvcgo rpc service terminated successfully
I0215 15:58:47.090430 12697 rpc.c:95] terminating driver rpc service
I0215 15:58:47.090551 12695 rpc.c:135] driver rpc service terminated successfully

 - [x] Kernel version from `uname -a`

Linux satellite 5.16.8-1-default #1 SMP PREEMPT Thu Feb 10 11:31:59 UTC 2022 (5d1f5d2) x86_64 x86_64 x86_64 GNU/Linux

 - [x] Driver information from `nvidia-smi -a`

$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Tue Feb 15 17:01:29 2022
Driver Version : 470.103.01
CUDA Version : 11.4

Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : Quadro P400
    Product Brand : Quadro
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Disabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1422521034591
    GPU UUID : GPU-08283365-4b53-3311-bff5-d5c37f82021d
    Minor Number : 0
    VBIOS Version : 86.07.8F.00.02
    MultiGPU Board : No
    Board ID : 0x100
    GPU Part Number : 900-5G178-1701-000
    Module ID : 0
    Inforom Version
        Image Version : G178.0500.00.02
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1CB310DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0x11BE10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : 34 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 2000 MiB
        Used : 2 MiB
        Free : 1998 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 4 MiB
        Free : 252 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A

 - [x] Podman version from `podman version`

$ podman version
Version:      3.4.4
API Version:  3.4.4
Go Version:   go1.13.15
Built:        Thu Dec 9 01:00:00 2021
OS/Arch:      linux/amd64

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

$ nvidia-container-cli -V
cli-version: 1.8.0
lib-version: 1.8.0
build date: 2022-02-04T09:21+00:00
build revision: 05959222fe4ce312c121f30c9334157ecaaee260
build compiler: gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

klueska commented 2 years ago

If you set no-cgroups = true then nvidia-docker will not set up the cgroups for any of your GPUs, and NVML will not be able to talk to them (unless you pass the device references yourself on the podman command line). Can you explain what you are trying to do by turning this flag on?
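
For illustration, explicitly passing the device references on the podman command line would look roughly like the following (device paths taken from the /dev/nvidia* listing later in this thread; a sketch only, not verified on this setup):

$ podman run --rm --security-opt=label=disable \
     --hooks-dir=/usr/share/containers/oci/hooks.d/ \
     --device /dev/nvidia0 --device /dev/nvidiactl \
     --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
     --device /dev/nvidia-modeset \
     nvidia/cuda:11.0-base nvidia-smi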

RlndVt commented 2 years ago

I was following the steps as laid out here:

To be able to run rootless containers with podman, we need the following configuration change to the NVIDIA runtime: sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml

Without toggling no-cgroups (i.e. leaving no-cgroups = false):

$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/      nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-dba911758792809762f61b4a5819b849b63a003176bf052674c5b9b533ea701e.scope 
Error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: OCI permission denied

Edit:

(unless you pass the device references yourself on the podman command line)

If this is the solution, how do I pass the device references?

klueska commented 2 years ago

Ah right, with podman you can have no-cgroups=true and not explicitly pass the device list (because podman will infer the cgroups that need to be set up for the bind-mounted dev files that get passed in). With docker / containerd this is not the case, and I got confused.

Given the error that you have, it appears that you are attempting to run this on a system with cgroupv2 enabled. The fact that you set no-cgroups = true, though, means that you should not be going down this path (unless of course there is a bug in the code that allows this).

I will need to look into it further and get back to you.
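
For what it's worth, a quick way to confirm which cgroup version the host is running (standard checks, nothing specific to this setup):

$ stat -fc %T /sys/fs/cgroup/    # cgroup2fs means the unified cgroup v2 hierarchy, tmpfs means legacy v1
$ podman info | grep -i cgroup   # podman also reports the cgroup manager and version it detected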

RlndVt commented 2 years ago

For the record, I have no preference for running with no-cgroups=true. If rootless can work without it, that is fine by me.

klueska commented 2 years ago

Can you give me a bit more info about your setup? I spun up an Ubuntu 21.10 image on AWS, installed podman on it, set no-cgroups=true, and was able to get things working as expected:

$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
$ podman run --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   docker.io/nvidia/cuda:11.0-base nvidia-smi
Wed Feb 16 12:24:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    36W / 300W |      0MiB / 16384MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I didn't try on OpenSUSE Tumbleweed, but I can't imagine what would be different there in this regard.

Can you update to version 1.8.1 of the toolkit (released on Monday) and see if one of the bugs we fixed there was relevant to your issue?
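
On a regular Tumbleweed install that update would be something along these lines (assuming the usual package names from the libnvidia-container repo; a sketch, not verified here):

$ sudo zypper refresh
$ sudo zypper update nvidia-container-toolkit libnvidia-container-tools libnvidia-container1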

RlndVt commented 2 years ago

Can you give me a bit more info about your setup?

Specifically, this is on a system running transactional-server Tumbleweed, so with an immutable root. (This caused an issue, recently fixed, where some mounts produced an error because they could not be mounted rw.) To install nvidia-container-toolkit I had to add the 15.1 repo (which I swear was called 15.x when I did) from here: https://nvidia.github.io/nvidia-docker/.
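
For reference, adding the repo and updating the toolkit on a transactional system looks roughly like this (the opensuse-leap15.1 distribution string and the transactional-update invocation are assumptions, not copied from this machine); the resulting repo list is below:

$ sudo zypper ar https://nvidia.github.io/nvidia-docker/opensuse-leap15.1/nvidia-docker.repo
$ sudo transactional-update pkg update nvidia-container-toolkit   # installs into a new snapshot
$ sudo reboot                                                     # boot into that snapshot to use it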

$ zypper lr -pr
#  | Alias                                 | Name                                  | Enabled | GPG Check | Refresh | Priority
---+---------------------------------------+---------------------------------------+---------+-----------+---------+---------
 1 | NVIDIA                                | NVIDIA                                | Yes     | (r ) Yes  | Yes     |   99
 2 | libnvidia-container                   | libnvidia-container                   | Yes     | (r ) Yes  | No      |   99
 3 | libnvidia-container-experimental      | libnvidia-container-experimental      | No      | ----      | ----    |   99
 4 | nvidia-container-runtime              | nvidia-container-runtime              | Yes     | (r ) Yes  | No      |   99
 5 | nvidia-container-runtime-experimental | nvidia-container-runtime-experimental | No      | ----      | ----    |   99
 6 | nvidia-docker                         | nvidia-docker                         | Yes     | (r ) Yes  | No      |   99
 7 | openSUSE-20211107-0                   | openSUSE-20211107-0                   | No      | ----      | ----    |   99
 8 | repo-debug                            | openSUSE-Tumbleweed-Debug             | No      | ----      | ----    |   99
 9 | repo-non-oss                          | openSUSE-Tumbleweed-Non-Oss           | Yes     | (r ) Yes  | Yes     |   99
10 | repo-oss                              | openSUSE-Tumbleweed-Oss               | Yes     | (r ) Yes  | Yes     |   99
11 | repo-source                           | openSUSE-Tumbleweed-Source            | No      | ----      | ----    |   99
12 | repo-update                           | openSUSE-Tumbleweed-Update            | Yes     | (r ) Yes  | Yes     |   99

$ zypper se -s nvidia-container-toolkit
Loading repository data...
Reading installed packages...

S | Name                     | Type    | Version | Arch   | Repository
--+--------------------------+---------+---------+--------+-------------------------
i | nvidia-container-toolkit | package | 1.8.1-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.8.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.7.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.6.0-1 | x86_64 | libnvidia-container
v | nvidia-container-toolkit | package | 1.5.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.5.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.2-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.4.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.3.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.2.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.2.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.2-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.1-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.1.0-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.0.5-1 | x86_64 | nvidia-container-runtime
v | nvidia-container-toolkit | package | 1.0.4-1 | x86_64 | nvidia-container-runtime

Can you update to version 1.8.1 of the toolkit (released on Monday) and see if one of the bugs we fixed there was relevant to your issue?

$ nvidia-container-cli -V
cli-version: 1.8.1
lib-version: 1.8.1
build date: 2022-02-14T12:05+00:00
build revision: abd4e14d8cb923e2a70b7dcfee55fbc16bffa353
build compiler: gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-f74ea1c8a0116a3bcb970f03eb97b400c65b80a77dfefb997f6faf776c3e1982.scope 
INFO[0004] Got Conmon PID as 27212                      
Failed to initialize NVML: Insufficient Permissions
INFO[0004] Container f74ea1c8a0116a3bcb970f03eb97b400c65b80a77dfefb997f6faf776c3e1982 was already removed, skipping --rm

klueska commented 2 years ago

I will try and bring up a similar setup soon. Could you check if maybe this is relevant in the meantime: https://github.com/NVIDIA/nvidia-docker/issues/1547#issuecomment-1041565769

RlndVt commented 2 years ago

I had seen that issue, and I got the best results (I think) while using user=root:root. I see that in your attempt you do not specify a user; I haven't tried that yet.

$ ls -l /dev/nvidia*
crw-rw---- 1 root video 195,   0 Feb 15 19:37 /dev/nvidia0
crw-rw---- 1 root video 195, 255 Feb 15 19:37 /dev/nvidiactl
crw-rw---- 1 root video 195, 254 Feb 15 19:37 /dev/nvidia-modeset
crw-rw---- 1 root video 238,   0 Feb 15 19:37 /dev/nvidia-uvm
crw-rw---- 1 root video 238,   1 Feb 15 19:37 /dev/nvidia-uvm-tools
$ groups
[me] users wheel video
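
Given those permissions (0660, root:video, and the user is in video), a quick host-side sanity check is whether the unprivileged user can open the device nodes at all; access inside the rootless user namespace can still differ, since supplementary groups are remapped there. A minimal check:

$ for d in /dev/nvidia*; do [ -r "$d" ] && [ -w "$d" ] && echo "$d: ok" || echo "$d: no access"; done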

With user=root:video and without a user specified I get the same result:

$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   nvidia/cuda:11.0-base nvidia-smi
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-94a161a6829d7307ffb7ff7288f8ebe83475cf37e65bd487c425081f84ac9252.scope 
Error: OCI runtime error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: insufficient permissions
$ podman run --log-level=info --rm --security-opt=label=disable      --hooks-dir=/usr/share/containers/oci/hooks.d/   nvidia/cuda:11.0-base
INFO[0000] podman filtering at log level info           
INFO[0000] Found CNI network podman (type=bridge) at /home/[me]/.config/cni/net.d/87-podman.conflist 
INFO[0000] Setting parallel job count to 25             
INFO[0000] Running conmon under slice user.slice and unitName libpod-conmon-46c6aada867c650652a20e26ad81001dd592029b6a319a3f7b520afc309e7a18.scope 
Error: OCI runtime error: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: insufficient permissions