"Failed to initialize NVML: Insufficient Permissions" when running nvidia-smi in nvidia/cuda docker

JohanAR commented 2 years ago

Problem:

~ ❯❯❯ docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Failed to initialize NVML: Insufficient Permissions

Background: I replaced moby-engine on Fedora 36 with docker-ce from https://download.docker.com/linux/fedora/docker-ce.repo because I thought that was necessary to use nvidia docker. That worked perfectly fine for a few days until it stopped working after an update, so I followed @elezar's tip and tried running it with moby instead. I removed all the packages that came from docker-ce and disabled that repo, then installed moby-engine and nvidia-container-toolkit. Running nvidia-smi in docker no longer works (it did with docker-ce and nvidia-docker2), apparently because of SELinux denials (see the syslog below). However, it seems I could still access the GPU from a different docker image, which was running Stable Diffusion.

Possibly not related, but when installing the package container-selinux (a dependency of moby-engine), it freezes for close to 10 minutes while running a scriptlet. After that it continues and appears to succeed.

System info:

~ ❯❯❯ uname -a
Linux johan-pc 5.19.8-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Sep 8 19:02:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
~ ❯❯❯ cat /etc/os-release
NAME="Fedora Linux"
VERSION="36 (KDE Plasma)"
ID=fedora
VERSION_ID=36
VERSION_CODENAME=""
PLATFORM_ID="platform:f36"
PRETTY_NAME="Fedora Linux 36 (KDE Plasma)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:36"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f36/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=36
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=36
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="KDE Plasma"
VARIANT_ID=kde
~ ❯❯❯ dnf list installed | grep nvidia
akmod-nvidia.x86_64                                  3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
kmod-nvidia-5.19.6-200.fc36.x86_64.x86_64            3:515.65.01-1.fc36                  @@commandline             
kmod-nvidia-5.19.7-200.fc36.x86_64.x86_64            3:515.65.01-1.fc36                  @@commandline             
kmod-nvidia-5.19.8-200.fc36.x86_64.x86_64            3:515.65.01-1.fc36                  @@commandline             
libnvidia-container-tools.x86_64                     1.11.0-1                            @libnvidia-container      
libnvidia-container1.x86_64                          1.11.0-1                            @libnvidia-container      
nvidia-container-toolkit.x86_64                      1.11.0-1                            @libnvidia-container      
nvidia-container-toolkit-base.x86_64                 1.11.0-1                            @libnvidia-container      
nvidia-gpu-firmware.noarch                           20220815-139.fc36                   @updates                  
nvidia-persistenced.x86_64                           3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
nvidia-settings.x86_64                               3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia.x86_64                           3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-cuda.x86_64                      3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-cuda-libs.i686                   3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-cuda-libs.x86_64                 3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-kmodsrc.x86_64                   3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-libs.i686                        3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-libs.x86_64                      3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
xorg-x11-drv-nvidia-power.x86_64                     3:515.65.01-1.fc36                  @rpmfusion-nonfree-updates
~ ❯❯❯ dnf list installed | grep moby
moby-engine.x86_64                                   20.10.18-1.fc36                     @updates                  
~ ❯❯❯ dnf list installed | grep containerd
containerd.x86_64                                    1.6.8-2.fc36                        @updates

Syslog:

Sep 16 14:07:43 johan-pc audit[6779]: AVC avc:  denied  { read } for  pid=6779 comm="nvidia-smi" name="params" dev="tmpfs" ino=2 scontext=system_u:system_r:container_t:s0:c56,c465 tcontext=system_u:object_r:container_runtime_tmpfs_t:s0 tclass=file permissive=0
Sep 16 14:07:43 johan-pc audit[6779]: AVC avc:  denied  { getattr } for  pid=6779 comm="nvidia-smi" path="/dev/nvidiactl" dev="devtmpfs" ino=1124 scontext=system_u:system_r:container_t:s0:c56,c465 tcontext=system_u:object_r:xserver_misc_device_t:s0 tclass=chr_file permissive=0
Sep 16 14:07:43 johan-pc audit[6779]: AVC avc:  denied  { read } for  pid=6779 comm="nvidia-smi" name="params" dev="tmpfs" ino=2 scontext=system_u:system_r:container_t:s0:c56,c465 tcontext=system_u:object_r:container_runtime_tmpfs_t:s0 tclass=file permissive=0
Sep 16 14:07:43 johan-pc audit[6779]: AVC avc:  denied  { getattr } for  pid=6779 comm="nvidia-smi" path="/dev/nvidiactl" dev="devtmpfs" ino=1124 scontext=system_u:system_r:container_t:s0:c56,c465 tcontext=system_u:object_r:xserver_misc_device_t:s0 tclass=chr_file permissive=0
Sep 16 14:07:43 johan-pc audit[6779]: AVC avc:  denied  { read } for  pid=6779 comm="nvidia-smi" name="nvidiactl" dev="devtmpfs" ino=1124 scontext=system_u:system_r:container_t:s0:c56,c465 tcontext=system_u:object_r:xserver_misc_device_t:s0 tclass=chr_file permissive=0
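
For readers comparing their own logs against the denials above, a minimal sketch that collapses AVC lines into unique "permission class target-type" triples. The function name summarize_avc is hypothetical (it is not part of any NVIDIA or SELinux tool), and the parsing assumes the field layout shown in the syslog excerpt above.

```shell
# Hypothetical helper: summarize AVC denials read from stdin.
# Prints one "permissions tclass target-type" line per unique combination.
# Assumes the "avc:  denied  { perm ... } ... tcontext=u:r:t:level tclass=X" layout
# seen in the log excerpt; lines that are not AVC denials are ignored.
summarize_avc() {
  awk '/avc: +denied/ {
    perms = ""; ttype = ""; tclass = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "{") {
        # Collect every permission between the braces, comma-separated.
        for (j = i + 1; j <= NF && $j != "}"; j++)
          perms = perms (perms == "" ? "" : ",") $j
      } else if ($i ~ /^tcontext=/) {
        # tcontext=user:role:type:level -> the third field is the SELinux type.
        split($i, ctx, ":"); ttype = ctx[3]
      } else if ($i ~ /^tclass=/) {
        tclass = substr($i, 8)
      }
    }
    print perms, tclass, ttype
  }' | sort -u
}
```

Piping the journal into it (for example `journalctl -b | summarize_avc`) should reduce the five lines above to three distinct denials: read on a container_runtime_tmpfs_t file, and getattr and read on the xserver_misc_device_t character device /dev/nvidiactl.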

JohanAR commented 2 years ago

Running this command allows me to run nvidia-smi in docker:

setsebool -P container_use_devices 1
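
A minimal sketch for checking the boolean's current state before changing it. The helper name sebool_is_on is hypothetical; the only assumption about getsebool is its usual "name --> on|off" output line. Note that container_use_devices is a broad switch that affects device access for all containers, so it may be worth confirming its state rather than setting it blindly.

```shell
# Hypothetical helper: succeed if a line of `getsebool` output reports "on",
# e.g. "container_use_devices --> on".
sebool_is_on() {
  case "$1" in
    *"--> on") return 0 ;;
    *) return 1 ;;
  esac
}

# Only query the host when the SELinux tools are actually installed.
if command -v getsebool >/dev/null 2>&1; then
  state=$(getsebool container_use_devices 2>/dev/null || true)
  if sebool_is_on "$state"; then
    echo "container_use_devices is already on"
  else
    echo "container_use_devices is off or unknown; to enable it persistently:"
    echo "  sudo setsebool -P container_use_devices 1"
  fi
fi
```

The change can be reverted with `setsebool -P container_use_devices 0`.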

But then Stable Diffusion says that it runs out of VRAM when starting up. It worked perfectly fine before, despite the audit denials. I have no idea what's going on :(

elezar commented 2 years ago

@JohanAR could you downgrade to NVIDIA Container Toolkit 1.10.0 (including the libnvidia-container* packages) to check whether this is a regression in the new version of the toolkit?

JohanAR commented 2 years ago

@elezar I downgraded, rebooted and tried again, but there was no difference.

/s/P/AUTOMATIC111-sd-webui ❯❯❯ rpm -qa | grep nvidia
xorg-x11-drv-nvidia-kmodsrc-515.65.01-1.fc36.x86_64
xorg-x11-drv-nvidia-cuda-libs-515.65.01-1.fc36.x86_64
xorg-x11-drv-nvidia-libs-515.65.01-1.fc36.x86_64
nvidia-settings-515.65.01-1.fc36.x86_64
xorg-x11-drv-nvidia-power-515.65.01-1.fc36.x86_64
xorg-x11-drv-nvidia-515.65.01-1.fc36.x86_64
nvidia-persistenced-515.65.01-1.fc36.x86_64
xorg-x11-drv-nvidia-libs-515.65.01-1.fc36.i686
xorg-x11-drv-nvidia-cuda-libs-515.65.01-1.fc36.i686
kmod-nvidia-5.19.7-200.fc36.x86_64-515.65.01-1.fc36.x86_64
kmod-nvidia-5.19.8-200.fc36.x86_64-515.65.01-1.fc36.x86_64
xorg-x11-drv-nvidia-cuda-515.65.01-1.fc36.x86_64
akmod-nvidia-515.65.01-1.fc36.x86_64
nvidia-vaapi-driver-0.0.6-11.fc36.x86_64
nvidia-gpu-firmware-20220815-139.fc36.noarch
nvidia-modprobe-515.65.01-1.fc36.x86_64
kmod-nvidia-515.65.01-1.fc36.x86_64
kmod-nvidia-5.19.9-200.fc36.x86_64-515.65.01-1.fc36.x86_64
nvidia-xconfig-515.65.01-1.fc36.x86_64
libnvidia-container1-1.10.0-1.x86_64
libnvidia-container-tools-1.10.0-1.x86_64
nvidia-container-toolkit-1.10.0-1.x86_64

/s/P/AUTOMATIC111-sd-webui ❯❯❯ docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
Failed to initialize NVML: Insufficient Permissions

Uninstalling moby-engine and switching to docker-ce again works fine with the 1.10.0-1 versions. The above command to run nvidia-smi worked immediately, but I had to recreate my other docker images before they could access CUDA. Maybe that's normal for SELinux.

JohanAR commented 2 years ago

Don't know if it's relevant, but there was an update to the container-selinux package. It sounds like it could be related to SELinux permissions for docker containers, but that's just a guess. I haven't had the time or motivation to try going back to moby-engine since it's currently working for me.

NVIDIA / nvidia-container-toolkit

"Failed to initialize NVML: Insufficient Permissions" when running nvidia-smi in nvidia/cuda docker #33