canonical / nvidia-core22

GNU General Public License v3.0

NVIDIA runtime doesn't show up in Docker snap on Ubuntu Core 22 #18

Open betsegaw opened 5 months ago

betsegaw commented 5 months ago

I have installed Ubuntu Core 22 on a PC that has an NVIDIA 3090 GPU. This setup worked fine with Docker when I had Ubuntu Server installed, but since moving to Ubuntu Core 22 I have not been able to see the nvidia runtime in the docker snap. Also, the output of lspci (included below) seems to indicate that the card hasn't been recognized as a 3D controller. Since I don't see any issues, perhaps I am missing a step?

Some relevant info

Output of nvidia-core22.smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:21:00.0 Off |                  N/A |
|  0%   37C    P8              29W / 420W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Output of ls -la /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.08

-rw-r--r-- 1 root root 1946840 Mar  5 22:13 /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.08

Output of sudo docker info | grep Runtime

Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
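
A quick way to script this check is to look for the runtime name on the Runtimes line. The sketch below is only illustrative: has_runtime is a hypothetical helper (not part of Docker), and it assumes docker info prints the runtimes as a space-separated list on a single line, as in the output above.

```shell
#!/bin/sh
# has_runtime NAME: succeed if NAME appears on the "Runtimes:" line of
# `docker info` output read from stdin. Hypothetical helper, not part of Docker.
has_runtime() {
    grep '^ *Runtimes:' | tr ' ' '\n' | grep -qxF "$1"
}

# Sample taken from the output above; on a real system pipe in `sudo docker info`.
sample='Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc'

if printf '%s\n' "$sample" | has_runtime nvidia; then
    echo "nvidia runtime is registered"
else
    echo "nvidia runtime is missing"
fi
```

On a live system the equivalent call would be: sudo docker info | has_runtime nvidia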

Snaps installed on Ubuntu Core 22

Name                     Version                Rev    Tracking            Publisher   Notes
core                     16-2.61.2              16928  latest/stable       canonical✓  core
core20                   20240416               2318   latest/stable       canonical✓  base
core22                   20240408               1380   latest/stable       canonical✓  base
docker                   24.0.5                 2915   latest/stable       canonical✓  -
k9s                      v0.27.4                155    latest/stable       derailed    -
microk8s                 v1.30.0                6783   1.30-strict/stable  canonical✓  -
nvidia-assemble          3-36-gb8b0680          62     22/stable           xnox        -
nvidia-core22            535.161.08+mesa23.2.1  40     latest/stable       canonical✓  -
pc                       22-0.3                 146    22/stable           canonical✓  gadget
pc-kernel                5.15.0-107.117.1       1833   22/stable           canonical✓  kernel
pciutils                 3.3.1-3                3      latest/stable       woodrow     -
snapd                    2.62                   21465  latest/stable       canonical✓  snapd

Output of sudo snap logs nvidia-assemble

2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[1767]: + mknod -m 666 /dev/nvidiactl c 195 255
2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[1767]: + mknod -m 666 /dev/nvidia-modeset c 195 254
2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[2226]: + sed -n s|^\([0-9]*\) nvidia-uvm$|\1|p /proc/devices
2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[1767]: + major=504
2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[1767]: + [ -n 504 ]
2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[1767]: + mknod -m 666 /dev/nvidia-uvm c 504 0
2024-05-18T05:37:17Z nvidia-assemble.nvidia-assemble[1767]: + mknod -m 666 /dev/nvidia-uvm-tools c 504 1
2024-05-18T05:37:17Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Deactivated successfully.
2024-05-18T05:37:17Z systemd[1]: Finished Service for snap application nvidia-assemble.nvidia-assemble.
2024-05-18T05:37:17Z systemd[1]: snap.nvidia-assemble.nvidia-assemble.service: Consumed 1.273s CPU time.
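
For readers following the trace: the sed call looks up the dynamically assigned major number for nvidia-uvm in /proc/devices, and the two mknod calls then create the device nodes with that major. A minimal sketch of that lookup, run against sample text instead of the real /proc/devices, could look like this. The device_major helper name is made up for illustration, and the pattern is loosened slightly to tolerate the leading spaces /proc/devices uses for small numbers.

```shell
#!/bin/sh
# device_major NAME: print the major number registered for NAME.
# Reads /proc/devices-style text ("<major> <name>" per line) on stdin.
# Hypothetical helper mirroring the sed expression in the log above.
device_major() {
    sed -n "s|^ *\([0-9]*\) $1\$|\1|p"
}

# Sample input; on a real system use: device_major nvidia-uvm < /proc/devices
sample='Character devices:
  1 mem
504 nvidia-uvm'

major=$(printf '%s\n' "$sample" | device_major nvidia-uvm)
if [ -n "$major" ]; then
    # Only print the commands here; running mknod needs root.
    echo "mknod -m 666 /dev/nvidia-uvm c $major 0"
    echo "mknod -m 666 /dev/nvidia-uvm-tools c $major 1"
fi
```

Since the log shows these steps completing, the device nodes themselves do not look like the cause of the missing runtime.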

Error found when running sudo snap logs docker

failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.161.08: pattern libcuda.so.535.161.08 not found"

Output of sudo pciutils.lspci -v | grep -i nvidia

21:00.0 VGA compatible controller: NVIDIA Corporation Device 2204 (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nvidia_drm, nvidia
21:00.1 Audio device: NVIDIA Corporation Device 1aef (rev a1)

Truncated output of modinfo nvidia

filename:       /lib/modules/5.15.0-107-generic/kernel/nvidia-535srv/nvidia.ko
firmware:       nvidia/535.161.08/gsp_tu10x.bin
firmware:       nvidia/535.161.08/gsp_ga10x.bin
alias:          char-major-195-*
version:        535.161.08
supported:      external
license:        NVIDIA
srcversion:     1C6DE25E8197E808964F4CF
alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        drm
retpoline:      Y
name:           nvidia
vermagic:       5.15.0-107-generic SMP mod_unload modversions

jocado commented 5 months ago

Hi @betsegaw

It seems more of a docker/nvidia-ctk issue, as the output from nvidia-core22.smi looks reasonable.

The output of the following would be useful to help narrow it down:

ackanir commented 2 months ago

Hello, I'm jumping in as I have the exact same issue. The CDI setup fails during Docker Snap installation, because some libs are not found.

These files do not exist. Their parent folders exist but are empty. This is not surprising: reading the docker snap source code, there is a cleanup step when the setup fails (see # Setup failure recovery #).

Reading further in the source code, the cdi_generate function does some setup to look for the libs inside "${SNAP}/graphics". This should be provided by the graphics-core22 snap interface, which according to previous logs appears to be properly connected.

The CDI error mentions libcuda.so.535.183.01 not found, but I also confirm the file exists:

ls -la /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/ | grep libcuda.so
lrwxrwxrwx 1 root root        12 May 23 10:28 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root        21 May 23 10:28 libcuda.so.1 -> libcuda.so.535.183.01
-rw-r--r-- 1 root root  29380816 May 12 19:53 libcuda.so.535.183.01
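
The same check can be scripted by following the symlink chain to its end and confirming that a regular file sits there. This is a rough sketch only: resolve_lib is a hypothetical helper, it relies on readlink -f being available (GNU coreutils or BusyBox), and the demo builds a throwaway directory with the same three names instead of touching the real snap paths.

```shell
#!/bin/sh
# resolve_lib PATH: follow symlinks from PATH and print the final target,
# failing if that target is not an existing regular file.
# Hypothetical helper; needs a readlink that supports -f.
resolve_lib() {
    target=$(readlink -f "$1") || return 1
    [ -f "$target" ] && printf '%s\n' "$target"
}

# Throwaway layout mirroring the listing above.
demo=$(mktemp -d)
: > "$demo/libcuda.so.535.183.01"
ln -s libcuda.so.535.183.01 "$demo/libcuda.so.1"
ln -s libcuda.so.1 "$demo/libcuda.so"

resolve_lib "$demo/libcuda.so"   # prints the path ending in libcuda.so.535.183.01
rm -rf "$demo"
```

Pointing the helper at both the nvidia-core22 path and the directory the docker snap searches would show whether the library is visible from both sides.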

Checking the journalctl logs, I can see some AppArmor errors related to nvidia-ctk during the setup:

Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/195:255" pid=2235 comm="nvidia-ctk" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/195:0" pid=2235 comm="nvidia-ctk" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/monitor" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/monitor" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/monitor" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/508:0" pid=2235 comm="nvidia-ctk" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
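
Because several of these records repeat, it can help to collapse them into one line per denied operation and path. The following is a small sketch using only grep, sed, sort and uniq. The summarize_denials name is made up for illustration, and the sample records are shortened copies of the ones above.

```shell
#!/bin/sh
# summarize_denials: read log text on stdin and print a count of AppArmor
# denials per (operation, path) pair, most frequent first.
# Hypothetical helper; it only parses text and changes nothing on the system.
summarize_denials() {
    grep 'apparmor="DENIED"' |
        sed -n 's/.*operation="\([^"]*\)".*name="\([^"]*\)".*/\1 \2/p' |
        sort | uniq -c | sort -rn
}

# Shortened sample lines in the same shape as the audit records above.
summarize_denials <<'EOF'
audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk"
audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk"
audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/195:255" pid=2235 comm="nvidia-ctk"
EOF
```

On a live system the input would come from the journal instead of the here-document, for example: journalctl | summarize_denials. For the nine records above this boils down to five distinct denials: three symlink creations under /dev/char and reads of the two MIG capability files.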

I edited /var/lib/snapd/apparmor/profiles/snap.docker.nvidia-container-toolkit to remove these errors, but the issue still persists.

ackanir commented 2 months ago

Update: I think this was fixed on the Docker snap side, as installing the current beta version fixes the issue (at least on my side):

sudo snap install docker --channel latest/beta

jocado commented 2 months ago

Hi @ackanir

Thanks for pointing out that it works OK in the current beta revision.

I didn't realise that docker snap revision 2932 (for the amd64 architecture) was not in stable yet. It fixes a few issues with the initial implementation of the NVIDIA support, and also adds NVIDIA support for classic systems.

I've asked about promoting it here

I think we should close this issue, as it's really about the docker snap.

jocado commented 1 month ago

FYI - revision 2932 is now in stable.

I suggest closing this, and opening a new issue in the docker-snap repo if you believe you have a Docker-related issue.
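
For anyone landing here later, one way to tell whether an installed docker snap already includes the fix is to compare its Rev column against 2932. This is a sketch under assumptions: docker_rev_at_least is a hypothetical helper, it expects the column layout shown in the snap list output earlier in this thread, and the 2932 threshold applies to amd64 as mentioned above.

```shell
#!/bin/sh
# docker_rev_at_least MIN: read `snap list` output on stdin and succeed if the
# docker snap's Rev column is a number >= MIN. Hypothetical helper.
# Locally installed snaps with revisions like "x1" are treated as too old.
docker_rev_at_least() {
    awk -v min="$1" '
        $1 == "docker" { found = 1; ok = ($3 + 0 >= min + 0) }
        END { exit !(found && ok) }
    '
}

# Row copied from the snap list output earlier in this thread.
sample='Name    Version  Rev   Tracking       Publisher   Notes
docker  24.0.5   2915  latest/stable  canonical✓  -'

if printf '%s\n' "$sample" | docker_rev_at_least 2932; then
    echo "docker snap is at revision 2932 or later"
else
    echo "docker snap is older than revision 2932"
fi
```

On a live system: snap list docker | docker_rev_at_least 2932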