Open betsegaw opened 5 months ago
Hi @betsegaw
It seems more of a docker/nvidia-ctk issue, as the output from nvidia-core22.smi looks reasonable.
The output of the following commands would help narrow it down:
snap connections docker
snap logs -n 80 docker.nvidia-container-toolkit
cat /var/snap/docker/current/config/daemon.json
cat /var/snap/docker/current/etc/nvidia-container-runtime/config.toml
cat /var/snap/docker/current/etc/cdi/nvidia.yaml
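The commands above can be gathered in one pass so that the whole report can be pasted into the issue. This is only a minimal sketch: the collect helper is a hypothetical name introduced here, it is not part of the docker snap or nvidia-ctk, and some of the commands may need sudo on your system.

```shell
#!/bin/sh
# Hypothetical helper: run each command given as an argument, print its
# output under a header, and keep going if one fails (for example when
# a config file does not exist).
collect() {
    for cmd in "$@"; do
        printf '===== %s =====\n' "$cmd"
        sh -c "$cmd" 2>&1 || printf '(exit status %s)\n' "$?"
        printf '\n'
    done
}

# Only attempt the real commands on a machine that has snapd.
if command -v snap >/dev/null 2>&1; then
    collect \
        "snap connections docker" \
        "snap logs -n 80 docker.nvidia-container-toolkit" \
        "cat /var/snap/docker/current/config/daemon.json" \
        "cat /var/snap/docker/current/etc/nvidia-container-runtime/config.toml" \
        "cat /var/snap/docker/current/etc/cdi/nvidia.yaml"
fi
```

Redirecting the script's output to a file gives a single attachment; a missing file shows up as a cat error followed by its exit status instead of stopping the run.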
Hello, I'm jumping in as I have the exact same issue. The CDI setup fails during Docker Snap installation, because some libs are not found.
snap connections docker
Interface Plug Slot Notes
content - docker:docker-executables -
content - docker:docker-registry-certificates -
content[graphics-core22] docker:graphics-core22 nvidia-core22:graphics-core22 -
docker docker:docker-cli docker:docker-daemon -
docker-support docker:privileged :docker-support -
docker-support docker:support :docker-support -
firewall-control docker:firewall-control :firewall-control -
home docker:home :home -
log-observe docker:log-observe - -
network docker:network :network -
network-bind docker:network-bind :network-bind -
network-control docker:network-control :network-control -
opengl docker:opengl :opengl -
removable-media docker:removable-media - -
Note that the graphics-core22 connection is in place.
snap logs -n 80 docker.nvidia-container-toolkit
2024-08-23T08:59:13Z systemd[1]: Starting Service for snap application docker.nvidia-container-toolkit...
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: NVIDIA hardware detected: 01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] (rev a1)
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: 01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: Waiting for device to become available: /dev/nvidiactl
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: Checking device: 0/10
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: Device found
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: NVIDIA ready
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=info msg="Auto-detected mode as \"nvml\""
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=info msg="Selecting /dev/nvidia0 as /dev/nvidia0"
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=info msg="Selecting /dev/dri/card0 as /dev/dri/card0"
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=warning msg="Could not locate /dev/dri/controlD64: pattern /dev/dri/controlD64 not found"
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=info msg="Selecting /dev/dri/renderD128 as /dev/dri/renderD128"
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=info msg="Using driver version 535.183.01"
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[2031]: time="2024-08-23T08:59:13Z" level=error msg="failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.183.01: pattern libcuda.so.535.183.01 not found"
2024-08-23T08:59:13Z docker.nvidia-container-toolkit[1992]: WARNING: Conainter Toolkit setup seemed to fail with an error
2024-08-23T08:59:13Z systemd[1]: snap.docker.nvidia-container-toolkit.service: Deactivated successfully.
2024-08-23T08:59:13Z systemd[1]: Finished Service for snap application docker.nvidia-container-toolkit.
2024-08-23T08:59:14Z systemd[1]: Starting Service for snap application docker.nvidia-container-toolkit...
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: NVIDIA hardware detected: 01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080 Ti] (rev a1)
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: 01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: Waiting for device to become available: /dev/nvidiactl
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: Checking device: 0/10
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: Device found
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: NVIDIA ready
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=info msg="Auto-detected mode as \"nvml\""
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=info msg="Selecting /dev/nvidia0 as /dev/nvidia0"
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=info msg="Selecting /dev/dri/card0 as /dev/dri/card0"
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=warning msg="Could not locate /dev/dri/controlD64: pattern /dev/dri/controlD64 not found"
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=info msg="Selecting /dev/dri/renderD128 as /dev/dri/renderD128"
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=info msg="Using driver version 535.183.01"
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2235]: time="2024-08-23T08:59:14Z" level=error msg="failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for common entities: failed to create discoverer for driver files: failed to create discoverer for driver libraries: failed to get libraries for driver version: failed to locate libcuda.so.535.183.01: pattern libcuda.so.535.183.01 not found"
2024-08-23T08:59:14Z docker.nvidia-container-toolkit[2199]: WARNING: Conainter Toolkit setup seemed to fail with an error
2024-08-23T08:59:14Z systemd[1]: snap.docker.nvidia-container-toolkit.service: Deactivated successfully.
2024-08-23T08:59:14Z systemd[1]: Finished Service for snap application docker.nvidia-container-toolkit.
cat /var/snap/docker/current/config/daemon.json
{
"log-level": "error"
}
cat /var/snap/docker/current/etc/nvidia-container-runtime/config.toml
cat /var/snap/docker/current/etc/cdi/nvidia.yaml
These files do not exist. Their parent folders exist but are empty.
This is not surprising: reading the docker snap source code, there is a cleanup step that runs when the setup fails (see # Setup failure recovery #).
Reading this source code further, the cdi_generate function does some setup to look for the libs inside "${SNAP}/graphics". This directory should be provided by the graphics-core22 Snap interface, which, according to the connection list above, appears to be properly connected.
The CDI error mentions "libcuda.so.535.183.01 not found", but I can confirm the file exists:
ls -la /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/ | grep libcuda.so
lrwxrwxrwx 1 root root 12 May 23 10:28 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 21 May 23 10:28 libcuda.so.1 -> libcuda.so.535.183.01
-rw-r--r-- 1 root root 29380816 May 12 19:53 libcuda.so.535.183.01
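The listing above shows the host's view of the nvidia-core22 snap, while the error comes from a search done inside the docker snap's confinement. A small helper makes it easy to run the same pattern search against different roots and compare the results. This is a sketch under assumptions: locate_lib is a hypothetical name, and it uses plain find rather than whatever lookup nvidia-ctk performs internally.

```shell
#!/bin/sh
# Hypothetical helper: print every file or symlink below ROOT whose name
# matches PATTERN. Returns non-zero when nothing matches.
locate_lib() {
    root=$1
    pattern=$2
    matches=$(find "$root" -name "$pattern" 2>/dev/null)
    if [ -z "$matches" ]; then
        return 1
    fi
    printf '%s\n' "$matches"
}

# Host view: the same directory as in the ls output above.
host_root=/snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu
if [ -d "$host_root" ]; then
    locate_lib "$host_root" 'libcuda.so.*' ||
        echo "no libcuda found under $host_root"
fi
```

To see what the confined service sees, the same find could be run against "$SNAP/graphics" from a shell started with snap run --shell docker.nvidia-container-toolkit. If the library shows up on the host but not there, the content interface mount is the likely suspect rather than the driver files themselves.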
Checking the journalctl logs, I can see some AppArmor denials related to nvidia-ctk during the setup:
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/195:255" pid=2235 comm="nvidia-ctk" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/195:0" pid=2235 comm="nvidia-ctk" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/config" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/monitor" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/monitor" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="open" profile="snap.docker.nvidia-container-toolkit" name="/proc/driver/nvidia/capabilities/mig/monitor" pid=2235 comm="nvidia-ctk" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Aug 23 08:59:14 audit[2235]: AVC apparmor="DENIED" operation="symlink" profile="snap.docker.nvidia-container-toolkit" name="/dev/char/508:0" pid=2235 comm="nvidia-ctk" requested_mask="c" denied_mask="c" fsuid=0 ouid=0
I edited /var/lib/snapd/apparmor/profiles/snap.docker.nvidia-container-toolkit to remove these errors, but the issue still persists.
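Since many of the denial lines above are repeats, it can help to collapse them into one line per (operation, path) pair with a count before deciding which rules matter. The following is a minimal sketch; summarize_denials is a hypothetical helper name, and it assumes the log lines keep the operation="..." and name="..." fields in the order shown above.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin, keep AppArmor denials,
# and print "count operation path" for each distinct pair.
summarize_denials() {
    grep 'apparmor="DENIED"' |
        sed -n 's/.*operation="\([^"]*\)".* name="\([^"]*\)".*/\1 \2/p' |
        sort | uniq -c | sort -rn
}

# Example (needs permission to read the system journal):
#   journalctl -b --no-pager \
#       | grep 'profile="snap.docker.nvidia-container-toolkit"' \
#       | summarize_denials
```

Applied to the excerpt above, this would reduce the nine lines to five distinct entries: three symlink attempts under /dev/char and repeated reads of the two MIG capability files.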
Update: I think this was fixed on the Docker Snap side, as installing the current beta version fixes the issue (at least on my side):
sudo snap install docker --channel latest/beta
Hi @ackanir
Thanks for pointing out that it works OK in the current beta revision.
I didn't realise that docker snap revision 2932 [ for amd64 arch ] was not in stable yet. It fixes a few issues with the initial implementation of the nvidia support, and also adds nvidia support for classic systems.
I've asked about promoting it here
I think we should close this issue, as it's really about the docker snap.
FYI - revision 2932 is now in stable.
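To check whether an installed docker snap is at or beyond that revision, the Rev column of snap list can be compared against 2932. This is a rough sketch with two assumptions: installed_rev and rev_at_least are hypothetical helper names, and the comparison assumes that later store revisions for the same architecture still carry the fix. The number 2932 applies to amd64 only, as noted above.

```shell
#!/bin/sh
# Hypothetical helper: read `snap list <name>` output on stdin and print
# the Rev column of the first data row.
installed_rev() {
    awk 'NR == 2 { print $3 }'
}

# Hypothetical helper: succeed when REV is a plain number >= MIN.
# Returns 2 for non-numeric revisions such as "x1" (local installs).
rev_at_least() {
    case $1 in
        '' | *[!0-9]*) return 2 ;;
    esac
    [ "$1" -ge "$2" ]
}

if command -v snap >/dev/null 2>&1; then
    rev=$(snap list docker 2>/dev/null | installed_rev)
    if rev_at_least "$rev" 2932; then
        echo "docker snap revision $rev is at or past 2932"
    else
        echo "docker snap revision '${rev:-unknown}' is older than 2932 or not comparable"
    fi
fi
```

If the installed revision is older, sudo snap refresh docker should pick up the current stable revision.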
I suggest closing this, and opening a new issue in the docker-snap repo if you believe you have a docker related issue.
I have installed Ubuntu Core 22 on a PC that has an NVIDIA 3090 GPU. Previously, this worked fine with Docker when I had Ubuntu Server installed, but since moving to Ubuntu Core 22, I have not been able to see the nvidia runtime in the docker snap. Also, the output of lspci (included below) seems to indicate that the GPU hasn't been recognized as a 3D controller. Since I don't see any issues, perhaps I am missing a step?
Some relevant info:
Output of nvidia-core22.smi
Output of ls -la /snap/nvidia-core22/current/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.161.08
Output of sudo docker info | grep Runtime
Snaps installed on Ubuntu Core 22
Output of sudo snap logs nvidia-assemble
Error found when running sudo snap logs docker
Output of sudo pciutils.lspci -v | grep -i nvidia
Truncated output of modinfo nvidia