Igalia / webkit-container-sdk

The all-in-one SDK for WebKit GTK/WPE port development.
MIT License

Nvidia driver update might prevent container startup due to outdated /etc/cdi/nvidia.yaml bind mount entries #23

Open lauromoura opened 1 month ago

lauromoura commented 1 month ago

After upgrading the nvidia driver on the host system, the container failed to restart, which left wkdev-enter stuck. When I started it manually with podman start wkdev, podman failed with the following message:

Error: unable to start container "ecdb2132becbc09015d0bb7d14299693a75a462be160a2a4fc7610d60e0d3f2a": crun: error stat'ing file /lib/x86_64-linux-gnu/libEGL_nvidia.so.535.161.07: No such file or directory: OCI runtime attempted to invoke a command that was not found

On the host system, that file was replaced by libEGL_nvidia.so.535.171.04 after the driver update.

As @TingPing pointed out on Matrix, this file was listed in /etc/cdi/nvidia.yaml as a bind mount (see below). After removing nvidia.yaml and re-running wkdev-setup-nvidia-gpu-for-container, it pointed to the newer library. I couldn't check whether this alone was enough, as the container had already been rebuilt after updating the base image. In any case, future driver updates might trigger similar issues.

- containerPath: /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.535.171.07
  hostPath: /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.535.171.07
  options:
  - ro
  - nosuid
  - nodev
  - bind

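
A quick way to confirm this failure mode might be to scan the CDI file for hostPath entries that no longer exist on the host. The following is only a minimal sketch: the helper names are made up for illustration, and it uses a regular expression on the assumption that each hostPath sits on its own line, as in the excerpt above, rather than parsing the YAML properly.

```python
import re
from pathlib import Path

# Matches "hostPath: <value>" on its own line, optionally as the first key
# of a list item. Assumption: values contain no spaces.
HOST_PATH_RE = re.compile(r"^[ \t]*(?:-[ \t]+)?hostPath:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def host_paths(spec_text: str) -> list[str]:
    """Return every hostPath value found in the CDI spec text, in file order."""
    return [value.strip("'\"") for value in HOST_PATH_RE.findall(spec_text)]


def stale_host_paths(spec_text: str) -> list[str]:
    """Return the hostPath values that do not exist on this machine."""
    return [path for path in host_paths(spec_text) if not Path(path).exists()]


if __name__ == "__main__":
    spec_file = Path("/etc/cdi/nvidia.yaml")
    if spec_file.is_file():
        for path in stale_host_paths(spec_file.read_text()):
            print(f"stale CDI mount: {path}")
```

If this prints anything after a driver update, the CDI file most likely still refers to the old driver version.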
TingPing commented 1 month ago

From Niko:

that won't work in the nvidia case, since we don't bind-mount anything directly; we use podman's CDI support to pass on the nvidia.yaml config file generated by the nvidia container toolkit

a potential fix is to dump the nvidia lib paths using the nvidia-ctk tool, cache that output, and compare it every time you enter a container

if that changes, abort, and tell the user to run wkdev-setup-nvidia-gpu, then use wkdev-update and recreate the containers. there's no way around recreating containers if the CDI definitions change
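
The cache-and-compare idea could be sketched roughly as follows. This is not how the SDK implements it; the function names and the cache location are hypothetical, and it assumes that nvidia-ctk cdi generate prints the spec to stdout when no output file is given. Only the hostPath lines are hashed, so unrelated changes in the generated spec do not trigger an abort.

```python
import hashlib
import subprocess
from pathlib import Path


def current_cdi_spec() -> str:
    """Generate a fresh CDI spec (assumes nvidia-ctk writes it to stdout)."""
    result = subprocess.run(
        ["nvidia-ctk", "cdi", "generate"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def mount_digest(spec_text: str) -> str:
    """Hash only the hostPath lines, sorted so ordering does not matter."""
    lines = sorted(l.strip() for l in spec_text.splitlines() if "hostPath:" in l)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def remember_spec(spec_text: str, cache_file: Path) -> None:
    """Store the digest; a setup step would call this after generating the spec."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(mount_digest(spec_text) + "\n")


def spec_changed(spec_text: str, cache_file: Path) -> bool:
    """True if no digest is cached yet or the cached digest differs."""
    if not cache_file.is_file():
        return True
    return cache_file.read_text().strip() != mount_digest(spec_text)


def check_before_enter(cache_file: Path) -> None:
    """Abort with a hint if the host's nvidia libraries changed since setup."""
    if spec_changed(current_cdi_spec(), cache_file):
        raise SystemExit(
            "CDI definitions changed: re-run the nvidia setup script, "
            "then wkdev-update, and recreate the container."
        )
```

A wrapper around wkdev-enter could call check_before_enter() with a per-user cache file before starting the container, and the nvidia setup script would call remember_spec() each time it regenerates /etc/cdi/nvidia.yaml.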