NVIDIA / gpu-driver-container

The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers.
Apache License 2.0

Migrated MR - Fedora CoreOS Stable/Testing/Next Support #8

Open chris-bateman opened 4 months ago

chris-bateman commented 4 months ago

Migrating this as we would like to see native support for Fedora CoreOS in this project. See - https://gitlab.com/nvidia/container-images/driver/-/issues/34

fifofonix commented 4 months ago

Summarizing the linked issue with what I believe to be the latest status.

There is presently no official support for Fedora or Fedora CoreOS (FCOS), i.e. no Docker images are published automatically to the NVIDIA container registry. A community contribution under ci/fedora/.gitlab-ci-fcos.yml provides an alternative GitLab CI/CD file that can be used to build FCOS next/testing/stable driver container images with pre-compiled kernel headers, but it assumes the existence of FCOS gitlab-runners tracking the various streams ^1.

My fork of this project on GitLab has used this strategy successfully for a couple of years now to build and push images to Docker Hub.

The resulting images are currently used successfully on Typhoon k8s FCOS clusters. However, those clusters run nvidia-device-plugin with the container toolkit installed on the host and the driver container running outside of k8s. I did have gpu-operator running a long time ago, but I have not been able to get it working more recently, for reasons unknown.

^1 Note that it should be possible to build functioning images that omit the pre-compiled kernel headers, and doing so should not require FCOS gitlab-runners.

zwshan commented 1 month ago

Hello, I followed the instructions in your document and ran these commands:

$ DRIVER_VERSION=535.154.05 # Check ci/fedora/.common-ci-fcos.yml for latest
$ FEDORA_VERSION_ID=$(cat /etc/os-release | grep VERSION_ID | cut -d = -f2)
$ podman run -d --privileged --pid=host \
     -v /run/nvidia:/run/nvidia:shared \
     -v /var/log:/var/log \
     --name nvidia-driver \
     registry.gitlab.com/container-toolkit-fcos/driver:${DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}

However, the container immediately goes into the 'exited' state, and it stays that way even after a restart. What should I do?

fifofonix commented 1 month ago

I expect that if you examine the podman logs for the container above, you will see compilation errors (they are hard to spot because there are a lot of expected warnings).
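
One way to sift the errors out of the warnings is sketched below. The helper name `filter_build_errors` and its match patterns are assumptions made for illustration (they reflect typical gcc and make output, not anything shipped with the driver container), so treat it as a starting point rather than a definitive filter.

```shell
# Hypothetical helper (not part of the driver project): read build output on
# stdin and keep only lines that look like compiler or make failures.
# The patterns are an assumption about typical gcc/make output.
filter_build_errors() {
    # "error:" matches gcc diagnostics such as "foo.c:12:3: error: ..."
    # "*** ... Error" matches make failures such as "make[1]: *** [x] Error 2"
    grep -E 'error:|\*\*\* .*Error'
}

# Example use, with the container name from the command earlier in the thread:
#   podman logs nvidia-driver 2>&1 | filter_build_errors
```

The `2>&1` in the example matters because `podman logs` replays the container's stderr on stderr, which is where most compiler diagnostics end up. If the filter prints nothing, the patterns may simply not match your log, so it is still worth skimming the tail of the full output.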

These errors occur with Fedora 40 and won't be officially addressed by NVIDIA until they incorporate a patch they have advertised on their community forum. I have this patch applied on my fleet and it works fine, but applying it is tricky.

See: https://gitlab.com/container-toolkit-fcos/driver/-/issues/11