NVIDIA / gpu-driver-container

The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers.
Apache License 2.0

Migrated MR - Fedora CoreOS Stable/Testing/Next Support #8

Open chris-bateman opened 8 months ago

chris-bateman commented 8 months ago

Migrating this as we would like to see native support for Fedora CoreOS in this project. See - https://gitlab.com/nvidia/container-images/driver/-/issues/34

fifofonix commented 7 months ago

Summarizing the linked issue with what I believe to be the latest status.

There is no official support for Fedora or Fedora CoreOS (FCOS) at present (i.e. no Docker images are published automatically to the NVIDIA container registry). A community contribution under ci/fedora/.gitlab-ci-fcos.yml provides an alternate GitLab CI/CD file that can be used to build FCOS next/testing/stable driver container images with pre-compiled kernel headers, but it assumes the existence of FCOS gitlab-runners tracking the various streams ^1 (a rough sketch of the shape of such a job is included at the end of this comment).

My fork of this project on GitLab has been using this strategy successfully for a couple of years now to build and push images to Docker Hub.

The resulting images are currently used successfully on Typhoon k8s FCOS clusters. However, those clusters run nvidia-device-plugin with the container toolkit installed on the host and the driver container running outside of k8s. Although I had gpu-operator running a long while back, I have not been able to get it running more recently for reasons unknown.

^1 Note that it should be possible to build functioning images without pre-compiled kernel headers, in which case FCOS gitlab-runners are not required.
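For orientation, the shape of such a CI job is roughly as follows. This is a simplified sketch, not the actual contents of ci/fedora/.gitlab-ci-fcos.yml: the job name, runner tag, build args and registry path are illustrative and will differ from the real file.

build-driver-fcos-stable:
  stage: build
  tags: [fcos-stable]   # assumed: a gitlab-runner on an FCOS node tracking the stable stream
  variables:
    DRIVER_VERSION: "535.154.05"
  script:
    # Kernel and Fedora version come from the runner host, which is why the
    # runner needs to track the stream the image is being built for.
    - FEDORA_VERSION_ID=$(. /etc/os-release && echo "$VERSION_ID")
    - KERNEL_VERSION=$(uname -r)
    - podman build
        --build-arg DRIVER_VERSION="${DRIVER_VERSION}"
        --build-arg FEDORA_VERSION_ID="${FEDORA_VERSION_ID}"
        --build-arg KERNEL_VERSION="${KERNEL_VERSION}"
        -t "${CI_REGISTRY_IMAGE}/driver:${DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}"
        fedora/
    - podman push "${CI_REGISTRY_IMAGE}/driver:${DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}"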

zwshan commented 5 months ago

Hello, I followed the instructions in your document and executed the commands:

$ DRIVER_VERSION=535.154.05 # Check ci/fedora/.common-ci-fcos.yml for latest
$ FEDORA_VERSION_ID=$(cat /etc/os-release | grep VERSION_ID | cut -d = -f2)
$ podman run -d --privileged --pid=host \
     -v /run/nvidia:/run/nvidia:shared \
     -v /var/log:/var/log \
     --name nvidia-driver \
     registry.gitlab.com/container-toolkit-fcos/driver:${DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}

but the container automatically changes to an 'exited' state, and it remains the same even after restarting. What should I do?

fifofonix commented 5 months ago

I expect that if you examine the podman logs for the container above, you will see compilation errors (they can be difficult to spot because there are a lot of expected warnings).
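Something along these lines will show them (adjust the name if you used a different --name; the grep is just a rough filter and will produce some noise):

$ podman logs nvidia-driver 2>&1 | less
$ podman logs nvidia-driver 2>&1 | grep -iE 'error|fail'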

These errors occur with Fedora 40 and won't be officially addressed by NVIDIA until they incorporate a patch they have advertised on their community forum. I have this patch applied on my fleet and it works fine, but applying it is tricky.

See: https://gitlab.com/container-toolkit-fcos/driver/-/issues/11

r0k5t4r commented 3 months ago

Hi, happy to join your talks here.

I'm currently also struggling to get NVIDIA drivers and CUDA working on an EOL FCOS release, e.g. 37. I managed to get it working by using the rpm-ostree method, though redoing the same steps yesterday didn't work. I figured out that there is a new version of nvidia-container-toolkit, 1.16.1.1, that seems to break things; reverting back to 1.16.0.1, everything works fine.

I also tried your containers, @fifofonix, but we ended up with the same issue as zwshan. Checking the logs and your Dockerfile, you have a hardcoded HTTP_PROXY set in your run commands. I guess this is why the container exits with -1; at least that is what we can see on our side.

Can you please check / confirm?

Furthermore I'm no expert on FCOS.

fifofonix commented 3 months ago

@r0k5t4r

You are correct that I have embedded unreachable proxies in the driver images and that will most certainly be a problem if you are running a driver image that does not have pre-compiled kernel headers. I will fix this in the next couple of weeks but for new images only.

When running a kernel-specific driver image with pre-compiled kernel headers I do not even launch it with a network interface, since the image does not need network access at all. I'm therefore skeptical that the gpu-operator issue you are experiencing is due to this. However, perhaps you can run a test, creating your own image minus the environment variables and launching that, to see whether it succeeds?
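As a quicker first test you could also just blank the proxy variables at run time instead of rebuilding, assuming they are set via ENV in the image (the exact variable names baked in may differ):

$ podman run -d --privileged --pid=host \
     -e HTTP_PROXY= -e HTTPS_PROXY= -e http_proxy= -e https_proxy= \
     -v /run/nvidia:/run/nvidia:shared \
     -v /var/log:/var/log \
     --name nvidia-driver-test \
     registry.gitlab.com/container-toolkit-fcos/driver:${DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}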

Also, out of interest, why are you running such an old FCOS image (with security issues etc.)?

Finally, I just want to share that right now I'm running 550.90.07 on each of the current Fedora 40 next/testing/stable streams on k8s nodes without issue, using toolkit 1.16.1-1. However, I am running the driver as a podman systemd unit on each of the GPU worker nodes and deploying nvidia-device-plugin via Helm, because I could not get gpu-operator to launch the driver as a container successfully. I have not gone back in the past three months to retry getting gpu-operator to work, but it is on my long-term to-do list. Note also that my fork of gpu-driver-container is running a little behind on driver versions etc. right now, I think, and I have a to-do to rebase and adopt all the latest good work in this project.
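For reference, the per-node setup is roughly a unit like the following (a simplified sketch, not my exact unit; adjust the image tag to your stream and driver version):

[Unit]
Description=NVIDIA driver container
Wants=network-online.target
After=network-online.target

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/podman rm -f nvidia-driver
ExecStart=/usr/bin/podman run --rm --name nvidia-driver --privileged --pid=host \
    -v /run/nvidia:/run/nvidia:shared \
    -v /var/log:/var/log \
    registry.gitlab.com/container-toolkit-fcos/driver:550.90.07-fedora40
ExecStop=/usr/bin/podman stop nvidia-driver

[Install]
WantedBy=multi-user.target

with nvidia-device-plugin then installed from the upstream Helm chart:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm install nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace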

r0k5t4r commented 3 months ago

@fifofonix thanks for your reply.

I have to run this old FCOS version since it is used in OpenStack Magnum. I have to run K8s 1.21.x, and even if I ran a newer K8s release, OpenStack Magnum still wouldn't work with an up-to-date version of FCOS. For newer K8s releases I use Vexxhost's CAPI driver in OpenStack Magnum, which can use either Ubuntu or Flatcar as the base OS.

In another post, https://github.com/NVIDIA/gpu-operator/issues/696, I also mentioned that I managed to get GPU Operator working without driver deployment (sketched at the end of this comment). IMHO this is the best solution, as it manages everything, especially when using OpenStack Magnum. My idea is that the deployment should be simple and effortless, and this is exactly what you can achieve with the NVIDIA GPU Operator. Next I will try to get the driver deployment through GPU Operator working too.

The only reason it is not working yet is that once the GPU Operator deploys the driver container, it tries to install kernel headers and they just don't exist, at least on the latest FCOS 37 and the older FCOS 35 release that I'm using.
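For anyone wanting to reproduce the driver-less setup, the idea boils down to installing the chart with driver deployment disabled. The value name below is taken from the current chart documentation and may differ on the much older operator release I have to run (there is an equivalent toolkit.enabled switch if the toolkit is also installed on the host):

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm install gpu-operator nvidia/gpu-operator \
     --namespace gpu-operator --create-namespace \
     --set driver.enabled=false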

fifofonix commented 3 months ago

Thanks for the links @r0k5t4r, I had missed some of the updates on the gpu-operator thread. Of course this is the way to go and I look forward to experimenting with it soon!

r0k5t4r commented 2 months ago

Hi,

I made some progress. I spent plenty of time looking into the problem with the driver deployment through gpu-operator and noticed/fixed a couple of things.

Since I'm using a very old FCOS 35 release that is EOL, the repo URLs are different. I added some logic to check whether the container is being built on an EOL release and, if so, switch the repos to the archive (roughly as sketched below).

Furthermore, it seems that the driver fails to build due to a gcc mismatch. The environment variable to disable this check seems to be ignored, so I also added some logic to download the correct gcc version from Koji.
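The EOL part is roughly the following kind of logic early in the build; the cutoff version and sed expressions are a sketch of my change, not something to copy verbatim:

# If building against an EOL Fedora release, point dnf at the archive mirrors
# instead of the regular metalink targets, which no longer serve EOL releases.
FEDORA_VERSION_ID=$(. /etc/os-release && echo "$VERSION_ID")
if [ "$FEDORA_VERSION_ID" -lt 38 ]; then
    sed -i -e 's|^metalink=|#metalink=|' \
           -e 's|^#baseurl=http://download.example/pub/fedora/linux|baseurl=https://archives.fedoraproject.org/pub/archive/fedora/linux|' \
           /etc/yum.repos.d/fedora*.repo
fi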

Now I can not only build the container successfully but also run it without any problems directly in Podman on my FCOS 35 nodes. I can run nvidia-smi just fine, and the NVIDIA modules are listed when running lsmod on the host OS.

But for some reason Kubernetes is still not happy with it. It compiles the driver and installs it, but then unloads it, destroys the pod and starts all over again. Maybe an issue with the probe? I don't know.

I have not yet put my code online, but I'm planning to do so within the next week.

I will try to precompile the driver in the container, commit and push it to my repo, and then try using that.
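i.e. something along these lines once the driver has been built inside the running container (registry and tag are placeholders):

$ podman commit nvidia-driver registry.example.com/myuser/driver:fcos35-precompiled
$ podman push registry.example.com/myuser/driver:fcos35-precompiled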

Cheers, Oliver

dfateyev commented 2 months ago

@r0k5t4r thanks for your efforts. I would like to add some details here which may be useful.

r0k5t4r commented 2 months ago

@dfateyev you're welcome. Open source relies heavily on contributions. :) And it is interesting to investigate this issue and hopefully fix it one day.

Thanks for your input. Very valuable. 👍

> Your pod is destroyed by the K8s scheduler, which considers it unhealthy after probing. To debug it in detail, you can extract the driver-related Deployment from the gpu-operator suite and deploy your driver image with your own Deployment. There you can customize or skip probing to see what fails.

Good idea to debug the driver deployment like this. I will try it.
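If I understand the suggestion right, that would be something like dumping the driver workload the operator created (a DaemonSet on my cluster), stripping the probes and ownerReferences, and applying it standalone. The resource and namespace names below are guesses and will vary by operator version:

$ kubectl -n gpu-operator-resources get daemonset nvidia-driver-daemonset -o yaml > driver-ds.yaml
$ # edit driver-ds.yaml: remove startupProbe/livenessProbe/readinessProbe and
$ # the ownerReferences block, rename it, then apply it on its own:
$ kubectl -n gpu-operator-resources apply -f driver-ds.yaml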

But even with your precompiled container, it is not working yet with GPU Operator in K8s?

I also tried to use a precompiled container, but I noticed that this was first supported in gpu-operator version 23.3.0, while I have to run a much older release since we need K8s 1.21.x. :)

So no precompiled containers for me.

r0k5t4r commented 2 months ago

I found this very good article from CERN; they seem to have successfully used one of @fifofonix's containers with Magnum and the GPU Operator. I tried using the same version but it did not work. I don't know which k8s version they used; I'm running 1.21.11.

https://kubernetes.web.cern.ch/blog/2023/03/17/efficient-access-to-shared-gpu-resources-part-2/