[RFE] Improve Nvidia driver situation

jepio commented 2 years ago

TODO: currently just a brain dump

work with Nvidia to get driver-container solution working
maybe publish pre-built drivers with every release for direct consumption? sysext?
publish flatcar development container images to a container registry for easier consumption
optimize build speed? bzip2 -> lbzip2; skip emerge-gitclone and emerge coreos-sources steps since /lib/modules is sufficient for building modules
allow users to override NVIDIA_DRIVER_VERSION (some older GPUs are not supported in newest drivers)
Integrate nvidia-container-toolkit to make docker run --gpus all work out-of-the-box?

Current situation

[ Please describe the current situation you would like to have improved ]

Impact

[ Please describe the impact the lack of the feature requested is creating ]

Ideal future situation

[ Please describe the future situation after the improvement was implemented ]

Implementation options

[ Optional: please provide one or more options for implementing the feature requested ]

Additional information

[ Please Add any information that does not fit into any of the above sections here ]

jepio commented 2 years ago

Would love to hear user feedback/pain-points since you know best what currently works/doesn't.

studioph commented 2 years ago

I’ll preface by saying that I’m using Flatcar on diskless compute nodes that are booted via iPXE and provisioned each boot. I mention this because some of the things I highlight below may not apply to a more traditional installation scenario of Flatcar (until an update probably). However, I view Flatcar’s ability to be a stateless OS to be a major strength and attraction, so I hope that whatever solutions that come out of this can work in those scenarios as well.

Current situation

Adding GPU drivers to Flatcar relies on userspace tools to add/provision, which creates issues when you have containers that need GPUs. What I mean by this is the current approach is to use dev containers to build the drivers and kernel modules – which requires the container runtime to already be running – so any workload containers that require GPUs won’t start until after the driver step has completed. This adds specific startup ordering requirements to workloads that ideally shouldn’t be there.

Impact

Either very specific startup ordering needs to be managed with something like systemd, or in some cases SSHing into the instance and manually starting/restarting containers.

Ideal future situation

Something where GPU capabilities can be added/provisioned before the regular userspace/containers are started (i.e. with Ignition or some other similar tool)

Implementation options

I think all of the ideas you listed are great, I see it really as a matter of the balance between the amount of work the Flatcar team has to do vs end users. Just some of my own thoughts:

Pre-built drivers/sysext – this would obviously be the easiest for end users, but the most work for you guys. That being said, having something like either pre-built binaries or a sysext image that can be downloaded and added via ignition would be amazing, and as far as I know, unique to how every other distro works with Nvidia drivers.
I think publishing container images is probably the best balance as it makes it more accessible for people to build custom modules, or even custom kernels. I think it makes a clear separation between what should be the responsibility of the Flatcar team (Flatcar related stuff) vs the responsibility of the end users (everyone’s software needs are different). It also avoids the potential for scope creep in the future – if we make some kind of special case/exception for Nvidia today, what other software will come along down the road and people will be asking for the same thing
- Optimizing build speed should probably be a given if publishing dev containers
You may already be familiar with https://github.com/mediadepot/docker-flatcar-nvidia-driver, and while in a lot of ways it works well, it does require explicitly mounting each nvidia device file to the container, which is both, annoying, and I could see it causing problems in a system with multiple gpus vs being able to refer to them via name or other alias provided by the Nvidia runtime.
While it would be nice to have the nvidia runtime work with docker, from what I can tell the Kubernetes operator does not require the container toolkit to work, so I’m not sure how much demand there is for it to work on plain docker with the trend towards kubernetes.

Additional information

I had some related thoughts on Intel iGPUs I wanted to share, but I’m not sure if you’d prefer that in a separate issue and keep this thread strictly about Nvidia rather than GPUs more generally.

shsamkit commented 11 months ago

@studioph

While it would be nice to have the nvidia runtime work with docker, from what I can tell the Kubernetes operator does not require the container toolkit to work, so I’m not sure how much demand there is for it to work on plain docker with the trend towards kubernetes.

Could you share some reference for Kubernetes with GPU on flatcar? I have been trying to work the GPU operator with this but it looks like there is a depenendency on the runtime. Also, the documented driver container isn't building because of a mount of the write-protected directories.

sdlarsen commented 5 months ago

A way to get a build of the NVIDIA Container Toolkit would be nice. Ideally resembling the way the driver is installed by enabling the nvidia.service.

jepio commented 5 months ago

I had an early sysext of Nvidia container toolkit sysext here: https://github.com/jepio/flatcar-nvidia-overlay/releases/tag/v1.0 but it's outdated and required some manual config to enable.

Now that we have a the sysext bakery repo it should be significantly easier to experiment with this.

heilerich commented 5 months ago

I have been running the NVIDIA GPU Operator on flatcar in multiple clusters for some years now using it to manage the driver as well as the toolkit.

The requirement of doing this are currently the following:

Provisioning a systemd unit with an overlay mount for /usr/local on the flatcar nodes
A custom driver container for flatcar (forked from an effort abandoned by nvidia a couple of years ago)
Some custom settings for the operator

In general this works well, but maintaining the driver container has become incresingly annoying, because the team maintaining the GPU Operator make some assumptions about paths and in general about how the their official driver containers do things. This has become a frequent source of problems after version upgrades.

Also the /usr/local overlay seems counter intuitive to how flatcar works and breaks if one is using sysexts or the new nvidia.service driver systemd unit.

I have also experimented with combining the current flatcar nvidia.service systemd solution to install the driver and GPU Operator with the driver component disabled, i.e. just installing the toolkit and the management & observability features via the operator. This also works, but requires even more hacks, since the official NVIDIA toolkit and driver containers are also devloped in tandem and make assumptions about paths etc.

I currently see two ways forward that would vastly improve the GPU situation on flatcar for us:

(A) An effort by the Flatcar team and/or the community to create a driver container (and maybe toolkit container) that work on flatcar and are compatbile with GPU Operator. I think it would not be too hard to also make these containers work for users which do not run kubernetes or the operator to provide an 'out of the box' GPU experience.
(B) A sysext, systemd unit/script, container solution that ships with flatcar. If there was a solution that included driver and toolkit, this could be used with GPU Operator by disabling the container and toolkit installation features.

From my current expierence just providing the driver and using the operator to install the toolkit seem like a rather complicated and unstable solution, because those two are so tightly coupled.

jepio commented 4 months ago

I've opened a draft PR https://github.com/flatcar/scripts/pull/1705 to get nvidia-container-toolkit included in base Flatcar, would welcome some testing/feedback. Built images are available here for a couple of weeks: https://bincache.flatcar-linux.net/images/amd64/3888.0.0+nvidia-container.2/. Seems like this would greatly simplify the user experience.

(A) An effort by the Flatcar team and/or the community to create a driver container

I had a driver container working at some point, even submitted the PR upstream but there was never any response. Lets see if i can update it and we could do our own publishing.

(B) A sysext, systemd unit/script, container solution that ships with flatcar.

The idea I have here is to move to pre-built driver sysexts based on open-gpu-kernel modules. We can pre-build and publish a driver sysext for every release.

The challenge I see is supporting different combinations of NVIDIA driver versions. I think it is basically required for users to have control over the driver version, and publishing every driver version for every flatcar release people may still be using is impractical. We've seen this also for ClusterAPI and Flatcar+Kubernetes versions: the matrix of versions required to support every scenario is too big.

So I think what we'll end up with is pre-built sysexts, nvidia.service and the the nvidia driver container to support increasingly more complex usecases.

Seperately from this, someone approached me at FOSDEM about support for RDMA/MOFED. I have never tried these myself, so if anyone has any information on those please speak up.

jepio commented 4 months ago

Tried to deploy the GPU operator on the image above, with driver.enabled and toolkit.enabled set to false. The nvidia-operator-validator checks that nvidia-smi is available at /usr/bin/nvidia-smi, so I'm thinking that needs to either be a symlink, or nvidia.service can setup a small sysext.

heilerich commented 4 months ago

I can confirm that a symlink in /usr/bin/nvidia-smi will work. We have been running a test system with this for a couple of weeks now.

Currently, gpu-operator requires two filesystem modifications to work

A writable overlay at /usr/local because the operator expects to write some files there and the mount points are not configurable. This might go away with both the toolkit and the driver component disabled?
An overlay mount at /usr/bin so /usr/bin/nvidia-smi can be symlinked

Integrating setting up the nvidia-smi link into the service / sysext would be a great help.

Also the following setting on the operator must be set

 validator:
      driver:
        env:
        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
          value: "true"

Otherwise the driver validation phase fails while trying to create some symlinks that the nvidia.service already creates

jepio commented 4 months ago

A writable overlay at /usr/local because the operator expects to write some files there and the mount points are not configurable. This might go away with both the toolkit and the driver component disabled?

This seems to be required for the nvidia-container-toolkit installation, so it goes away if the toolkit is preinstalled. I also see that you can override this (possibly):

kubectl patch clusterpolicy/cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/toolkit/installDir", "value": "/opt/nvidia"}]'

I'll add a symlink into flatcar to put /usr/local/nvidia into /var so that people don't hit this.

DISABLE_DEV_CHAR_SYMLINK_CREATION

I didn't hit this issue

heilerich commented 4 months ago

DISABLE_DEV_CHAR_SYMLINK_CREATION

This happens during the driver validation phase of the gpu-operator-validator Pod. Possibly, the driver is not validated if the toolkit installation is disabled? The preinstalled toolkit scenario is the only I have not tested yet.

jepio commented 4 months ago

Oh yeah, now I hit it. I didn't in my initial testing because i symlinked /opt/nvidia/current to /run/extensions/nvidia-driver, and applied it to /usr/lib/modules as an overlay (systemd-sysext merge), which allows the gpu-operator-validator to proceed.

I've created a PoC sysext that deploys the nvidia-container-toolkit as a sysext. If anyone is interested in trying it (tested on 3760.2.0): https://gist.github.com/jepio/d3a71d7e973278130367c7844242f616/raw/93cf6a8f12c9d14a37a92e448f528c0025eeca80/nvidia-runtime.raw. Save this as /etc/extensions/nvidia-runtime.raw.

With this sysext applied I can successfully deploy gpu-operator with driver.enabled=false and toolkit.enabled=false.

jepio commented 4 months ago

Another thing that I came up with is: we can add a user oem-postinst hook that prebuilds the nvidia driver during update installation so that on next boot the driver can be directly loaded. Still need to try it and I figure this might be "best effort"/opt-in.

guhuajun commented 2 months ago

Another thing that I came up with is: we can add a user oem-postinst hook that prebuilds the nvidia driver during update installation so that on next boot the driver can be directly loaded. Still need to try it and I figure this might be "best effort"/opt-in.

Willing to see! I tried flatcar with nVidia RTX 2080 TI on a ESXi installation, but failed with wrong parameter. On the same ESXi installation, normal Ubuntu is working as expected. Please let me know if you want me to be a tester for this scenario.

heilerich commented 3 weeks ago

The sysext above works well on our test system with gpu-operator and driver/toolkit.enabled=false.

We had to cleanup some files on nodes that had a gpu-operator installed driver/toolkit before and run systemd-tmpfiles --create in order to get it running but had no problems otherwise so far.

/etc/containerd/config.toml
/etc/nvidia-container-runtime/config.toml

I like the idea of having an oem-postinst hook for prebuilding the driver.

@jepio Is the build script for your sysext available somewhere?

jepio commented 3 weeks ago

I think this draft PR to sysext-bakery has the script I used: https://github.com/flatcar/sysext-bakery/pull/51/files (sorry if it doesn't work anymore).

I had to shift focus for a while, but will try to come back to this.

flatcar / Flatcar