coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

How to support installing kernel modules #249

Open miabbott opened 5 years ago

miabbott commented 5 years ago

Users may have a need to install kernel drivers on their hosts to support additional hardware. This could be required for boot (a day 1 operation) or could be required after install to enable adapters (a day 2 operation).

The straightforward way to accomplish this is to package the drivers in RPM format, so that they can be installed via rpm-ostree install. Users may want to be able to build these drivers on an FCOS host, which would require a container with the necessary dependencies installed.

It would be useful to come up with a framework that is generic enough to be reused by multiple drivers and that makes it possible to produce multiple versions of a driver (one per kernel version).

Copying notes from @cgwalters below:

There are conceptually three linked phases: how modules are built, how they're delivered, and finally how they're installed on the host. As I noted elsewhere, I think we should make it easy to have a single container image supporting multiple kernel versions. Delivery would be something like /usr/lib/modules/$kver/foo.ko with multiple $kver in the container. How they're installed gets tricky if we want to integrate with upgrades. Perhaps the simplest thing is to have RPMs of each kernel module that Require: their exact target kernel. Then the container content is provided to the host, we inject /etc/yum.repos.d/kmods-$provider.repo pointing at it, and do rpm-ostree install kmod-$provider. Then on upgrade rpm-ostree will try to pick the right one, and fail if it's not available.
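
To make that concrete, here is a minimal sketch of the last two steps; the provider name "acme" and the baseurl are placeholders, not anything that exists today:

    # Hypothetical: drop a repo file for a kmod provider, then layer the package.
    cat > /etc/yum.repos.d/kmods-acme.repo <<'EOF'
    [kmods-acme]
    name=Acme kernel modules (placeholder)
    baseurl=http://example.com/kmods-acme/
    enabled=1
    gpgcheck=0
    EOF

    # rpm-ostree picks the build whose exact kernel requirement matches the
    # target deployment, and the upgrade fails if no matching build exists.
    rpm-ostree install kmod-acme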

cgwalters commented 5 years ago

A hugely tricky question here is whether 3rd parties will want a mechanism that also works nearly the same for yum managed systems as well - how tolerant will they be of a distinct mechanism for FCOS? It may depend.

One thing I mentioned in the Silverblue+nvidia discussion is that we could add rpm-ostree support for arbitrary hooks run during upgrades. Today %post scripts from installed RPMs are constrained, but we could support something like /etc/rpm-ostree/hooks.d, with hooks passed the new target rootfs as an argument. That would allow near-total flexibility, because a hook could just run a container that did whatever it wanted, from building a module to checking for a pre-built one; if a hook exited with failure, that would also block the upgrade.
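
If something like that landed, a hook could be a small script along these lines; the hooks.d path, the argument convention, and the builder image are all hypothetical, sketched only from the description above:

    #!/bin/bash
    # Hypothetical /etc/rpm-ostree/hooks.d/50-kmod-check: gets the new target
    # rootfs as $1 (this interface is a proposal, it does not exist today).
    set -euo pipefail
    rootfs=$1

    # Which kernel does the pending deployment ship?
    kver=$(ls "${rootfs}/usr/lib/modules" | head -n1)

    # Build (or verify a pre-built) module for that kernel in a container;
    # a non-zero exit status here is what would block the upgrade.
    podman run --rm -v "${rootfs}:/host-rootfs:ro" example.com/kmod-builder \
        check-or-build "${kver}"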

cgwalters commented 5 years ago

One useful pattern then would be to have a Kubernetes daemonset container inject its hook into the host on startup, ensuring that it gets executed when an upgrade is attempted.

dustymabe commented 5 years ago

The easiest/cleanest approach is to have all kernel modules built for every kernel and provided via an rpm that requires that kernel. For example someone could set up a copr that triggers on every kernel build and builds a related kernel module rpm for that kernel. Then adding the yum repo and rpm-ostree installing the rpm should suffice, correct?
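
For illustration, a rough sketch of the per-kernel build such a COPR (or any other build service) would run, assuming a hypothetical kmod-foo.spec that takes the kernel version as a macro and declares a hard dependency on it:

    # Hypothetical: build a module RPM pinned to one exact kernel version.
    # kmod-foo.spec is assumed to contain something like:
    #   Requires: kernel-core = %{kver}
    kver=$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-core | head -n1)
    rpmbuild -ba kmod-foo.spec --define "kver ${kver}"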

It's a lot uglier when we have to recompile on upgrade on the host. Especially when that host is supposed to be minimal (hence why you need to do it in a container).

cgwalters commented 5 years ago

https://github.com/projectatomic/rpm-ostree/pull/1882 is a quick hack I started on the hooks thing.

bgilbert commented 5 years ago

Then on upgrade rpm-ostree will try to pick the right one, and fail if it's not available.

@lucab If an upgrade fails, will Zincati retry later, or give up immediately? This seems like a case where a later retry might succeed.

lucab commented 5 years ago

Zincati will keep retrying after some delay, both when trying to stage a new release (i.e. deploy --lock-finalization) and when trying to finalize (i.e. finalize-deployment) a deployment which it has previously successfully staged locally.
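
For reference, these are the two rpm-ostree operations involved (the version and checksum below are placeholders):

    # Stage the new release without finalizing (rebooting into) it.
    rpm-ostree deploy --lock-finalization 32.20200601.3.0

    # Later, when the update strategy allows, finalize the staged deployment.
    rpm-ostree finalize-deployment <checksum-of-staged-deployment>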

jlebon commented 5 years ago

The easiest/cleanest approach is to have all kernel modules built for every kernel and provided via an rpm that requires that kernel. For example someone could set up a copr that triggers on every kernel build and builds a related kernel module rpm for that kernel. Then adding the yum repo and rpm-ostree installing the rpm should suffice, correct?

I think I agree with this. It works just as well on FCOS/RHCOS as on traditional yum/dnf-managed systems. In the context of immutable host clusters, it makes more sense to me to build the kernel module once than have e.g. potentially thousands of nodes all compiling them on each upgrade. Not just for efficiency, but also for keeping down the number of things that could go wrong at upgrade time.

The flip side of this though is that we're then on the hook (pun intended) to provide tooling for this. Not everyone can use COPR. For RHCOS... maybe what we want is a way to hook into the update payload delivery flow so one can work on top of the new machine-os-content similarly to https://github.com/openshift/os/issues/382?

cgwalters commented 5 years ago

For example someone could set up a copr that triggers on every kernel build and builds a related kernel module rpm for that kernel. Then adding the yum repo and rpm-ostree installing the rpm should suffice, correct?

Yeah, this is a fine approach.

cgwalters commented 5 years ago

A slightly tricky thing here though at least for RHCOS is I'd like to support shipping the kernel modules in a container via e.g. daemonset - this is a real-world practice. Doing that with the "multi-version rpm-md repo" approach...hm, maybe simplest is actually to write a MachineConfig that injects the .repo file, and run a service that hosts the rpm-md repo.

cgwalters commented 5 years ago

Been thinking about this a lot lately and we've had a ton of discussions and the usual pile of private google docs. I want to emphasize how much I have come to agree with Dusty's comment.

One issue with this is that we don't have any direct package layering support in the MCD; we'd probably have to document dropping an /etc/yum.repos.d/nvidia.repo file and running rpm-ostree install nvidia-module or whatever via a daemonset. But in the end that gunk could be wrapped up in a higher-level nvidia-operator or whatever.
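
A very rough sketch of what such a daemonset's privileged pod might run against the host (the repo file, package name, and mount path are illustrative only):

    # Hypothetical: host filesystem mounted at /host inside a privileged pod.
    cp /opt/nvidia.repo /host/etc/yum.repos.d/nvidia.repo

    # Enter the host's namespaces so rpm-ostree talks to the host daemon.
    nsenter --target 1 --mount --uts --ipc --net --pid -- \
        rpm-ostree install nvidia-module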

lucab commented 5 years ago

For reference, here is what people use to bring nvidia & wireguard modules to Container Linux on k8s: https://github.com/squat/modulus

cgwalters commented 5 years ago

OK now I got convinced in another meeting that:

  • Exposing RPMs to users is too raw
  • Requiring a new service to build and maintain the RPM repo as kernel updates come in is not obvious

The core problem with atomic-wireguard and similar CL-related projects is that they don't have a good way to do the "strong binding" I think is really important: again, blocking the upgrade if the kernel module won't work with the new kernel.

So that seems to take us back to https://github.com/projectatomic/rpm-ostree/pull/1882 which will be generally useful anyways.

dustymabe commented 5 years ago

OK now I got convinced in another meeting that:

  • Exposing RPMs to users is too raw
  • Requiring a new service to build and maintain the RPM repo as kernel updates come in is not obvious

Hmm, exactly who are we concerned about exposing things to? Is it end users or is it module producers? For example, with wireguard we could work with the maintainer and set up one project that does the building of the RPMs and the creation of repos for each new kernel. That way we expose the pain of the "build service" to one person (or a small group of people) and the end users don't have pain. The end users simply add the yum repo and rpm-ostree install the RPM, and it should work from then on.

imcleod commented 5 years ago

Dusty, I’m largely responsible for the back and forth on this so I’ll try to re-frame a bit here.

I’ll summarize one proposal in two points. To use an out of tree module on *COS:

1) The module must be packaged as an RPM, able to be rebuilt against a specific kernel and result in an RPM that has a hard dependency on the kernel it is compiled for.

2) Something must be responsible for maintaining a repo (public, cluster-local, org local, etc.) that is populated with these compiled RPMs for new *COS releases as they come out, and before the updated release becomes the target for a node update or cluster update.

I’ve no doubt that if the two conditions above are met, the resulting behavior at the *COS level will be robust, bordering on bulletproof. Nothing prevents the community from trying to move forward with this.

I have two concerns.

Firstly, the existence of 2) above is problematic. In the product context (by which I mean OpenShift running on RHCOS) I’m getting hard pushback on the idea of introducing a new service/container that is responsible for hosting such a repo, and updating it with fresh RPM builds as needed, in coordination with the updates of the underlying *COS kernel. I don’t know what else to say on this point, other than that if we don’t have this repo, we do not have this solution.

My deeper concern is with point 1) above. Put bluntly I suspect that if we require RPM-ification as a prerequisite for third party modules on COS, we will get far fewer third party modules on COS.

To be clear, I’m not saying that it’s not possible to rpm-ify all desirable modules. What I am saying is that it’s extremely unlikely to happen organically. It has had plenty of time to happen organically on Fedora and RHEL and has not. There are very good tools and approaches that can be used to do this with RPMs and they come with many of the same advantages that the proposal outlined above would give. In spite of this, after over a decade and a half of RHEL and Fedora, some kernel third party modules are RPM-ified but many are not.

If, as I fear, it doesn’t happen organically, it will not happen. We simply do not have the bandwidth in the *COS teams and the broader community to maintain these SPECs and supporting scripts on our own, nor do we have the deployed base to provide the incentive to third parties to adopt this approach. (Again, if Fedora/RHEL/CentOS can’t drive this, how will we?)

What has happened organically in the kube/container space are variations on the approach best represented by Joe’s work on wireguard. I’d summarize this as:

1) Take the third party kernel material in whatever form it is currently delivered.

2) Automate the rebuild step, either within a container build task, or within a running container, using scripting of whatever mechanism is most appropriate for the material as delivered.

3) Define a minimal API-like interface to interact with these containers. Essentially: build, load, reload, unload and possibly “build for this pending kernel and err if it fails”.
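
As a sketch only, that interface could be as small as a container entrypoint like the following; the module name "foo", its source location, and the build step are placeholders rather than anything from the projects mentioned above:

    #!/bin/bash
    # Hypothetical DriverContainer entrypoint exposing the minimal verbs above.
    set -euo pipefail
    kver=${2:-$(uname -r)}

    case "${1:-}" in
        build)  make -C "/usr/src/foo" KVER="${kver}" ;;
        load)   modprobe foo ;;
        unload) modprobe -r foo ;;
        reload) modprobe -r foo && modprobe foo ;;
        *)      echo "usage: $0 {build|load|reload|unload} [kernel-version]" >&2
                exit 1 ;;
    esac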

This is substantially less prescriptive than RPMs plus package layering and has the advantage of being container-native-ish and uses packaging/bundling techniques with a much larger user base (container builds and running containers).

Thoughts?

ashcrow commented 5 years ago

If, as I fear, it doesn’t happen organically, it will not happen. We simply do not have the bandwidth in the *COS teams and the broader community to maintain these SPECs and supporting scripts on our own, nor do we have the deployed base to provide the incentive to third parties to adopt this approach. (Again, if Fedora/RHEL/CentOS can’t drive this, how will we?) What has happened organically in the kube/container space are variations on the approach best represented by Joe’s work on wireguard

:+1:

This is substantially less prescriptive than RPMs plus package layering and has the advantage of being container-native-ish and uses packaging/bundling techniques with a much larger user base (container builds and running containers).

I agree with this. As noted, there isn't anything wrong with RPMs, package layering, etc.; in fact they are quite powerful. But I tend to believe using OCI containers + builds has less friction, since that approach already has uptake.

dustymabe commented 5 years ago

Put bluntly I suspect that if we require RPM-ification as a prerequisite for third party modules on COS, we will get far fewer third party modules on COS.

I figured most things that people in the Fedora/RHEL/CentOS ecosystem care about can already be delivered as an rpm. I didn't know this was that big of a blocker.

  1. Take the third party kernel material in whatever form it is currently delivered.
  2. Automate the rebuild step, either within a container build task, or within a running container, using scripting of whatever mechanism is most appropriate for the material as delivered.
  3. Define a minimal API-like interface to interact with these containers. Essentially: build, load, reload, unload and possibly “build for this pending kernel and err if it fails”

Regarding steps 1/2: that's exactly what I was proposing we do on the build side somewhere, with the output of that process being RPMs that could then be consumed. I think my whole point here is that it would be much cleaner to do it this way than to add hooks that execute things on the host (which may or may not fail) and modify the host on every upgrade.

I think you've laid out a few points about why it's too hard to do it that way.

cgwalters commented 5 years ago

My deeper concern is with point 1) above. Put bluntly I suspect that if we require RPM-ification as a prerequisite for third party modules on COS, we will get far fewer third party modules on COS.

As I've said, I am quite sure it'd be easy for us to provide a container image which accepts kernel module sources (or potentially a pre-built module) and generates an RPM.

but I tend to believe using OCI containers + builds has less friction as it already has uptake.

But that doesn't solve the binding problem on its own. We're talking about kernel modules which execute fully on the host, so saying "OCI containers" is deceptive as it's really host-tied. There are some blurry lines here about how much containers are used, but it's not just containers.

zvonkok commented 5 years ago

For the NVIDIA use case, we have been using a DriverContainer for 3.10/3.11 (Atomic Host | RHEL) and for 4.x (RHCOS | RHEL): https://gitlab.com/nvidia/container-images/driver

The reference implementation of a GPU operator (https://github.com/openshift-psap/special-resource-operator), which NVIDIA uses as a template to implement their "official" GPU operator, uses the DriverContainer to install the drivers on a host (RHCOS or RHEL). Customers have been using the DriverContainer successfully since 3.10.

NVIDIA uses source installs, but we have created a DriverContainer that uses released RPMs; this way we are only using tested driver versions.

The GPU operator will check the kernel version and OS and deploy the correct DriverContainer to the node.

The benefits of a DriverContainer:

  • We can easily update drivers and libraries. In the case of NVIDIA, a prestart hook injects libs, bins, and config files from the DriverContainer into GPU workload containers.
  • It works on RHCOS and RHEL.
  • We are not touching the base OS.
  • If the node gets updated, the DriverContainer will not be scheduled on the new node, since it has a nodeSelector on kernel version, operating system, and library version.
  • We can easily have several DriverContainers running on the same node to support several accelerator cards.
  • DriverContainers take care of module loading and unloading, and of starting the services that are needed for a specific accelerator card to work.
  • If one removes the DriverContainer, it takes care of module unloading and cleanup.

dustymabe commented 5 years ago

Thanks @zvonkok for the references! That should be useful

dustymabe commented 4 years ago

On this topic I have been looking recently at the atomic-wireguard implementation and have come up with a similar proof of concept called kmods-via-containers. Included in the project is a complete simple-kmod example. I have also done some work to make sure this works with a real-world example; for that I used Mellanox on RHEL 8.

cgwalters commented 4 years ago

One thing that came up on a RHT-internal thread is that we should probably support (for RHCOS) driver update disks; there are apparently some vendors that make pre-built RPMs that actually use kABI, so they don't need to be rebuilt for kernel updates.

cgwalters commented 4 years ago

I think for OpenShift we should focus on https://github.com/openshift-psap/special-resource-operator. It has a lot of advantages in terms of integration, and is already relying on/solving thorny issues like subscription access.

lcnsir commented 4 years ago

@cgwalters thanks very much for your response on https://github.com/openshift/installer/issues/3761. So currently for RHCOS on OCP4 there is no day-1 or day-2 method for us to install additional kernel packages on the host via the installer, right? Even https://github.com/openshift-psap/special-resource-operator is not fully supported?

InfoSec812 commented 4 years ago

No additional technical content to add here, but I will say that I am seeing a lot more end users of OpenShift/CoreOS asking about this kind of functionality, especially to support their preferred security vendors. Tools like Falco, Sysdig, etc... It would be VERY useful to be able to say that there is a supported solution for getting kernel modules/settings into nodes without breaking the cluster.

wrender commented 3 years ago

If you create your own image, should you be able to install drivers/kernel modules? I can't seem to get this to work: https://www.itix.fr/blog/build-your-own-distribution-on-fedora-coreos/

dustymabe commented 3 years ago

If you create your own image, should you be able to install drivers/kernel modules?

Of course. If you create your own image you can do anything you like. By far the easiest way to get this to work (IMO) is to build your own RPM for the kernel module and include the RPM/yum repo in the manifest definition that's fed into cosa (which runs rpm-ostree).
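
A rough sketch of that, assuming a custom config checkout where a my-kmods.repo file sits next to the manifest (the repo id and package name are examples, not real projects):

    # Hypothetical fragment to merge into the manifest that cosa consumes;
    # the repo id must match a .repo file shipped alongside the manifest.
    repos:
      - my-kmods
    packages:
      - kmod-foo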