NVIDIA / yum-packaging-precompiled-kmod

NVIDIA precompiled kernel module packaging for RHEL
Apache License 2.0

Kernel version string updated release tags #36

Open kmittman opened 1 year ago

kmittman commented 1 year ago

Hi folks,

Just wanted to quickly jump in on this issue. The current Rocky Linux kernel is not ahead of RHEL, but the Release tags can differ for a few reasons. Mostly, and in this case, it's because we've had to republish the same version because of a non-technical change. Because we never publish the same NVR twice, we add .rocky or .0.1 to the Release field in this case.

kernel-4.18.0-372.16.1.el8_6.0.1 and kernel-4.18.0-372.16.1.el8_6 are in fact equal.

I'm not sure if there is a good way to make this work with kmods. Thankfully though we do keep all published artifacts so users can generally use the first released version using the dnf install method you provided above @kmittman.

Thanks, Mustafa Gezen Release Engineering lead @ Rocky Linux

Originally posted by @mstg in https://github.com/NVIDIA/yum-packaging-precompiled-kmod/issues/35#issuecomment-1194523399

kmittman commented 1 year ago

Hi @mstg Thank you for the clarification!

Given the new information, I'm not sure how to resolve this situation.

We use a Python script: https://github.com/NVIDIA/yum-packaging-nvidia-plugin/blob/rhel8/nvidia-dnf.py to pick the appropriate precompiled kmod dependency (and block kernel updates for unavailable kmods). This requires the kernel version string to match exactly.
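To illustrate why an exact string match holds back the Rocky rebuild, here is a minimal sketch. It is not taken from nvidia-dnf.py: the function names `strip_rebuild_suffix` and `held_back`, the regular expression, and the sample kernel lists are all hypothetical. It assumes release strings without an architecture suffix, since anything after the dist tag (including `.x86_64`) would be stripped.

```python
import re

# Hypothetical pattern: keep everything up to the dist tag (".el8", ".el8_6", ...)
# and drop whatever a rebuild appended after it (".0.1", ".rocky", ...).
_DIST_TAG = re.compile(r"^(.*\.el\d+(?:_\d+)?)(?:\..*)?$")


def strip_rebuild_suffix(release):
    """Return the release string truncated right after its dist tag."""
    m = _DIST_TAG.match(release)
    return m.group(1) if m else release


def held_back(candidates, kmod_kernels, normalize=False):
    """Return the candidate kernels for which no matching kmod is available."""
    key = strip_rebuild_suffix if normalize else (lambda s: s)
    available = {key(k) for k in kmod_kernels}
    return [k for k in candidates if key(k) not in available]


kmods = ["4.18.0-372.16.1.el8_6"]      # kmod built against the RHEL kernel
rocky = ["4.18.0-372.16.1.el8_6.0.1"]  # same kernel, republished by Rocky

print(held_back(rocky, kmods))                  # exact match: kernel is held back
print(held_back(rocky, kmods, normalize=True))  # normalized: nothing is held back
```

Whether stripping the suffix is safe depends on the rebuild really being equivalent, which is the case described above but is not guaranteed in general.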

Also, as I mentioned, officially this project only supports RHEL8 (and RHEL9) kernels. So I will need to ponder how to unblock Rocky users.

mstg commented 1 year ago

Thanks for opening a new issue @kmittman.

I took a look at some of the kmods provided by Red Hat, and they seem to use >= rather than = when matching kernel versions. I know this is not ideal, as someone can then install the kmod on a newer kernel that it wasn't built for, but it might be worth taking a look at.

Example: https://git.centos.org/rpms/kmod-kvdo/blob/c8/f/SPECS/kmod-kvdo.spec#_29

For now this seems to be the "only" solution that can be achieved using RPM only. We're evaluating options on doing Provides: original-NVR as well, so I'll let you know if we can fix this on our end (hopefully).

nazunalika commented 1 year ago

I'd like to provide my thoughts on this particular situation. I would highly recommend going with an approach like kmod-kvdo, or the approach the elrepo community members use to handle these types of scenarios, going forward. This means building on an initial version and requiring it or newer when installing.

The elrepo approach:

The initial kernel version is defined and can be overridden in special circumstances. The build requires the version defined. It does not have to be the latest. It just has to be an initial or older version during EL's 6 month cycle for that kernel release. Ultimately, it requires equal or newer than it was built with. It also uses weak modules (check the %prep tag).

kvdo does it a bit differently but the premise is the same:

  • Initial version is defined
  • Build requires it or newer
  • Ultimately, it requires equal or newer than it was built with. It also uses weak modules (see lines 49, 50, 83-93).

Changing our spec file to fix a packaging issue outside of our ecosystem would not be the right approach, in my opinion. Using the approaches above would alleviate issues for all EL derivatives. In Fedora, it makes sense to be constantly rebuilding and have very specific requirements (when not using akmods, for example). In Enterprise Linux, it makes less sense because the kernel release version (kernel-4.18.0-XXX in this example) changes once every 6 months.

michaelbarkdoll commented 1 year ago

Just wanted to say that we're experiencing this issue again on Rocky Linux 9.

kmittman commented 1 year ago

@michaelbarkdoll Can you clarify if this is an upgrade or a fresh install?

michaelbarkdoll commented 1 year ago

> @michaelbarkdoll Can you clarify if this is an upgrade or a fresh install?
>
>   • If upgrade then dnf-plugin-nvidia should hold back the Rocky kernel for which there is no matching kmod

It was an upgrade.

michaelbarkdoll commented 1 year ago

https://forums.developer.nvidia.com/t/nvidia-driver-not-booting/241056

nazunalika commented 1 year ago

The strictness of the kernel version in the kmod packaging here is what is ultimately causing this issue for RL9 users. I would recommend the packaging/builds be eased to match how elrepo, rpmfusion, or even Red Hat (using kmod-kvdo as the example) do their packaging. See my previous comment for more details.

There is no reason an added .0.X at the end of a kernel version should cause a kmod update to fail. Holding back a kernel shouldn't have to happen either, in my opinion. This causes issues for users who are using the nvidia repositories for their drivers rather than elrepo or rpmfusion.

In the rpmfusion case, when you dnf install kmod-nvidia, which is a meta package, it pulls in kmod-nvidia-5.14.0-162.el9_1..., which means it was specifically built for 5.14.0-162, regardless of the updates that come after. Here are the resulting requires:

. . .
/usr/sbin/depmod
/usr/sbin/depmod
/usr/sbin/weak-modules
/usr/sbin/weak-modules
kernel >= 5.14.0-162.el9_1
kernel < 5.14.0-163.el9_1
. . .

This specifically wants anything equal or greater than 5.14.0-162.el9_1 but less than 5.14.0-163.el9_1. This is logical as 162 is for 9.1, and it will be higher in 9.2 come May, and this driver should just work on any updates that come to 5.14.0-162.

Here is their spec file for reference.
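The range in the requires above (kernel >= 5.14.0-162.el9_1 and kernel < 5.14.0-163.el9_1) can be sketched with a simplified version comparison. This is an approximation of RPM's segment-wise ordering, not RPM's own code: it ignores epochs and the `~` and `^` operators, and the names `vercmp`, `evr_cmp`, and `satisfies` are hypothetical.

```python
import re


def _segments(s):
    """Split a version string into runs of digits and runs of letters."""
    return re.findall(r"\d+|[A-Za-z]+", s)


def vercmp(a, b):
    """Simplified RPM-style comparison: returns -1, 0, or 1."""
    sa, sb = _segments(a), _segments(b)
    for x, y in zip(sa, sb):
        if x.isdigit() and y.isdigit():
            xi, yi = int(x), int(y)
            if xi != yi:
                return 1 if xi > yi else -1
        elif x.isdigit():
            return 1    # a numeric segment sorts newer than an alphabetic one
        elif y.isdigit():
            return -1
        elif x != y:
            return 1 if x > y else -1
    if len(sa) == len(sb):
        return 0
    return 1 if len(sa) > len(sb) else -1  # leftover segments sort newer


def evr_cmp(a, b):
    """Compare two 'version-release' strings (no epoch handling)."""
    va, _, ra = a.rpartition("-")
    vb, _, rb = b.rpartition("-")
    return vercmp(va, vb) or vercmp(ra, rb)


def satisfies(kernel, low="5.14.0-162.el9_1", high="5.14.0-163.el9_1"):
    """True if low <= kernel < high, mirroring the two requires above."""
    return evr_cmp(kernel, low) >= 0 and evr_cmp(kernel, high) < 0


print(satisfies("5.14.0-162.6.1.el9_1"))      # an update within the 162 series
print(satisfies("5.14.0-162.6.1.el9_1.0.1"))  # the same update with a .0.1 suffix
print(satisfies("5.14.0-163.el9_1"))          # the next series
```

Under this ordering a trailing .0.1 makes the release compare as newer than the original, so it satisfies a >= requirement but not an = requirement.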