NVIDIA / yum-packaging-precompiled-kmod

NVIDIA precompiled kernel module packaging for RHEL
Apache License 2.0

Investigate EL kernel version alignment #43

Open kmittman opened 1 year ago

NVIDIA-provided precompiled kmod RPMs officially support only RHEL kernels. Each package is built and tested on Red Hat Enterprise Linux against one specific kernel release. This blog post goes into more detail.

A frequently asked question concerns the technical reasons why other RHEL-like kernels are not compatible. The primary blocker is that, to avoid any potential ABI incompatibility, the precompiled design requires an exact match of the kernel version string.
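To make the exact-match requirement concrete, here is a minimal sketch. It assumes the check amounts to a plain string comparison between the kernel release a kmod was built for and the running kernel (what uname -r reports). The helper name kernel_matches and all version strings are hypothetical illustrations, not values taken from NVIDIA's packages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the running kernel release string
# is identical to the release the kmod was built against. There is no
# "close enough" range; any difference in the string means no match.
kernel_matches() {
    local built_for="$1" running="$2"
    [[ "$running" == "$built_for" ]]   # quoted RHS: literal comparison, no globbing
}

# Hypothetical example values; on a real host, use "$(uname -r)" for running.
built_for="4.18.0-425.3.1.el8.x86_64"
for running in "4.18.0-425.3.1.el8.x86_64" \
               "4.18.0-425.3.1.0.1.el8.x86_64" \
               "4.18.0-425.10.1.el8_7.x86_64"; do
    if kernel_matches "$built_for" "$running"; then
        echo "$running: match"
    else
        echo "$running: no match"
    fi
done
```

The second and third strings share the same 4.18.0 base but differ in the release component, so neither qualifies.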

Let's look at some kernel-core data for RHEL and the RHEL-like distributions Rocky Linux, Oracle Linux and Alma Linux.

Prerequisites

Rocky Linux and Alma Linux both archive packages from previous y-stream releases in vault repositories, so first enable those vault repos.

# rockylinux:8
old_releases=('8.6' '8.5' '8.4' '8.3')
# rockylinux:9
old_releases=('9.0')
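If you'd rather not edit the array by hand, a sketch like the following picks the list from the major version in VERSION_ID (the field found in /etc/os-release). The function name old_releases_for is hypothetical, and the release lists simply mirror the manual definitions for Rocky Linux (Alma Linux uses the same lists); they would need updating as new minor releases move to the vault.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the archived (vault) minor releases for a
# given VERSION_ID, one per line. Lists mirror the manual definitions.
old_releases_for() {
    local major="${1%%.*}"   # "8.7" -> "8", "9.1" -> "9"
    case "$major" in
        8) printf '%s\n' 8.6 8.5 8.4 8.3 ;;
        9) printf '%s\n' 9.0 ;;
        *) echo "unsupported major version: $major" >&2; return 1 ;;
    esac
}

# On a real host: VERSION_ID=$(. /etc/os-release && echo "$VERSION_ID")
VERSION_ID="8.7"   # example value
mapfile -t old_releases < <(old_releases_for "$VERSION_ID")
echo "${old_releases[@]}"
```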

rockyvault="https://dl.rockylinux.org/vault/rocky"
for ver in "${old_releases[@]}"; do
    repo="$rockyvault/$ver/BaseOS/x86_64/os"
    echo -e "[Rocky-Vault-$ver]\nname=Rocky-Vault-$ver\nbaseurl=$repo/\ngpgcheck=1\nenabled=1\ngpgkey=$repo/RPM-GPG-KEY-rockyofficial" | tee "/etc/yum.repos.d/Rocky-Vault-$ver.repo"
done

# almalinux:8
old_releases=('8.6' '8.5' '8.4' '8.3')
# almalinux:9
old_releases=('9.0')

almavault="https://repo.almalinux.org/vault"
for ver in "${old_releases[@]}"; do
    repo="$almavault/$ver/BaseOS/x86_64/os"
    # AlmaLinux 8 and 9 are signed with different keys; pick the key for this major version
    almagpg="https://repo.almalinux.org/almalinux/RPM-GPG-KEY-AlmaLinux-${ver%%.*}"
    echo -e "[Alma-Vault-$ver]\nname=Alma-Vault-$ver\nbaseurl=$repo/\ngpgcheck=1\nenabled=1\ngpgkey=$almagpg" | tee "/etc/yum.repos.d/Alma-Vault-$ver.repo"
done
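Both vault loops write the same five-key .repo stanza and differ only in the repo id, base URL and GPG key. A minimal sketch of a shared helper, assuming you want to preview the stanza before writing it under /etc/yum.repos.d (which needs root). The name vault_repo_stanza is hypothetical; it uses printf rather than echo -e because printf handles the escapes the same way in any POSIX shell. Deriving the key name from the major version assumes the RPM-GPG-KEY-AlmaLinux-<major> naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a yum/dnf .repo stanza to stdout.
# Args: repo id, base URL (without trailing slash), GPG key URL.
vault_repo_stanza() {
    printf '[%s]\nname=%s\nbaseurl=%s/\ngpgcheck=1\nenabled=1\ngpgkey=%s\n' \
        "$1" "$1" "$2" "$3"
}

ver="8.6"
repo="https://repo.almalinux.org/vault/$ver/BaseOS/x86_64/os"
gpg="https://repo.almalinux.org/almalinux/RPM-GPG-KEY-AlmaLinux-${ver%%.*}"

# Preview only; to install, pipe to: sudo tee "/etc/yum.repos.d/Alma-Vault-$ver.repo"
vault_repo_stanza "Alma-Vault-$ver" "$repo" "$gpg"
```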

List kernel packages

dnf list kernel-core --showduplicates

# Keep only the version-release column (field 2), deduplicated and version-sorted
dnf list kernel-core --showduplicates | awk '{print $2}' | grep "\.el" | sort -uV
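To see what the filter keeps, here is the same awk | grep | sort -uV pipeline run over a few sample lines shaped like dnf list output (package, version-release, repo). The sample lines are made up for illustration, and sort -V is a GNU coreutils option.

```shell
#!/usr/bin/env bash
# Made-up lines in the column layout of `dnf list kernel-core --showduplicates`.
sample_dnf_list() {
    printf '%s\n' \
        "Installed Packages" \
        "kernel-core.x86_64   4.18.0-425.3.1.el8      @baseos" \
        "Available Packages" \
        "kernel-core.x86_64   4.18.0-372.9.1.el8      Rocky-Vault-8.6" \
        "kernel-core.x86_64   4.18.0-425.3.1.el8      baseos" \
        "kernel-core.x86_64   4.18.0-425.10.1.el8_7   baseos"
}

# Field 2 is version-release; header lines drop out because their second
# field ("Packages") contains no ".el". sort -uV removes the duplicate and
# orders 425.3.1 before 425.10.1 (version order, not lexical order).
sample_dnf_list | awk '{print $2}' | grep "\.el" | sort -uV
```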

Plot EL8 kernels

RHEL8 precompiled status page

-------------A--B--C--D------------------A--B--C--D---
| 1|    8.0 [+][ ][+][ ]    |30|        [+][+][+][+]
| 2|        [+][ ][+][ ]    |31|        [+][+][+][+]
| 3|        [+][ ][+][ ]    |32|        [+][+][+][+]
| 4|        [+][ ][+][ ]    |33|        [+][+][+][+]
| 5|        [+][ ][+][ ]    |34|        [+][+][+][+]
| 6|        [+][ ][+][ ]    |35|        [+][+][+][+]
| 7|        [+][ ][+][ ]    |36|    8.5 [+][+][+][+]
| 8|        [+][ ][+][ ]    |37|        [+][+][+][+]
| 9|    8.1 [+][ ][+][ ]    |38|        [+][+][+][+]
|10|        [+][ ][+][ ]    |39|        [+][+][+][+]
|11|        [+][ ][+][ ]    |40|        [+][+][+][+]
|12|        [+][ ][+][ ]    |41|        [+][+][+][+]
|13|        [+][ ][+][ ]    |42|    8.6 [+][+][+][+]
|14|        [+][ ][+][ ]    |43|        [+][+][ ][+]
|15|    8.2 [+][ ][+][ ]    |44|        [ ][ ][+][ ]
|16|        [+][ ][+][ ]    |45|        [+][+][ ][+]
|17|        [+][ ][+][ ]    |46|        [ ][+][ ][ ]
|18|        [+][ ][+][ ]    |47|        [ ][ ][+][ ]
|19|        [+][ ][+][ ]    |48|        [+][+][ ][+]
|20|        [+][ ][+][ ]    |49|        [ ][ ][+][ ]
|21|        [+][ ][+][ ]    |50|        [+][+][ ][+]
|22|    8.3 [+][ ][+][+]    |51|        [ ][ ][+][ ]
|23|        [+][ ][+][ ]    |52|        [+][+][ ][+]
|24|        [+][ ][+][ ]    |53|        [ ][ ][+][ ]
|25|        [+][ ][+][ ]    |54|    8.7 [+][+][+][+]
|26|        [+][ ][+][+]    |55|        [+][+][+][+]
|27|        [+][+][+][+]    |56|        [+][+][+][+]
|28|    8.4 [+][ ][+][+]    |57|        [+][+][+][+]
|29|        [+][+][+][+]                    
------------------------------------------------------
A. Red Hat Enterprise Linux
B. Rocky Linux
C. Oracle Linux
D. Alma Linux
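Each chart is essentially a presence matrix: for every kernel release seen on any distro, it marks whether each distro ships it. A sketch of how such a matrix could be produced, assuming the filtered version list from each distro has already been saved to a file named after the distro (the file layout and the function name presence_matrix are hypothetical, and the demo uses tiny made-up lists). It prints the same [+] / [ ] cells, labeled with the version string instead of a row index.

```shell
#!/usr/bin/env bash
# Print one row per kernel version: the version string followed by a
# [+] (present) or [ ] (absent) cell for each distro, in argument order.
# Hypothetical input layout: $dir/<distro>.txt, one version-release per line.
presence_matrix() {
    local dir="$1"; shift
    local files=() d v row
    for d in "$@"; do files+=("$dir/$d.txt"); done
    # Union of every version seen on any distro, in version order
    sort -uV "${files[@]}" | while IFS= read -r v; do
        row="$v"
        for d in "$@"; do
            # -x: match the whole line, -F: treat dots literally
            if grep -qxF -- "$v" "$dir/$d.txt"; then row+=" [+]"; else row+=" [ ]"; fi
        done
        echo "$row"
    done
}

# Demo with made-up lists in a throwaway directory
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 4.18.0-425.3.1.el8 4.18.0-425.10.1.el8_7 > "$tmp/rhel.txt"
printf '%s\n' 4.18.0-425.3.1.el8                       > "$tmp/rocky.txt"
printf '%s\n' 4.18.0-425.3.1.el8 4.18.0-425.10.1.el8_7 > "$tmp/alma.txt"
presence_matrix "$tmp" rhel rocky alma
```

Any row where the RHEL column is [+] but another column is [ ] (or the reverse) is a kernel for which a RHEL-built kmod and that distro's kernel cannot line up.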

Plot EL9 kernels

RHEL9 precompiled status page

-------------A--B--C--D--------------
| 1|    9.0 [+][ ][ ][+]
| 2|        [ ][ ][+][ ]
| 3|        [+][ ][ ][+]
| 4|        [ ][ ][+][ ]
| 5|        [+][ ][ ][+]
| 6|        [ ][ ][+][ ]
| 7|        [+][ ][ ][+]
| 8|        [ ][ ][+][ ]
| 9|        [+][+][ ][+]
|10|        [ ][ ][+][ ]
|11|    9.1 [+][ ][+][+]
|12|        [+][ ][+][+]
|13|        [+][ ][+][+]
|14|        [+][+][+][+]
-------------------------------------
A. Red Hat Enterprise Linux
B. Rocky Linux
C. Oracle Linux
D. Alma Linux

Summary

Although kernel versions sometimes overlap, often they do not (kernels are missing, versioned differently, etc.). This results in non-deterministic install behavior that depends on when the dnf transaction occurs.
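The overlap can be measured directly with comm, given two version lists produced by the filter command (one run per distro). A minimal sketch with made-up inline data: comm -12 lists kernels shipped by both RHEL and the other distro (a RHEL-built kmod could match), while comm -13 lists kernels only the other distro ships (no precompiled kmod would match). comm needs both inputs sorted in the same lexical order, hence LC_ALL=C sort -u rather than sort -V.

```shell
#!/usr/bin/env bash
# Made-up version lists standing in for filtered `dnf list` output.
rhel_kernels()  { printf '%s\n' 4.18.0-425.3.1.el8 4.18.0-425.10.1.el8_7; }
other_kernels() { printf '%s\n' 4.18.0-425.3.1.el8 4.18.0-425.13.1.el8_7; }

# comm requires lexically sorted input; use the C locale for both sort and comm.
sorted() { LC_ALL=C sort -u; }

echo "Shipped by both (a RHEL-built kmod can match):"
LC_ALL=C comm -12 <(rhel_kernels | sorted) <(other_kernels | sorted)

echo "Other distro only (no matching precompiled kmod):"
LC_ALL=C comm -13 <(rhel_kernels | sorted) <(other_kernels | sorted)
```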

Put another way: suppose the kernels are aligned today and the precompiled install succeeds on machine A (RHEL-like). If a new kernel is released next week, the same install may fail on machine B (RHEL-like) because no compatible kmod package is available for that kernel.

As such, attempts to use the precompiled modular streams provided in the CUDA repository on non-RHEL distros result in a degraded user experience and are not supported by NVIDIA.

Instead, sysadmins are encouraged to build their own precompiled kmod RPMs using the instructions provided in this git repo, or otherwise to use the DKMS modular streams.