NVIDIA / gpu-driver-container

The NVIDIA GPU driver container allows the provisioning of the NVIDIA driver through the use of containers.
Apache License 2.0

Secure boot - Unable to load the kernel module 'nvidia.ko', ubuntu24.04 DRIVER_ARCH=x86_64, DRIVER_VERSION=535.161.07, CUDA_VERSION=12.3.2 #44

Closed AntonioCayulao closed 3 months ago

AntonioCayulao commented 3 months ago

Hi all,

I'm running an RKE2 cluster with the GPU Operator, but I compile the NVIDIA driver from source so that I can use driver version 535.161.07 and CUDA version 12.3.2 on Ubuntu 24.04 with Secure Boot enabled.

In the past I was able to build and run the container with this same setup, except that Secure Boot was not enabled.

I enrolled the keys in the BIOS and I'm passing one of them to sign the NVIDIA modules.
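For anyone reproducing this, one way to confirm the Secure Boot state is to read the `SecureBoot` EFI variable directly. The sketch below assumes efivarfs is mounted at the usual `/sys/firmware/efi/efivars` location, which may not be the case inside a container. The function name `secure_boot_state` is made up for illustration.

```shell
# Report the Secure Boot state from the SecureBoot EFI variable.
# Assumption: efivarfs is visible at the default path; pass another path as $1 if not.
secure_boot_state() {
    var="${1:-/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c}"
    if [ ! -r "$var" ]; then
        echo "unknown"
        return 0
    fi
    # efivarfs files begin with 4 attribute bytes; the value is the 5th byte.
    case "$(od -An -tu1 -j4 -N1 "$var" | tr -d ' \n')" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}
```

Calling `secure_boot_state` with no argument checks the default path and prints `unknown` when the variable is not readable.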

I removed the donkey step so that I can pass the keys to the script directly, replaced the sign-file binary with kmodsign, and added a call that signs nvidia.ko in place (just in case):

    if [ -n "${PRIVATE_KEY}" ]; then
        echo "Signing NVIDIA driver kernel modules..."
        sh -c "PATH=${PATH}:/usr/src/linux-headers-${KERNEL_VERSION}/scripts && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia.ko && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia.ko nvidia.ko.sign && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia-modeset.ko nvidia-modeset.ko.sign && \
          kmodsign sha512 ${PRIVATE_KEY} /drivers/kernel/pubkey.x509 nvidia-uvm.ko"
        ls -l
        nvidia_sign_args="--linked-module nvidia.ko --signed-module nvidia.ko.sign"
        nvidia_modeset_sign_args="--linked-module nvidia-modeset.ko --signed-module nvidia-modeset.ko.sign"
        nvidia_uvm_sign_args="--signed"

        echo "modinfo -F signer nvidia.ko"
        modinfo -F signer nvidia.ko
        echo "modinfo -F signer nvidia-uvm.ko"
        modinfo -F signer nvidia-uvm.ko
    fi

Afterwards, once the container is running, the logs show that the modules were signed:

Relinking NVIDIA driver kernel modules...
ld: warning: ./nvidia/nv-kernel.o_binary: missing .note.GNU-stack section implies executable stack
ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker
Signing NVIDIA driver kernel modules...
total 359276
-rw-r--r-- 1 root root    10949 Feb 18 00:02 Kbuild
-rw-r--r-- 1 root root     4801 Feb 18 00:02 Makefile
-rw-r--r-- 1 root root     9002 Jul  5 11:20 Module.symvers
drwxr-xr-x 3 root root     4096 Feb 18 00:04 common
drwxr-xr-x 4 root root     4096 Jul  5 11:19 conftest
-rwxr-xr-x 1 root root   251820 Feb 17 22:33 conftest.sh
-rw-r--r-- 1 root root      922 Feb 17 22:33 count-lines.mk
-rw-r--r-- 1 root root      239 Jul  5 11:20 modules.order
-rw-r--r-- 1 root root 12714128 Jul  5 11:20 nv-linux.o
-rw-r--r-- 1 root root   858720 Jul  5 11:20 nv-modeset-linux.o
-rw-r--r-- 1 root root       68 Jul  5 11:19 nv_compiler.h
drwxr-xr-x 5 root root    12288 Jul  5 11:20 nvidia
drwxr-xr-x 2 root root     4096 Jul  5 11:20 nvidia-drm
-rw-r--r-- 1 root root  4159744 Jul  5 11:20 nvidia-drm.ko
-rw-r--r-- 1 root root     1107 Jul  5 11:19 nvidia-drm.mod
-rw-r--r-- 1 root root    13755 Jul  5 11:20 nvidia-drm.mod.c
-rw-r--r-- 1 root root   157056 Jul  5 11:20 nvidia-drm.mod.o
-rw-r--r-- 1 root root  4005264 Jul  5 11:20 nvidia-drm.o
drwxr-xr-x 2 root root     4096 Jul  5 11:20 nvidia-modeset
-rw-r--r-- 1 root root  2499600 Jul  5 11:20 nvidia-modeset.ko
-rw-r--r-- 1 root root  2500045 Jul  5 11:20 nvidia-modeset.ko.sign
-rw-r--r-- 1 root root      205 Jul  5 11:19 nvidia-modeset.mod
-rw-r--r-- 1 root root     6740 Jul  5 11:20 nvidia-modeset.mod.c
-rw-r--r-- 1 root root   153600 Jul  5 11:20 nvidia-modeset.mod.o
-rw-r--r-- 1 root root  2348936 Jul  5 11:20 nvidia-modeset.o
drwxr-xr-x 2 root root     4096 Jul  5 11:20 nvidia-peermem
-rw-r--r-- 1 root root   389808 Jul  5 11:20 nvidia-peermem.ko
-rw-r--r-- 1 root root       66 Jul  5 11:19 nvidia-peermem.mod
-rw-r--r-- 1 root root     1133 Jul  5 11:20 nvidia-peermem.mod.c
-rw-r--r-- 1 root root   150544 Jul  5 11:20 nvidia-peermem.mod.o
-rw-r--r-- 1 root root   241008 Jul  5 11:20 nvidia-peermem.o
drwxr-xr-x 3 root root    20480 Jul  5 11:20 nvidia-uvm
-rw-r--r-- 1 root root 53730757 Jul  5 11:20 nvidia-uvm.ko
-rw-r--r-- 1 root root     7559 Jul  5 11:19 nvidia-uvm.mod
-rw-r--r-- 1 root root    17723 Jul  5 11:20 nvidia-uvm.mod.c
-rw-r--r-- 1 root root   158232 Jul  5 11:20 nvidia-uvm.mod.o
-rw-r--r-- 1 root root 53574720 Jul  5 11:20 nvidia-uvm.o
-rw-r--r-- 1 root root 76571533 Jul  5 11:20 nvidia.ko
-rw-r--r-- 1 root root 76571978 Jul  5 11:20 nvidia.ko.sign
-rw-r--r-- 1 root root     2609 Jul  5 11:19 nvidia.mod
-rw-r--r-- 1 root root    29023 Jul  5 11:20 nvidia.mod.c
-rw-r--r-- 1 root root   220656 Jul  5 11:20 nvidia.mod.o
-rw-r--r-- 1 root root 76396528 Jul  5 11:20 nvidia.o
modinfo -F signer nvidia.ko
<CN>
modinfo -F signer nvidia-uvm.ko
<CN>
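As a cross-check on the `modinfo -F signer` output, a signed module file ends with the marker string `~Module signature appended~` followed by a newline. The following is a minimal sketch that tests for that trailer; the helper name `has_appended_signature` is hypothetical. It only shows that a signature is attached to the file, not that the kernel trusts the key.

```shell
# Return 0 if the file ends with the kernel's module signature marker,
# 1 if it does not, and 2 if the file is missing.
has_appended_signature() {
    [ -f "$1" ] || return 2
    # The marker is 27 characters plus a trailing newline (28 bytes).
    # NUL bytes are stripped so binary tails do not trigger shell warnings.
    [ "$(tail -c 28 "$1" | tr -d '\000\n')" = '~Module signature appended~' ]
}
```

For example, `has_appended_signature nvidia.ko && echo signed` could be run in the build directory, and again on the file that the installer actually tries to load.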

But I get this error during the build, when the NVIDIA installer tries to load the modules:

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 20 CPUs online; setting concurrency level to 20.
Installing NVIDIA driver version 535.161.07.
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/6.8.0-35-generic/build'

Kernel output path: '/lib/modules/6.8.0-35-generic/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules
  : [##############################] 100%

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

Kernel module compilation complete.
Unable to determine if Secure Boot is enabled: No such file or directory
Kernel module load error: Key was rejected by service
Kernel messages:
[15237.195909] docker0: port 1(veth9b46407) entered blocking state
[15237.195916] docker0: port 1(veth9b46407) entered disabled state
[15237.195934] veth9b46407: entered allmulticast mode
[15237.195994] veth9b46407: entered promiscuous mode
[15237.196494] docker0: port 2(veth1b83b48) entered disabled state
[15237.197551] veth1b83b48 (unregistering): left allmulticast mode
[15237.197553] veth1b83b48 (unregistering): left promiscuous mode
[15237.197560] docker0: port 2(veth1b83b48) entered disabled state
[15237.402930] eth0: renamed from vetha1b1b5d
[15237.411020] docker0: port 1(veth9b46407) entered blocking state
[15237.411030] docker0: port 1(veth9b46407) entered forwarding state
[15237.452814] vetha1b1b5d: renamed from eth0
[15237.466296] docker0: port 1(veth9b46407) entered disabled state
[15237.483747] docker0: port 1(veth9b46407) entered disabled state
[15237.484038] veth9b46407 (unregistering): left allmulticast mode
[15237.484041] veth9b46407 (unregistering): left promiscuous mode
[15237.484049] docker0: port 1(veth9b46407) entered disabled state
[15347.929599] VFIO - User Level meta-driver version: 0.3
[15348.041634] Loading of unsigned module is rejected
[15413.411620] VFIO - User Level meta-driver version: 0.3
[15413.516083] Loading of unsigned module is rejected
[15472.262948] VFIO - User Level meta-driver version: 0.3
[15472.366325] Loading of unsigned module is rejected
[15593.975027] VFIO - User Level meta-driver version: 0.3
[15594.083684] Loading of unsigned module is rejected
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
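The kernel messages are more specific than the installer's generic error: "Loading of unsigned module is rejected" suggests that the file being loaded carried no signature at all, rather than a signature from an untrusted key. The sketch below filters kernel log text for that kind of line. The function name `classify_sig_rejections` is made up, and the two message strings that do not appear in the log above are recalled from the kernel's module signing code, so treat them as assumptions.

```shell
# Read kernel log text on stdin and print the distinct rejection reasons found.
classify_sig_rejections() {
    while IFS= read -r line; do
        case "$line" in
            *"Loading of unsigned module is rejected"*)
                echo "unsigned" ;;
            # Assumed wording for the other enforcement messages:
            *"Loading of module with unavailable key is rejected"*)
                echo "unavailable-key" ;;
            *"Loading of module with unsupported crypto is rejected"*)
                echo "unsupported-crypto" ;;
        esac
    done | sort -u
}
```

Typical usage would be `dmesg | classify_sig_rejections`. A result of `unsigned` points at the step that builds or installs the module, not at key enrollment.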

In addition, I added these options to both the NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run and nvidia-installer invocations in the script:

--module-signing-secret-key="${PRIVATE_KEY}"

--module-signing-public-key=/drivers/kernel/pubkey.x509

Following the documentation here:

init() {
    if [ "${DRIVER_TYPE}" = "vgpu" ]; then
        _find_vgpu_driver_version || exit 1
    fi

    # Install the userspace components and copy the kernel module sources.
    sh NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION.run -x \
        --module-signing-secret-key="${PRIVATE_KEY}" \
        --module-signing-public-key=/drivers/kernel/pubkey.x509 && \
        cd NVIDIA-Linux-$DRIVER_ARCH-$DRIVER_VERSION && \
        ./nvidia-installer --silent \
                    --module-signing-secret-key="${PRIVATE_KEY}" \
                    --module-signing-public-key=/drivers/kernel/pubkey.x509 \
                    --no-dkms \
                    --force-selinux=no \
                    --disable-nouveau \
                    --no-kernel-module \
                    --no-nvidia-modprobe \
                    --no-rpms \
                    --no-backup \
                    --no-check-for-alternate-installs \
                    --no-libglx-indirect \
                    --no-install-libglvnd \
                    --x-prefix=/tmp/null \
                    --x-module-path=/tmp/null \
                    --x-library-path=/tmp/null \
                    --x-sysconfig-path=/tmp/null && \
        mkdir -p /usr/src/nvidia-${DRIVER_VERSION} && \
        mv LICENSE mkprecompiled ${KERNEL_TYPE} /usr/src/nvidia-${DRIVER_VERSION} && \
        sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest > /usr/src/nvidia-${DRIVER_VERSION}/.manifest

    echo -e "\n========== NVIDIA Software Installer ==========\n"
    echo -e "Starting installation of NVIDIA driver version ${DRIVER_VERSION} for Linux kernel version ${KERNEL_VERSION}\n"

I have attached the full log from /var/log/nvidia-installer.log.

I hope you can help me solve this issue :disappointed:

Have a good day,

Antonio.

nvidia-installer.log

AntonioCayulao commented 3 months ago

[Solved] All the magic was in passing the signing arguments to the _install_driver() function:

_install_driver() {
    local install_args=()

    echo "Installing NVIDIA driver kernel modules..."
    cd /usr/src/nvidia-${DRIVER_VERSION}
    if [ -d /lib/modules/${KERNEL_VERSION}/kernel/drivers/video ]; then
        rm -rf /lib/modules/${KERNEL_VERSION}/kernel/drivers/video
    else
        rm -rf /lib/modules/${KERNEL_VERSION}/video
    fi

    if [ "${ACCEPT_LICENSE}" = "yes" ]; then
        install_args+=("--accept-license")
    fi

    nvidia-installer --module-signing-secret-key="${PRIVATE_KEY}" \
                     --module-signing-public-key=/drivers/kernel/pubkey.x509 \
                     --kernel-module-only --no-drm --ui=none --no-nouveau-check -m=${KERNEL_TYPE} ${install_args[@]+"${install_args[@]}"}
}

btw, I have the argument ACCEPT_LICENSE="".
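If the same script also has to work on hosts without Secure Boot, the signing options could be added only when a key is provided. This is a rough sketch, not the upstream script: the helper name `signing_args` is hypothetical, and only the two `--module-signing-*` options already used above are assumed.

```shell
# Print the nvidia-installer signing options, one per line, when both a
# private key ($1) and a public certificate ($2) are given; print nothing otherwise.
signing_args() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        printf '%s\n' \
            "--module-signing-secret-key=$1" \
            "--module-signing-public-key=$2"
    fi
}
```

In bash, the output can be collected with `mapfile -t sign_args < <(signing_args "${PRIVATE_KEY}" /drivers/kernel/pubkey.x509)` and passed to the installer as `"${sign_args[@]}"` alongside the existing options.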