NVIDIA / nvidia-installer

NVIDIA driver installer
GNU General Public License v2.0

NVIDIA installer failing to recognize pre-compiled kernel modules #25

Open dioguerra opened 2 years ago

dioguerra commented 2 years ago

Hello,

I'm currently trying to build an NVIDIA driver container for Fedora, heavily based on the project at https://gitlab.com/nvidia/container-images/driver/-/tree/master/fedora. So far I have found some bugs, or at least unclear descriptions, in the arguments that nvidia-installer accepts:

  1. --kernel-install-path does not work: the modules are not copied to the requested location (they stay in the current directory instead).
  2. --no-kernel-module not only skips installing the module (which is correct) but also skips compiling it (which the help text does not make clear).

For 2, the problem is that there is no option to compile the modules without trying to install them, which forces the user to hack around the nvidia-installer command.
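One possible workaround, given here only as an untested sketch, is to skip nvidia-installer for the compile step and invoke the Makefile in the extracted kernel/ directory directly. The `SYSSRC` variable and the `modules` target are assumptions about that Makefile and should be checked against the driver version in use; `build_cmd` is a hypothetical helper name. The script only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the make command that would compile the
# kernel modules in <driver_dir>/kernel against the given kernel sources.
# SYSSRC and the "modules" target are assumptions about NVIDIA's Makefile.
build_cmd() {
    kernel_version=$1
    driver_dir=$2
    printf 'make -C %s/kernel SYSSRC=/usr/src/kernels/%s modules\n' \
        "$driver_dir" "$kernel_version"
}

# Dry run by default: print the command. Set RUN=1 to execute it.
cmd=$(build_cmd 5.16.13-200.fc35.x86_64 NVIDIA-Linux-x86_64-470.129.06)
echo "$cmd"
if [ "${RUN:-0}" = "1" ]; then
    sh -c "$cmd"
fi
```

If this works, the resulting .ko files stay in the kernel/ directory, so copying them into the desired install path would still be a separate manual step.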

To reproduce, replace the Dockerfile in the https://gitlab.com/nvidia/container-images/driver/-/tree/master/fedora repo with the one below, edit the Makefile to include the fedora distro, and build with make build-fedora-470.129.06:

ARG FEDORA_VERSION=fedora@sha256:f1e3a29da8990568c1da6a460cf9658ee7e9b409aa39c2aded67f7ac1dfe7e8a
ARG CUDA_VERSION=11.7.0

FROM nvidia/cuda:$CUDA_VERSION-base-ubi8 as license

FROM fedora:$FEDORA_VERSION

RUN dnf install -y \
        ca-certificates \
        curl \
        gcc \
        glibc.i686 \
        make \
        cpio \
        kmod \
        'dnf-command(download)' && \
    rm -rf /var/cache/yum/*

RUN curl -fsSL -o /usr/local/bin/donkey https://github.com/3XX0/donkey/releases/download/v1.1.0/donkey && \
    curl -fsSL -o /usr/local/bin/extract-vmlinux https://raw.githubusercontent.com/torvalds/linux/master/scripts/extract-vmlinux && \
    chmod +x /usr/local/bin/donkey /usr/local/bin/extract-vmlinux

#ARG BASE_URL=http://us.download.nvidia.com/XFree86/Linux-x86_64
ARG BASE_URL=https://us.download.nvidia.com/tesla
ARG DRIVER_VERSION=470.129.06
ENV DRIVER_VERSION=$DRIVER_VERSION

RUN ln -s /sbin/ldconfig /sbin/ldconfig.real
RUN dnf install util-linux  -y
#kernel-devel kernel-headers
#module-init-tools
#kernel-devel-5.18.10-200.fc36.x86_64
RUN curl -O https://kojipkgs.fedoraproject.org//packages/kernel/5.16.13/200.fc35/x86_64/kernel-devel-5.16.13-200.fc35.x86_64.rpm && \
    dnf install kernel-devel-5.16.13-200.fc35.x86_64.rpm -y

RUN mkdir -p /lib/modules/5.16.13-200.fc35.x86_64/kernel/drivers/video
# Install the userspace components and copy the kernel module sources.
RUN cd /tmp && \
    curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run && \
    sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run -x && \
    cd NVIDIA-Linux-x86_64-$DRIVER_VERSION && \
    ./nvidia-installer --silent \
                       --kernel-source-path=/usr/src/kernels/5.16.13-200.fc35.x86_64 \
                       --kernel-install-path=/lib/modules/5.16.13-200.fc35.x86_64/kernel/drivers/video \
                       --install-compat32-libs \
                       --no-nouveau-check \
                       --no-nvidia-modprobe \
                       --no-rpms \
                       --no-backup \
                       --no-check-for-alternate-installs \
                       --no-libglx-indirect \
                       --no-install-libglvnd \
                       --x-prefix=/tmp/null \
                       --x-module-path=/tmp/null \
                       --x-library-path=/tmp/null \
                       --x-sysconfig-path=/tmp/null || true && \
    mkdir -p /usr/src/nvidia-$DRIVER_VERSION && \
    mv LICENSE mkprecompiled kernel /usr/src/nvidia-$DRIVER_VERSION && \
    sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest > /usr/src/nvidia-$DRIVER_VERSION/.manifest && \
    rm -rf /tmp/*

COPY nvidia-driver /usr/local/bin

WORKDIR /usr/src/nvidia-$DRIVER_VERSION

ARG PUBLIC_KEY=empty
COPY ${PUBLIC_KEY} kernel/pubkey.x509

ARG PRIVATE_KEY
ARG KERNEL_VERSION=latest

LABEL io.k8s.display-name="NVIDIA Driver Container"
LABEL name="NVIDIA Driver Container"
LABEL vendor="NVIDIA"
LABEL version="${DRIVER_VERSION}"
LABEL release="N/A"
LABEL summary="Provision the NVIDIA driver through containers"
LABEL description="See summary"

# Add NGC DL license
COPY --from=license /NGC-DL-CONTAINER-LICENSE /licenses/NGC-DL-CONTAINER-LICENSE

ENTRYPOINT ["nvidia-driver", "init"]

Also, since we are on the subject of building the NVIDIA kernel modules: even though I have precompiled the kernel modules (they are copied to /usr/src/nvidia-$DRIVER_VERSION as per the Dockerfile above), when I use the image on the target VM (which has one NVIDIA T4 GPU), nvidia-driver init does not recognize the existing modules that ship with the image. Instead it tries to recompile everything (without success), because it tries to fetch dependencies that are not available in the Fedora koji repo.

+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=470.129.06
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ KOJI_BASE_URL=https://kojipkgs.fedoraproject.org
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=5.16.13-200.fc35.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ init
+ echo -e '\n========== NVIDIA Software Installer ==========\n'

========== NVIDIA Software Installer ==========

+ echo -e 'Starting installation of NVIDIA driver version 470.129.06 for Linux kernel version 5.16.13-200.fc35.x86_64\n'
Starting installation of NVIDIA driver version 470.129.06 for Linux kernel version 5.16.13-200.fc35.x86_64

+ exec
+ flock -n 3
+ echo 1128694
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
Checking NVIDIA driver packages...
+ [[ ! -d /usr/src/nvidia-470.129.06/kernel ]]
+ cd /usr/src/nvidia-470.129.06/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/5.16.13-200.fc35.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ dnf -q makecache
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.3GEbfgOEOZ
+ trap 'rm -rf /tmp/tmp.3GEbfgOEOZ' EXIT
+ cd /tmp/tmp.3GEbfgOEOZ
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ mkdir -p /lib/modules/5.16.13-200.fc35.x86_64/proc
+ KERNEL_RPM_VERSION=5.16.13
+ KERNEL_RPM_RELEASE=5.16.13-200.fc35
+ KERNEL_RPM_RELEASE=200.fc35
+ KERNEL_RPM_ARCH=x86_64
+ echo 'Installing Linux kernel headers...'
Installing Linux kernel headers...
+ dnf -q -y install kernel-headers-5.16.13-200.fc35.x86_64
Error: Unable to find a match: kernel-headers-5.16.13-200.fc35.x86_64
+ echo 'Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in repositories.'
Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in repositories.
+ echo 'Trying to download kernel-headers from koji...'
Trying to download kernel-headers from koji...
+ KOJI_KERNEL_HEADERS_RPM=https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.16.13/200.fc35/x86_64/kernel-headers-5.16.13-200.fc35.x86_64.rpm
+ dnf -q -y install https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.16.13/200.fc35/x86_64/kernel-headers-5.16.13-200.fc35.x86_64.rpm --setopt=install_weak_deps=False
Status code: 404 for https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.16.13/200.fc35/x86_64/kernel-headers-5.16.13-200.fc35.x86_64.rpm (IP: 38.145.60.20)
+ echo 'Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in koji.'
Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in koji.
+ echo 'Installing generic version...'
Installing generic version...
+ dnf -q -y install kernel-headers
+ echo 'Installing Linux development files...'
Installing Linux development files...
+ dnf -q -y install kernel-devel-5.16.13-200.fc35.x86_64
+ ln -s /usr/src/kernels/5.16.13-200.fc35.x86_64 /lib/modules/5.16.13-200.fc35.x86_64/build
+ echo 'Installing Linux kernel module files...'
Installing Linux kernel module files...
+ dnf -q -y download kernel-core-5.16.13-200.fc35.x86_64
No package kernel-core-5.16.13-200.fc35.x86_64 available.
Exiting due to strict setting.
Error: No package kernel-core-5.16.13-200.fc35.x86_64 available.
+ echo 'Failed to find kernel-core-5.16.13-200.fc35.x86_64 in repositories.'
Failed to find kernel-core-5.16.13-200.fc35.x86_64 in repositories.
+ echo 'Trying to download kernel-core from koji...'
Trying to download kernel-core from koji...
+ KOJI_KERNEL_CORE_RPM=https://kojipkgs.fedoraproject.org/packages/kernel/5.16.13/200.fc35/x86_64/kernel-core-5.16.13-200.fc35.x86_64.rpm
+ dnf -q -y download https://kojipkgs.fedoraproject.org/packages/kernel/5.16.13/200.fc35/x86_64/kernel-core-5.16.13-200.fc35.x86_64.rpm
+ cat ./kernel-core-5.16.13-200.fc35.x86_64.rpm
+ rpm2cpio
+ cpio -idm --quiet
+ rm ./kernel-core-5.16.13-200.fc35.x86_64.rpm
+ mv lib/modules/5.16.13-200.fc35.x86_64/modules.block lib/modules/5.16.13-200.fc35.x86_64/modules.builtin lib/modules/5.16.13-200.fc35.x86_64/modules.builtin.modinfo lib/modules/5.16.13-200.fc35.x86_64/modules.drm lib/modules/5.16.13-200.fc35.x86_64/modules.modesetting lib/modules/5.16.13-200.fc35.x86_64/modules.networking lib/modules/5.16.13-200.fc35.x86_64/modules.order /lib/modules/5.16.13-200.fc35.x86_64
+ mv lib/modules/5.16.13-200.fc35.x86_64/kernel /lib/modules/5.16.13-200.fc35.x86_64
+ depmod 5.16.13-200.fc35.x86_64
+ echo 'Generating Linux kernel version string...'
Generating Linux kernel version string...
+ extract-vmlinux ./lib/modules/5.16.13-200.fc35.x86_64/vmlinuz
+ strings
+ sed 's/^\(.*\)\s\+(.*)$/\1/'
+ grep -E '^Linux version'
extract-vmlinux: Cannot find vmlinux.
+ '[' -z '' ']'
+ echo 'Could not locate Linux kernel version string'
Could not locate Linux kernel version string
+ return 1
++ rm -rf /tmp/tmp.3GEbfgOEOZ
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0

Why does this happen, and why can't we use the existing kernel modules?

$ ls kernel/*.ko
kernel/nvidia-drm.ko      kernel/nvidia-peermem.ko  kernel/nvidia.ko
kernel/nvidia-modeset.ko  kernel/nvidia-uvm.ko
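One detail in the trace may be relevant: `_kernel_requires_package` lists `precompiled/**` inside /usr/src/nvidia-470.129.06/kernel and then returns 0, and nothing in the trace shows it looking at the loose .ko files. The sketch below mirrors only that presence check, assuming the script expects packages under a precompiled/ subdirectory. `has_precompiled` is a hypothetical name, and the real script may apply further matching that is not reproduced here.

```shell
#!/bin/sh
# Hypothetical check: succeed only if <kernel_dir>/precompiled exists and
# contains at least one entry. Loose .ko files in <kernel_dir> are ignored,
# which is the behavior the trace above appears to show.
has_precompiled() {
    kernel_dir=$1
    [ -d "$kernel_dir/precompiled" ] || return 1
    [ -n "$(ls -A "$kernel_dir/precompiled" 2>/dev/null)" ]
}

if has_precompiled /usr/src/nvidia-470.129.06/kernel; then
    echo "precompiled packages present"
else
    echo "no precompiled packages; the script would try to rebuild"
fi
```

If this reading is right, having the .ko files in kernel/ is not enough on its own, and the modules would need to be packaged (the image already ships mkprecompiled) into precompiled/ before nvidia-driver init would pick them up.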