I'm currently trying to create an NVIDIA driver module for Fedora, heavily based on the project available at https://gitlab.com/nvidia/container-images/driver/-/tree/master/fedora, and so far I have found some bugs or misleading descriptions of the arguments available in nvidia-installer:
1. --kernel-install-path does not work: the modules are not copied to the desired location (they stay in the current directory instead).
2. --no-kernel-module not only skips installing the module (which is correct) but also skips compiling it (which is not clear from the help menu).
For item 2, the problem is that there is no option to just compile the modules without trying to install them, which forces the user to hack around the nvidia-installer command.
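One possible workaround for the compile-only case, offered as a sketch rather than a confirmed recipe, is to extract the .run package and call the bundled kernel Makefile directly. This assumes the package name follows NVIDIA-Linux-x86_64-<version>.run, that --extract-only unpacks into a directory with the same base name, and that the Makefile in its kernel/ directory honors SYSSRC as the path to the kernel build tree. build_cmds is a hypothetical helper that only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Print (not run) the commands for a compile-only build of the kernel modules.
# Assumptions: the .run file is named NVIDIA-Linux-x86_64-<version>.run,
# --extract-only unpacks into a directory of the same base name, and the
# kernel/ Makefile honors SYSSRC. build_cmds is a hypothetical helper.
build_cmds() {
    driver_version=$1
    kernel_version=$2
    pkg="NVIDIA-Linux-x86_64-${driver_version}"
    # Step 1: unpack the installer without running it.
    echo "sh ${pkg}.run --extract-only"
    # Step 2: build the modules against the target kernel's build tree.
    echo "make -C ${pkg}/kernel SYSSRC=/lib/modules/${kernel_version}/build modules"
}

build_cmds 470.129.06 5.16.13-200.fc35.x86_64
```

If the assumptions hold, the resulting .ko files would stay in the kernel/ directory and could be copied wherever needed, without going through the installer's install step.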
Also, since we are on the subject of building the NVIDIA kernel modules: even though I have precompiled the kernel modules (which are copied to /usr/src/nvidia-$DRIVER_VERSION as per the Dockerfile above), when I use the image on the target VM (which has one NVIDIA T4 GPU), the container runs nvidia-driver init, which does not recognize the existing modules (that come with the image) and tries to recompile everything again (without success), because it tries to fetch dependencies that are not available in the Fedora Koji repo.
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=470.129.06
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ KOJI_BASE_URL=https://kojipkgs.fedoraproject.org
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=5.16.13-200.fc35.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ init
+ echo -e '\n========== NVIDIA Software Installer ==========\n'
========== NVIDIA Software Installer ==========
+ echo -e 'Starting installation of NVIDIA driver version 470.129.06 for Linux kernel version 5.16.13-200.fc35.x86_64\n'
Starting installation of NVIDIA driver version 470.129.06 for Linux kernel version 5.16.13-200.fc35.x86_64
+ exec
+ flock -n 3
+ echo 1128694
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
Checking NVIDIA driver packages...
+ [[ ! -d /usr/src/nvidia-470.129.06/kernel ]]
+ cd /usr/src/nvidia-470.129.06/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/5.16.13-200.fc35.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ dnf -q makecache
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.3GEbfgOEOZ
+ trap 'rm -rf /tmp/tmp.3GEbfgOEOZ' EXIT
+ cd /tmp/tmp.3GEbfgOEOZ
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ mkdir -p /lib/modules/5.16.13-200.fc35.x86_64/proc
+ KERNEL_RPM_VERSION=5.16.13
+ KERNEL_RPM_RELEASE=5.16.13-200.fc35
+ KERNEL_RPM_RELEASE=200.fc35
+ KERNEL_RPM_ARCH=x86_64
+ echo 'Installing Linux kernel headers...'
Installing Linux kernel headers...
+ dnf -q -y install kernel-headers-5.16.13-200.fc35.x86_64
Error: Unable to find a match: kernel-headers-5.16.13-200.fc35.x86_64
+ echo 'Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in repositories.'
Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in repositories.
+ echo 'Trying to download kernel-headers from koji...'
Trying to download kernel-headers from koji...
+ KOJI_KERNEL_HEADERS_RPM=https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.16.13/200.fc35/x86_64/kernel-headers-5.16.13-200.fc35.x86_64.rpm
+ dnf -q -y install https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.16.13/200.fc35/x86_64/kernel-headers-5.16.13-200.fc35.x86_64.rpm --setopt=install_weak_deps=False
Status code: 404 for https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.16.13/200.fc35/x86_64/kernel-headers-5.16.13-200.fc35.x86_64.rpm (IP: 38.145.60.20)
+ echo 'Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in koji.'
Failed to find kernel-headers-5.16.13-200.fc35.x86_64 in koji.
+ echo 'Installing generic version...'
Installing generic version...
+ dnf -q -y install kernel-headers
+ echo 'Installing Linux development files...'
Installing Linux development files...
+ dnf -q -y install kernel-devel-5.16.13-200.fc35.x86_64
+ ln -s /usr/src/kernels/5.16.13-200.fc35.x86_64 /lib/modules/5.16.13-200.fc35.x86_64/build
+ echo 'Installing Linux kernel module files...'
Installing Linux kernel module files...
+ dnf -q -y download kernel-core-5.16.13-200.fc35.x86_64
No package kernel-core-5.16.13-200.fc35.x86_64 available.
Exiting due to strict setting.
Error: No package kernel-core-5.16.13-200.fc35.x86_64 available.
+ echo 'Failed to find kernel-core-5.16.13-200.fc35.x86_64 in repositories.'
Failed to find kernel-core-5.16.13-200.fc35.x86_64 in repositories.
+ echo 'Trying to download kernel-core from koji...'
Trying to download kernel-core from koji...
+ KOJI_KERNEL_CORE_RPM=https://kojipkgs.fedoraproject.org/packages/kernel/5.16.13/200.fc35/x86_64/kernel-core-5.16.13-200.fc35.x86_64.rpm
+ dnf -q -y download https://kojipkgs.fedoraproject.org/packages/kernel/5.16.13/200.fc35/x86_64/kernel-core-5.16.13-200.fc35.x86_64.rpm
+ cat ./kernel-core-5.16.13-200.fc35.x86_64.rpm
+ rpm2cpio
+ cpio -idm --quiet
+ rm ./kernel-core-5.16.13-200.fc35.x86_64.rpm
+ mv lib/modules/5.16.13-200.fc35.x86_64/modules.block lib/modules/5.16.13-200.fc35.x86_64/modules.builtin lib/modules/5.16.13-200.fc35.x86_64/modules.builtin.modinfo lib/modules/5.16.13-200.fc35.x86_64/modules.drm lib/modules/5.16.13-200.fc35.x86_64/modules.modesetting lib/modules/5.16.13-200.fc35.x86_64/modules.networking lib/modules/5.16.13-200.fc35.x86_64/modules.order /lib/modules/5.16.13-200.fc35.x86_64
+ mv lib/modules/5.16.13-200.fc35.x86_64/kernel /lib/modules/5.16.13-200.fc35.x86_64
+ depmod 5.16.13-200.fc35.x86_64
+ echo 'Generating Linux kernel version string...'
Generating Linux kernel version string...
+ extract-vmlinux ./lib/modules/5.16.13-200.fc35.x86_64/vmlinuz
+ strings
+ sed 's/^\(.*\)\s\+(.*)$/\1/'
+ grep -E '^Linux version'
extract-vmlinux: Cannot find vmlinux.
+ '[' -z '' ']'
+ echo 'Could not locate Linux kernel version string'
Could not locate Linux kernel version string
+ return 1
++ rm -rf /tmp/tmp.3GEbfgOEOZ
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0
Why does this happen, and why can't we use the existing kernel modules?
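One hint may be in the trace itself: the line ls -d -1 'precompiled/**' shows the glob printed literally, which in a bash xtrace most likely means it matched nothing, so the script found no precompiled packages under /usr/src/nvidia-470.129.06/kernel and fell through to a full build. A minimal sketch for checking this inside the image follows. has_precompiled is a hypothetical helper, and the precompiled/ layout is inferred from the trace, not confirmed against the script source.

```shell
#!/bin/sh
# Report whether a driver source tree contains any precompiled packages.
# has_precompiled is a hypothetical helper; the precompiled/ subdirectory
# layout is an assumption inferred from the trace.
has_precompiled() {
    kernel_dir=$1
    for entry in "${kernel_dir}"/precompiled/*; do
        # An unmatched glob stays literal, so test that the entry exists.
        if [ -e "${entry}" ]; then
            echo "found: ${entry##*/}"
            return 0
        fi
    done
    echo "no precompiled packages in ${kernel_dir}"
    return 1
}

has_precompiled "/usr/src/nvidia-${DRIVER_VERSION:-470.129.06}/kernel" || true
```

If this reports no packages, the modules copied into the image may not be in the form the init script looks for, which would explain why it tries to rebuild them.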
To reproduce, replace the Dockerfile below in the https://gitlab.com/nvidia/container-images/driver/-/tree/master/fedora repo, edit the Makefile to include the fedora distro, and build with:
make build-fedora-470.129.06