NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Nvidia GPU-operator on OKD 4.6 cluster has had problem with FedoraCoreOS kernel-header #144

Open rupang790 opened 3 years ago

rupang790 commented 3 years ago

1. Issue or feature description

While installing the NVIDIA GPU Operator on an OKD 4.6 cluster, the driver container fails on the Koji URLs it uses to fetch kernel packages.

2. Steps to reproduce the issue

Prepare the NVIDIA driver for Fedora CoreOS 33 in a local repository, then install gpu-operator with Helm.

3. Information to attach (optional if deemed irrelevant)

I recently used the GPU Operator on OKD 4.5, and now I would like to use it on the latest stable OKD 4.6. When I tried to install it on my OKD 4.6 cluster, the 'nvidia-driver-daemonset' pod logged the following errors:

Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ KOJI_URL=https://kojipkgs.fedoraproject.org
+ KOJI_KVER=5.9.16
+ KOJI_MAJMIN=5.9
+ KOJI_PATCH=5.9.16-200.fc33
+ KOJI_PATCH=200.fc33
+ KOJI_ARCH=x86_64
+ KOJI_KERNEL_HEADERS=https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.9.16/200.fc33/x86_64/kernel-headers-5.9.16-200.fc33.x86_64.rpm
+ KOJI_KERNEL_DEVEL=https://kojipkgs.fedoraproject.org/packages/kernel/5.9.16/200.fc33/x86_64/kernel-devel-5.9.16-200.fc33.x86_64.rpm
+ KOJI_KERNEL_CORE=https://kojipkgs.fedoraproject.org/packages/kernel/5.9.16/200.fc33/x86_64/kernel-core-5.9.16-200.fc33.x86_64.rpm
+ dnf -q -y install kernel-headers-5.9.16-200.fc33.x86_64 --setopt=install_weak_deps=False --best
Error: Unable to find a match: kernel-headers-5.9.16-200.fc33.x86_64
+ dnf -y install https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.9.16/200.fc33/x86_64/kernel-headers-5.9.16-200.fc33.x86_64.rpm --setopt=install_weak_deps=False --best
Status code: 404 for https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.9.16/200.fc33/x86_64/kernel-headers-5.9.16-200.fc33.x86_64.rpm (IP: 38.145.60.21)
+ dnf -y install 'kernel-headers-5.9.*' --setopt=install_weak_deps=False --best
Last metadata expiration check: 0:00:13 ago on Wed Feb 10 01:58:01 2021.
No match for argument: kernel-headers-5.9.*
Error: Unable to find a match: kernel-headers-5.9.*
++ rm -rf /tmp/tmp.YOd4ASQKKT
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
Stopping NVIDIA persistence daemon...
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
Unloading NVIDIA driver kernel modules...
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0

According to the error logs, the pages under https://kojipkgs.fedoraproject.org/packages/kernel/5.9.16 could not be found. Opening them in my browser gives the same result: no '5.9.16' version is listed.

Where can I change the version values so that an existing version is used? Or do I have to wait until that version is published?
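
For readers following the trace, here is a minimal sketch of how the three Koji URLs appear to be derived from the kernel release string. It reproduces the values visible in the log (KOJI_KVER, KOJI_PATCH, KOJI_ARCH) but is not the driver container's actual script; the function name `koji_urls` and its local variable names are hypothetical.

```shell
# Sketch only: reconstructs the URLs seen in the trace from a kernel release
# string such as the output of `uname -r`. Not the driver container's code.
koji_urls() {
    local release="$1"                  # e.g. 5.9.16-200.fc33.x86_64
    local base="https://kojipkgs.fedoraproject.org"
    local kver="${release%%-*}"         # 5.9.16   (text before the first '-')
    local arch="${release##*.}"         # x86_64   (text after the last '.')
    local rest="${release#*-}"          # 200.fc33.x86_64
    local patch="${rest%.*}"            # 200.fc33 (rest without the arch)
    local nvr="${kver}-${patch}.${arch}"

    # kernel-headers lives under its own package directory; kernel-devel and
    # kernel-core are both published under the "kernel" package.
    echo "${base}/packages/kernel-headers/${kver}/${patch}/${arch}/kernel-headers-${nvr}.rpm"
    echo "${base}/packages/kernel/${kver}/${patch}/${arch}/kernel-devel-${nvr}.rpm"
    echo "${base}/packages/kernel/${kver}/${patch}/${arch}/kernel-core-${nvr}.rpm"
}

# Example: koji_urls "$(uname -r)"
```

Because the release string comes from the running kernel, there is no separate version setting to edit. Each printed URL can be probed by hand with `curl -fsI <url>`, which exits non-zero on a 404.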

shivamerla commented 3 years ago

@rupang790 The driver container always compiles for the kernel currently running on the host. We cannot change this, so you would need to use worker nodes whose kernel has packages available. I see kernel-headers available for 5.9.13.

https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.9.13/
https://kojipkgs.fedoraproject.org/packages/kernel/5.9.13/

Source code to build images is here if you need other changes: https://gitlab.com/nvidia/container-images/driver/-/tree/master/fedora
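
Since the driver container builds for whatever kernel the host runs, it can help to check in advance which node kernels have packages on Koji. The following is a rough sketch under stated assumptions: it treats a successful HEAD request for the kernel-headers RPM as a proxy for "packages are available", and the function names `probe_url`, `headers_url` and `report_kernels` are hypothetical, not part of the GPU Operator or the driver image.

```shell
# Sketch only. Reads kernel release strings (one per line) on stdin and
# reports whether the matching kernel-headers RPM can be fetched from Koji.

probe_url() {
    # -f: fail on HTTP errors such as 404, -s: silent, -I: HEAD request only
    curl -fsI "$1" >/dev/null 2>&1 </dev/null
}

headers_url() {
    local release="$1"                  # e.g. 5.9.13-200.fc33.x86_64
    local kver="${release%%-*}"
    local arch="${release##*.}"
    local rest="${release#*-}"
    local patch="${rest%.*}"
    echo "https://kojipkgs.fedoraproject.org/packages/kernel-headers/${kver}/${patch}/${arch}/kernel-headers-${kver}-${patch}.${arch}.rpm"
}

report_kernels() {
    local release
    while IFS= read -r release; do
        [ -n "$release" ] || continue   # skip blank lines
        if probe_url "$(headers_url "$release")"; then
            echo "${release} available"
        else
            echo "${release} missing"
        fi
    done
}
```

One possible way to feed it, assuming `kubectl` (or `oc`) access to the cluster, is to list the kernel version each node reports:

kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.kernelVersion}{"\n"}{end}' | report_kernels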

rupang790 commented 3 years ago

@shivamerla, Fedora CoreOS 33 is the default OS for an OKD 4.6 cluster, so I cannot switch to another OS with an available kernel version. (In fact, OKD changes the node's OS version, and with it the kernel, during installation.)

So if I manually download the kernel-* RPMs onto the node and deploy, will the GPU Operator work? Or does the log show other problems? The nvidia-driver-daemonset log is attached below:

+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=450.51.05
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=5.10.12-200.fc33.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ init
+ echo -e '\n========== NVIDIA Software Installer ==========\n'

========== NVIDIA Software Installer ==========

+ echo -e 'Starting installation of NVIDIA driver version 450.51.05 for Linux kernel version 5.10.12-200.fc33.x86_64\n'
Starting installation of NVIDIA driver version 450.51.05 for Linux kernel version 5.10.12-200.fc33.x86_64

+ exec
+ flock -n 3
+ echo 2304362
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
Unloading NVIDIA driver kernel modules...
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
Checking NVIDIA driver packages...
+ echo 'Checking NVIDIA driver packages...'
+ [[ ! -d /usr/src/nvidia-450.51.05/kernel ]]
+ cd /usr/src/nvidia-450.51.05/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/5.10.12-200.fc33.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ yum -q makecache
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.wOhJUJdwsn
+ trap 'rm -rf /tmp/tmp.wOhJUJdwsn' EXIT
+ cd /tmp/tmp.wOhJUJdwsn
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ rm -rf /lib/modules/5.10.12-200.fc33.x86_64
+ mkdir -p /lib/modules/5.10.12-200.fc33.x86_64/proc
+ echo 'Installing Linux kernel headers...'
Installing Linux kernel headers...
+ KOJI_URL=https://kojipkgs.fedoraproject.org
+ KOJI_KVER=5.10.12
+ KOJI_MAJMIN=5.10
+ KOJI_PATCH=5.10.12-200.fc33
+ KOJI_PATCH=200.fc33
+ KOJI_ARCH=x86_64
+ KOJI_KERNEL_HEADERS=https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.10.12/200.fc33/x86_64/kernel-headers-5.10.12-200.fc33.x86_64.rpm
+ KOJI_KERNEL_DEVEL=https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-devel-5.10.12-200.fc33.x86_64.rpm
+ KOJI_KERNEL_CORE=https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-core-5.10.12-200.fc33.x86_64.rpm
+ dnf -q -y install kernel-headers-5.10.12-200.fc33.x86_64 --setopt=install_weak_deps=False --best
Error: Unable to find a match: kernel-headers-5.10.12-200.fc33.x86_64
+ dnf -y install https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.10.12/200.fc33/x86_64/kernel-headers-5.10.12-200.fc33.x86_64.rpm --setopt=install_weak_deps=False --best
Status code: 404 for https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.10.12/200.fc33/x86_64/kernel-headers-5.10.12-200.fc33.x86_64.rpm (IP: 38.145.60.20)
+ dnf -y install 'kernel-headers-5.10.*' --setopt=install_weak_deps=False --best
Last metadata expiration check: 0:00:19 ago on Mon Mar 15 01:44:33 2021.
Package kernel-headers-5.10.20-200.fc33.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
+ dnf -q -y install kernel-devel-5.10.12-200.fc33.x86_64 --setopt=install_weak_deps=False --best
Error: Unable to find a match: kernel-devel-5.10.12-200.fc33.x86_64
+ dnf -y install https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-devel-5.10.12-200.fc33.x86_64.rpm --setopt=install_weak_deps=False --best
+ ln -s /usr/src/kernels/5.10.12-200.fc33.x86_64 /lib/modules/5.10.12-200.fc33.x86_64/build
Installing Linux kernel module files...
+ echo 'Installing Linux kernel module files...'
++ pwd
+ dnf install -y -q kernel-core-5.10.12-200.fc33.x86_64 --setopt=install_weak_deps=False --best --downloadonly --downloaddir=/tmp/tmp.wOhJUJdwsn
Error: Unable to find a match: kernel-core-5.10.12-200.fc33.x86_64
++ pwd
+ curl https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-core-5.10.12-200.fc33.x86_64.rpm -o /tmp/tmp.wOhJUJdwsn/kernel-core-5.10.12-200.fc33.x86_64.rpm
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

100 33.1M  100 33.1M    0     0  1010k      0  0:00:33  0:00:33 --:--:--  799k
+ cat ./kernel-core-5.10.12-200.fc33.x86_64.rpm
+ rpm2cpio
+ cpio -idm --quiet
+ rm ./kernel-core-5.10.12-200.fc33.x86_64.rpm
+ mv lib/modules/5.10.12-200.fc33.x86_64/modules.block lib/modules/5.10.12-200.fc33.x86_64/modules.builtin lib/modules/5.10.12-200.fc33.x86_64/modules.builtin.alias.bin lib/modules/5.10.12-200.fc33.x86_64/modules.builtin.modinfo lib/modules/5.10.12-200.fc33.x86_64/modules.drm lib/modules/5.10.12-200.fc33.x86_64/modules.modesetting lib/modules/5.10.12-200.fc33.x86_64/modules.networking lib/modules/5.10.12-200.fc33.x86_64/modules.order /lib/modules/5.10.12-200.fc33.x86_64
+ mv lib/modules/5.10.12-200.fc33.x86_64/kernel /lib/modules/5.10.12-200.fc33.x86_64
+ depmod 5.10.12-200.fc33.x86_64