Open rupang790 opened 3 years ago
@rupang790 The driver container always compiles for the kernel currently running on the host. We cannot change this, so you would need to use worker nodes whose kernel packages are available. I see kernel-headers available for 5.9.13.
https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.9.13/
https://kojipkgs.fedoraproject.org/packages/kernel/5.9.13/
Source code to build images is here if you need other changes: https://gitlab.com/nvidia/container-images/driver/-/tree/master/fedora
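A quick way to check whether Koji carries packages for a given kernel is to build the package URLs and probe them. The sketch below is only an illustration, not the driver container's actual script: the function names `koji_urls` and `check_koji` are hypothetical, and the URL layout is copied from the KOJI_* values printed in the driver log in this thread.

```shell
#!/usr/bin/env bash
# Sketch (assumption): derive Koji package URLs from a Fedora kernel release
# string, using the same layout as the KOJI_* variables in the driver log.

koji_urls() {
    local kernel_version="$1"               # e.g. 5.10.12-200.fc33.x86_64
    local base="https://kojipkgs.fedoraproject.org/packages"
    local kver="${kernel_version%%-*}"      # 5.10.12
    local rest="${kernel_version#*-}"       # 200.fc33.x86_64
    local arch="${rest##*.}"                # x86_64
    local patch="${rest%.*}"                # 200.fc33
    echo "${base}/kernel-headers/${kver}/${patch}/${arch}/kernel-headers-${kernel_version}.rpm"
    echo "${base}/kernel/${kver}/${patch}/${arch}/kernel-devel-${kernel_version}.rpm"
    echo "${base}/kernel/${kver}/${patch}/${arch}/kernel-core-${kernel_version}.rpm"
}

check_koji() {
    local url
    for url in $(koji_urls "$1"); do
        # -f: treat HTTP errors as failure, -s: silent, -I: HEAD request, -L: follow redirects
        if curl -fsIL -o /dev/null "$url"; then
            echo "OK      $url"
        else
            echo "MISSING $url"
        fi
    done
}
```

Running `check_koji "$(uname -r)"` on a worker node would then show which of the three RPMs the driver container could expect to download for that kernel.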
@shivamerla, Fedora CoreOS 33 is the default for an OKD 4.6 cluster, so I cannot switch to another OS that uses an available kernel version. (In fact, OKD changes the node's OS version, and the kernel with it, during installation.)
So if I manually download the kernel-* RPMs onto the node and deploy, will the GPU Operator work?
Or does the log show other problems?
I attached the nvidia-driver-daemonset log below:
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=450.51.05
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=5.10.12-200.fc33.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ init
+ echo -e '\n========== NVIDIA Software Installer ==========\n'
========== NVIDIA Software Installer ==========
+ echo -e 'Starting installation of NVIDIA driver version 450.51.05 for Linux kernel version 5.10.12-200.fc33.x86_64\n'
Starting installation of NVIDIA driver version 450.51.05 for Linux kernel version 5.10.12-200.fc33.x86_64
+ exec
+ flock -n 3
+ echo 2304362
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
Unloading NVIDIA driver kernel modules...
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
Unmounting NVIDIA driver rootfs...
+ echo 'Unmounting NVIDIA driver rootfs...'
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _kernel_requires_package
+ local proc_mount_arg=
Checking NVIDIA driver packages...
+ echo 'Checking NVIDIA driver packages...'
+ [[ ! -d /usr/src/nvidia-450.51.05/kernel ]]
+ cd /usr/src/nvidia-450.51.05/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/5.10.12-200.fc33.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
Updating the package cache...
+ yum -q makecache
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.wOhJUJdwsn
+ trap 'rm -rf /tmp/tmp.wOhJUJdwsn' EXIT
+ cd /tmp/tmp.wOhJUJdwsn
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ rm -rf /lib/modules/5.10.12-200.fc33.x86_64
+ mkdir -p /lib/modules/5.10.12-200.fc33.x86_64/proc
+ echo 'Installing Linux kernel headers...'
Installing Linux kernel headers...
+ KOJI_URL=https://kojipkgs.fedoraproject.org
+ KOJI_KVER=5.10.12
+ KOJI_MAJMIN=5.10
+ KOJI_PATCH=5.10.12-200.fc33
+ KOJI_PATCH=200.fc33
+ KOJI_ARCH=x86_64
+ KOJI_KERNEL_HEADERS=https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.10.12/200.fc33/x86_64/kernel-headers-5.10.12-200.fc33.x86_64.rpm
+ KOJI_KERNEL_DEVEL=https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-devel-5.10.12-200.fc33.x86_64.rpm
+ KOJI_KERNEL_CORE=https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-core-5.10.12-200.fc33.x86_64.rpm
+ dnf -q -y install kernel-headers-5.10.12-200.fc33.x86_64 --setopt=install_weak_deps=False --best
Error: Unable to find a match: kernel-headers-5.10.12-200.fc33.x86_64
+ dnf -y install https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.10.12/200.fc33/x86_64/kernel-headers-5.10.12-200.fc33.x86_64.rpm --setopt=install_weak_deps=False --best
Status code: 404 for https://kojipkgs.fedoraproject.org/packages/kernel-headers/5.10.12/200.fc33/x86_64/kernel-headers-5.10.12-200.fc33.x86_64.rpm (IP: 38.145.60.20)
+ dnf -y install 'kernel-headers-5.10.*' --setopt=install_weak_deps=False --best
Last metadata expiration check: 0:00:19 ago on Mon Mar 15 01:44:33 2021.
Package kernel-headers-5.10.20-200.fc33.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
+ dnf -q -y install kernel-devel-5.10.12-200.fc33.x86_64 --setopt=install_weak_deps=False --best
Error: Unable to find a match: kernel-devel-5.10.12-200.fc33.x86_64
+ dnf -y install https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-devel-5.10.12-200.fc33.x86_64.rpm --setopt=install_weak_deps=False --best
+ ln -s /usr/src/kernels/5.10.12-200.fc33.x86_64 /lib/modules/5.10.12-200.fc33.x86_64/build
Installing Linux kernel module files...
+ echo 'Installing Linux kernel module files...'
++ pwd
+ dnf install -y -q kernel-core-5.10.12-200.fc33.x86_64 --setopt=install_weak_deps=False --best --downloadonly --downloaddir=/tmp/tmp.wOhJUJdwsn
Error: Unable to find a match: kernel-core-5.10.12-200.fc33.x86_64
++ pwd
+ curl https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/x86_64/kernel-core-5.10.12-200.fc33.x86_64.rpm -o /tmp/tmp.wOhJUJdwsn/kernel-core-5.10.12-200.fc33.x86_64.rpm
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 33.1M 0 8192 0 0 10163 0 0:57:04 --:--:-- 0:57:04 10163
1 33.1M 1 678k 0 0 371k 0 0:01:31 0:00:01 0:01:30 371k
8 33.1M 8 2782k 0 0 1004k 0 0:00:33 0:00:02 0:00:31 1003k
13 33.1M 13 4638k 0 0 1255k 0 0:00:27 0:00:03 0:00:24 1255k
19 33.1M 19 6654k 0 0 1411k 0 0:00:24 0:00:04 0:00:20 1411k
25 33.1M 25 8742k 0 0 1536k 0 0:00:22 0:00:05 0:00:17 1787k
32 33.1M 32 10.6M 0 0 1628k 0 0:00:20 0:00:06 0:00:14 2100k
38 33.1M 38 12.8M 0 0 1712k 0 0:00:19 0:00:07 0:00:12 2111k
44 33.1M 44 14.6M 0 0 1723k 0 0:00:19 0:00:08 0:00:11 2069k
50 33.1M 50 16.6M 0 0 1764k 0 0:00:19 0:00:09 0:00:10 2098k
55 33.1M 55 18.3M 0 0 1759k 0 0:00:19 0:00:10 0:00:09 2013k
58 33.1M 58 19.5M 0 0 1710k 0 0:00:19 0:00:11 0:00:08 1820k
61 33.1M 61 20.3M 0 0 1641k 0 0:00:20 0:00:12 0:00:08 1532k
63 33.1M 63 21.1M 0 0 1584k 0 0:00:21 0:00:13 0:00:08 1341k
65 33.1M 65 21.8M 0 0 1519k 0 0:00:22 0:00:14 0:00:08 1049k
67 33.1M 67 22.5M 0 0 1467k 0 0:00:23 0:00:15 0:00:08 842k
69 33.1M 69 23.0M 0 0 1414k 0 0:00:24 0:00:16 0:00:08 726k
71 33.1M 71 23.7M 0 0 1376k 0 0:00:24 0:00:17 0:00:07 700k
73 33.1M 73 24.4M 0 0 1339k 0 0:00:25 0:00:18 0:00:07 667k
75 33.1M 75 25.0M 0 0 1301k 0 0:00:26 0:00:19 0:00:07 655k
77 33.1M 77 25.6M 0 0 1267k 0 0:00:26 0:00:20 0:00:06 638k
78 33.1M 78 26.2M 0 0 1238k 0 0:00:27 0:00:21 0:00:06 645k
80 33.1M 80 26.6M 0 0 1203k 0 0:00:28 0:00:22 0:00:06 592k
81 33.1M 81 27.1M 0 0 1172k 0 0:00:28 0:00:23 0:00:05 550k
82 33.1M 82 27.5M 0 0 1141k 0 0:00:29 0:00:24 0:00:05 513k
84 33.1M 84 27.9M 0 0 1114k 0 0:00:30 0:00:25 0:00:05 481k
85 33.1M 85 28.4M 0 0 1090k 0 0:00:31 0:00:26 0:00:05 450k
87 33.1M 87 28.8M 0 0 1067k 0 0:00:31 0:00:27 0:00:04 450k
88 33.1M 88 29.3M 0 0 1047k 0 0:00:32 0:00:28 0:00:04 452k
89 33.1M 89 29.8M 0 0 1028k 0 0:00:33 0:00:29 0:00:04 471k
91 33.1M 91 30.4M 0 0 1014k 0 0:00:33 0:00:30 0:00:03 502k
93 33.1M 93 31.1M 0 0 1006k 0 0:00:33 0:00:31 0:00:02 555k
96 33.1M 96 32.0M 0 0 1004k 0 0:00:33 0:00:32 0:00:01 653k
100 33.1M 100 33.1M 0 0 1010k 0 0:00:33 0:00:33 --:--:-- 799k
+ cat ./kernel-core-5.10.12-200.fc33.x86_64.rpm
+ rpm2cpio
+ cpio -idm --quiet
+ rm ./kernel-core-5.10.12-200.fc33.x86_64.rpm
+ mv lib/modules/5.10.12-200.fc33.x86_64/modules.block lib/modules/5.10.12-200.fc33.x86_64/modules.builtin lib/modules/5.10.12-200.fc33.x86_64/modules.builtin.alias.bin lib/modules/5.10.12-200.fc33.x86_64/modules.builtin.modinfo lib/modules/5.10.12-200.fc33.x86_64/modules.drm lib/modules/5.10.12-200.fc33.x86_64/modules.modesetting lib/modules/5.10.12-200.fc33.x86_64/modules.networking lib/modules/5.10.12-200.fc33.x86_64/modules.order /lib/modules/5.10.12-200.fc33.x86_64
+ mv lib/modules/5.10.12-200.fc33.x86_64/kernel /lib/modules/5.10.12-200.fc33.x86_64
+ depmod 5.10.12-200.fc33.x86_64
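To pick the failing steps out of a long trace like the one above, it can help to pair each error line with the traced command that preceded it. This is a minimal sketch under the assumption that the log comes from a `set -x` style trace where commands start with `+`; the function name `failed_steps` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (assumption): print each dnf "Error:" line or HTTP 404 line together
# with the most recent traced command ("+ ..." or "++ ...") that came before it.

failed_steps() {
    awk '
        /^[+]+ / { last_cmd = $0; next }      # remember the latest traced command
        /^Error:/ || /^Status code: 404/ {    # failure markers seen in this log
            print last_cmd
            print "    -> " $0
        }
    ' "$@"
}
```

For example, `kubectl logs -n NAMESPACE POD_NAME | failed_steps`. On the trace above this would flag the three `dnf` lookups by package name (kernel-headers, kernel-devel, kernel-core) and the 404 for the kernel-headers Koji URL.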
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
[ ] i2c_core and ipmi_msghandler loaded on the nodes?
[ ] kubectl describe clusterpolicies --all-namespaces
1. Issue or feature description
Installing the NVIDIA GPU Operator on an OKD 4.6 cluster fails because of a problem with the Koji URLs.
2. Steps to reproduce the issue
Prepare the NVIDIA driver image for Fedora CoreOS 33 in a local repository and install gpu-operator with Helm.
3. Information to attach (optional if deemed irrelevant)
[ ] kubernetes pods status:
kubectl get pods --all-namespaces
[ ] kubernetes daemonset status:
kubectl get ds --all-namespaces
[ ] If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
[ ] If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
[ ] Output of running a container on the GPU machine:
docker run -it alpine echo foo
[ ] Docker configuration file:
cat /etc/docker/daemon.json
[ ] Docker runtime configuration:
docker info | grep runtime
[ ] NVIDIA shared directory:
ls -la /run/nvidia
[ ] NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
[ ] NVIDIA driver directory:
ls -la /run/nvidia/driver
[ ] kubelet logs
journalctl -u kubelet > kubelet.logs
I recently used the GPU Operator on OKD 4.5, and now I would like to use it on the latest stable OKD 4.6. I tried to install it on my OKD 4.6 cluster, but the 'nvidia-driver-daemonset' pod logs the errors below:
According to the error logs, it seems the https://kojipkgs.fedoraproject.org/packages/kernel/5.9.16 page cannot be found. The result of opening the page in my browser is below (no '5.9.16' version is listed).
Where can I change the version values to use an existing version? Or do I have to wait for an update?
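Since the version is taken from the running kernel rather than from a setting, one way to see which kernel versions the driver container will look for is to list the kernel each node reports. The helper below is a hypothetical sketch (the name `unique_kernels` is not part of any tool); it only reduces a two-column node listing to the distinct kernel versions.

```shell
#!/usr/bin/env bash
# Sketch (assumption): read "NODE KERNEL" rows with a header line on stdin
# and print each distinct kernel version once.

unique_kernels() {
    awk 'NR > 1 && NF >= 2 { print $2 }' | sort -u
}

# Usage (needs cluster access):
#   kubectl get nodes \
#     -o custom-columns=NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion \
#     | unique_kernels
```

Each version printed is one that would need matching kernel-headers, kernel-devel and kernel-core packages to be downloadable.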