Nvidia driver install fails on pod nvidia-driver-daemonset - OpenShift 4.11

mbio16 commented 2 years ago

1. Quick Debug Checklist

[x] Are you running OpenShift v4.11.13?
[x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?

1. Issue or feature description

Pod nvidia-driver-daemonset contains the container openshift-driver-toolkit-ctr, which tries to compile the NVIDIA GPU driver, version 515.65.01. The driver install fails. Logs from the container:

+ '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
+ exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
Running dtk-build-driver
Start building nvidia.ko driver ...
DRIVER_ARCH is x86_64
+ set -o allexport
+ source /mnt/shared-nvidia-driver-toolkit/env
++ LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
++ DRIVER_VERSION=515.65.01
++ GPU_OPERATOR_SERVICE_PORT_GPU_OPERATOR_METRICS=8080
++ HOSTNAME=nvidia-driver-daemonset-411.86.202210311708-0-c6qqz
++ NVIDIA_NODE_STATUS_EXPORTER_PORT_8000_TCP=tcp://172.30.92.123:8000
++ NVIDIA_DCGM_EXPORTER_PORT_9400_TCP_ADDR=172.30.202.100
++ GPU_OPERATOR_PORT_8080_TCP=tcp://172.30.46.13:8080
++ DRIVER_TYPE=passthrough
++ NVIDIA_DCGM_EXPORTER_SERVICE_HOST=172.30.202.100
++ NVIDIA_VISIBLE_DEVICES=void
++ NVIDIA_NODE_STATUS_EXPORTER_SERVICE_PORT_NODE_STATUS=8000
++ KUBERNETES_PORT_443_TCP_PROTO=tcp
++ KUBERNETES_PORT_443_TCP_ADDR=172.30.0.1
++ container=oci
++ KUBERNETES_PORT=tcp://172.30.0.1:443
++ PWD=/drivers
++ NVARCH=x86_64
++ HOME=/root
++ VGPU_LICENSE_SERVER_TYPE=FNE
++ GPU_OPERATOR_SERVICE_HOST=172.30.46.13
++ GPU_OPERATOR_PORT=tcp://172.30.46.13:8080
++ KUBERNETES_SERVICE_PORT_HTTPS=443
++ NVIDIA_DCGM_EXPORTER_PORT=tcp://172.30.202.100:9400
++ KUBERNETES_PORT_443_TCP_PORT=443
++ RHEL_VERSION=8.6
++ TARGETARCH=amd64
++ GPU_OPERATOR_PORT_8080_TCP_PROTO=tcp
++ GPU_OPERATOR_PORT_8080_TCP_ADDR=172.30.46.13
++ NVIDIA_DCGM_EXPORTER_PORT_9400_TCP=tcp://172.30.202.100:9400
++ NVIDIA_NODE_STATUS_EXPORTER_SERVICE_PORT=8000
++ KUBERNETES_PORT_443_TCP=tcp://172.30.0.1:443
++ NVIDIA_NODE_STATUS_EXPORTER_PORT_8000_TCP_PORT=8000
++ GPU_OPERATOR_SERVICE_PORT=8080
++ NVIDIA_NODE_STATUS_EXPORTER_PORT_8000_TCP_ADDR=172.30.92.123
++ TERM=xterm
++ CUDA_VERSION=11.7.1
++ GPU_OPERATOR_PORT_8080_TCP_PORT=8080
++ NVIDIA_DCGM_EXPORTER_PORT_9400_TCP_PORT=9400
++ NSS_SDB_USE_CACHE=no
++ NVIDIA_NODE_STATUS_EXPORTER_PORT_8000_TCP_PROTO=tcp
++ NVIDIA_DRIVER_CAPABILITIES=compute,utility
++ SHLVL=1
++ NVIDIA_DCGM_EXPORTER_SERVICE_PORT=9400
++ NVIDIA_REQUIRE_CUDA='cuda>=11.7 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511'
++ KUBERNETES_SERVICE_PORT=443
++ NV_CUDA_CUDART_VERSION=11.7.99-1
++ NVIDIA_DCGM_EXPORTER_PORT_9400_TCP_PROTO=tcp
++ NVIDIA_NODE_STATUS_EXPORTER_SERVICE_HOST=172.30.92.123
++ PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
++ NVIDIA_NODE_STATUS_EXPORTER_PORT=tcp://172.30.92.123:8000
++ NVIDIA_DCGM_EXPORTER_SERVICE_PORT_GPU_METRICS=9400
++ OPENSHIFT_VERSION=4.11
++ KUBERNETES_SERVICE_HOST=172.30.0.1
++ DISABLE_VGPU_VERSION_CHECK=false
++ _=/usr/bin/env
+ set +o allexport
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
+ rm -rf /mnt/shared-nvidia-driver-toolkit/drivers/NVIDIA-Linux-x86_64-515.65.01
+ sed s/elfutils-libelf.x86_64// -i /mnt/shared-nvidia-driver-toolkit/nvidia-driver
+ sed 's|rm -rf /lib/modules/${KERNEL_VERSION}/video||' -i /mnt/shared-nvidia-driver-toolkit/nvidia-driver
+ sed 's|rm -rf /lib/modules/${KERNEL_VERSION}||' -i /mnt/shared-nvidia-driver-toolkit/nvidia-driver
+ mkdir /mnt/shared-nvidia-driver-toolkit/bin -p
+ cp -v /mnt/shared-nvidia-driver-toolkit/nvidia-driver /mnt/shared-nvidia-driver-toolkit/common.sh /mnt/shared-nvidia-driver-toolkit/extract-vmlinux /mnt/shared-nvidia-driver-toolkit/vgpu-util /mnt/shared-nvidia-driver-toolkit/bin
'/mnt/shared-nvidia-driver-toolkit/nvidia-driver' -> '/mnt/shared-nvidia-driver-toolkit/bin/nvidia-driver'
'/mnt/shared-nvidia-driver-toolkit/common.sh' -> '/mnt/shared-nvidia-driver-toolkit/bin/common.sh'
'/mnt/shared-nvidia-driver-toolkit/extract-vmlinux' -> '/mnt/shared-nvidia-driver-toolkit/bin/extract-vmlinux'
'/mnt/shared-nvidia-driver-toolkit/vgpu-util' -> '/mnt/shared-nvidia-driver-toolkit/bin/vgpu-util'
++ which true
+ ln -s /usr/bin/true /mnt/shared-nvidia-driver-toolkit/bin/dnf --force
+ export PATH=/mnt/shared-nvidia-driver-toolkit/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ PATH=/mnt/shared-nvidia-driver-toolkit/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ cp /mnt/shared-nvidia-driver-toolkit/install.sh /tmp/
+ cd /mnt/shared-nvidia-driver-toolkit/drivers
#
# Executing nvidia-driver build script ...
+ echo '#'
+ echo '# Executing nvidia-driver build script ...'
+ echo '#'
#
+ bash -x /mnt/shared-nvidia-driver-toolkit/nvidia-driver build --tag builtin
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=515.65.01
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
DRIVER_ARCH is x86_64
+++ dirname -- /mnt/shared-nvidia-driver-toolkit/nvidia-driver
++ cd -- /mnt/shared-nvidia-driver-toolkit
++ pwd
+ SCRIPT_DIR=/mnt/shared-nvidia-driver-toolkit
+ source /mnt/shared-nvidia-driver-toolkit/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
+ '[' 3 -eq 0 ']'
+ command=build
+ shift
+ case "${command}" in
++ getopt -l accept-license,tag: -o a:t -- --tag builtin
+ options=' --tag '\''builtin'\'' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --tag '\''builtin'\'' --'
++ set -- --tag builtin --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-372.32.1.el8_6.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ PACKAGE_TAG=builtin
+ shift 2
+ for opt in ${options}
+ case "$opt" in
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ return 0
+ build
+ _prepare
+ '[' passthrough = vgpu ']'
+ sh NVIDIA-Linux-x86_64-515.65.01.run -x
Creating directory NVIDIA-Linux-x86_64-515.65.01
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 515.65.01................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
+ cd NVIDIA-Linux-x86_64-515.65.01
+ sh /tmp/install.sh nvinstall
DRIVER_ARCH is x86_64

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

+ mkdir -p /usr/src/nvidia-515.65.01
+ mv LICENSE mkprecompiled kernel /usr/src/nvidia-515.65.01
+ sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest
+ echo -e '\n========== NVIDIA Software Installer ==========\n'

========== NVIDIA Software Installer ==========

+ echo -e 'Starting installation of NVIDIA driver version 515.65.01 for Linux kernel version 4.18.0-372.32.1.el8_6.x86_64\n'
Starting installation of NVIDIA driver version 515.65.01 for Linux kernel version 4.18.0-372.32.1.el8_6.x86_64

+ _build
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
Checking NVIDIA driver packages...
+ [[ ! -d /usr/src/nvidia-515.65.01/kernel ]]
+ cd /usr/src/nvidia-515.65.01/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-372.32.1.el8_6.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' builtin '!=' builtin ']'
+ _install_prerequisites
++ mktemp -d
Installing elfutils...
+ local tmp_dir=/tmp/tmp.2drLybZT9e
+ trap 'rm -rf /tmp/tmp.2drLybZT9e' EXIT
+ cd /tmp/tmp.2drLybZT9e
+ echo 'Installing elfutils...'
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ mkdir -p /lib/modules/4.18.0-372.32.1.el8_6.x86_64/proc
Enabling RHOCP and EUS RPM repos...
+ echo 'Enabling RHOCP and EUS RPM repos...'
+ '[' -n 4.11 ']'
+ dnf config-manager --set-enabled rhocp-4.11-for-rhel-8-x86_64-rpms
+ dnf makecache --releasever=8.6
+ dnf config-manager --set-enabled rhel-8-for-x86_64-baseos-eus-rpms
+ dnf makecache --releasever=8.6
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.6 install kernel-headers-4.18.0-372.32.1.el8_6.x86_64 kernel-devel-4.18.0-372.32.1.el8_6.x86_64
+ ln -s /usr/src/kernels/4.18.0-372.32.1.el8_6.x86_64 /lib/modules/4.18.0-372.32.1.el8_6.x86_64/build
Installing Linux kernel module files...
+ echo 'Installing Linux kernel module files...'
+ dnf -q -y --releasever=8.6 install kernel-core-4.18.0-372.32.1.el8_6.x86_64
+ touch /lib/modules/4.18.0-372.32.1.el8_6.x86_64/modules.order
+ touch /lib/modules/4.18.0-372.32.1.el8_6.x86_64/modules.builtin
+ depmod 4.18.0-372.32.1.el8_6.x86_64
+ echo 'Generating Linux kernel version string...'
+ '[' amd64 = arm64 ']'
Generating Linux kernel version string...
+ extract-vmlinux /lib/modules/4.18.0-372.32.1.el8_6.x86_64/vmlinuz
+ strings
+ sed 's/^\(.*\)\s\+(.*)$/\1/'
+ grep -E '^Linux version'
+ '[' -z 'Linux version 4.18.0-372.32.1.el8_6.x86_64 (mockbuild@x86-vm-08.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Fri Oct 7 12:35:10 EDT 2022' ']'
+ mv version /lib/modules/4.18.0-372.32.1.el8_6.x86_64/proc
++ cat /lib/modules/4.18.0-372.32.1.el8_6.x86_64/proc/version
++ grep -Eo 'gcc version ([0-9\.]+)'
++ grep -Eo '([0-9\.]+)'
+ local gcc_version=8.5.0
++ rpm -qa gcc
+ local current_gcc=gcc-8.5.0-10.1.el8_6.x86_64
kernel requires gcc version: 'gcc-8.5.0', current gcc version is 'gcc-8.5.0-10.1.el8_6.x86_64'
+ echo 'kernel requires gcc version: '\''gcc-8.5.0'\'', current gcc version is '\''gcc-8.5.0-10.1.el8_6.x86_64'\'''
+ [[ gcc-8.5.0 != \g\c\c\-\8\.\5\.\0\-\1\0\.\1\.\e\l\8\_\6\.\x\8\6\_\6\4* ]]
+ dnf install -q -y --releasever=8.6 gcc-8.5.0
++ rm -rf /tmp/tmp.2drLybZT9e
+ _create_driver_package
+ local pkg_name=nvidia-modules-4.18.0-builtin
+ local nvidia_sign_args=
+ local nvidia_modeset_sign_args=
+ local nvidia_uvm_sign_args=
+ trap 'make -s -j SYSSRC=/lib/modules/4.18.0-372.32.1.el8_6.x86_64/build clean > /dev/null' EXIT
+ echo 'Compiling NVIDIA driver kernel modules...'
Compiling NVIDIA driver kernel modules...
+ cd /usr/src/nvidia-515.65.01/kernel
+ _gpu_direct_rdma_enabled
+ '[' false = true ']'
+ return 1
+ make -s -j SYSSRC=/lib/modules/4.18.0-372.32.1.el8_6.x86_64/build nv-linux.o nv-modeset-linux.o
/usr/src/nvidia-515.65.01/kernel/nvidia/nv-dma.c: In function 'nv_dma_use_map_resource':
/usr/src/nvidia-515.65.01/kernel/nvidia/nv-dma.c:783:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
     const struct dma_map_ops *ops = get_dma_ops(dma_dev->dev);
     ^~~~~
In file included from ./include/linux/kernel.h:14,
                 from ./include/linux/list.h:9,
                 from ./include/linux/preempt.h:11,
                 from ./include/linux/spinlock.h:55,
                 from /usr/src/nvidia-515.65.01/kernel/common/inc/nv-lock.h:29,
                 from /usr/src/nvidia-515.65.01/kernel/common/inc/nv-linux.h:32,
                 from /usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.c:26:
/usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.c: In function 'nv_get_max_sysmem_address':
./include/linux/minmax.h:18:28: warning: comparison of distinct pointer types lacks a cast
  (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                            ^~
./include/linux/minmax.h:32:4: note: in expansion of macro '__typecheck'
   (__typecheck(x, y) && __no_side_effects(x, y))
    ^~~~~~~~~~~
./include/linux/minmax.h:42:24: note: in expansion of macro '__safe_cmp'
  __builtin_choose_expr(__safe_cmp(x, y), \
                        ^~~~~~~~~~
./include/linux/minmax.h:58:19: note: in expansion of macro '__careful_cmp'
 #define max(x, y) __careful_cmp(x, y, >)
                   ^~~~~~~~~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.c:225:26: note: in expansion of macro 'max'
         global_max_pfn = max(global_max_pfn, node_end_pfn(node_id));
                          ^~~
/usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.c: In function 'uvm_pmm_gpu_alloc_kernel':
/usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.c:645:16: warning: unused variable 'gpu' [-Wunused-variable]
     uvm_gpu_t *gpu = uvm_pmm_to_gpu(pmm);
                ^~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'cursor_plane_req_config_update':
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:88:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
     struct nv_drm_plane_state *nv_drm_plane_state =
                                ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:87:27: warning: unused variable 'nv_dev' [-Wunused-variable]
     struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
                           ^~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'plane_req_config_update':
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:189:9: warning: unused variable 'ret' [-Wunused-variable]
     int ret = 0;
         ^~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_plane_atomic_set_property':
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:504:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
     struct nv_drm_plane_state *nv_drm_plane_state =
                                ^~~~~~~~~~~~~~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_enumerate_crtcs_and_planes':
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:1148:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
             struct drm_plane *overlay_plane =
             ^~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event':
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable 'overlay_event' [-Wunused-variable]
     bool overlay_event = false;
          ^~~~~~~~~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable 'primary_event' [-Wunused-variable]
     bool primary_event = false;
          ^~~~~~~~~~~~~
/usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable]
     struct drm_plane *primary_plane = crtc->primary;
                       ^~~~~~~~~~~~~
Relinking NVIDIA driver kernel modules...
+ echo 'Relinking NVIDIA driver kernel modules...'
+ rm -f nvidia.ko nvidia-modeset.ko
+ ld -d -r -o nvidia.ko ./nv-linux.o ./nvidia/nv-kernel.o_binary
+ ld -d -r -o nvidia-modeset.ko ./nv-modeset-linux.o ./nvidia-modeset/nv-modeset-kernel.o_binary
+ '[' -n '' ']'
Building NVIDIA driver package nvidia-modules-4.18.0-builtin...
+ echo 'Building NVIDIA driver package nvidia-modules-4.18.0-builtin...'
+ ../mkprecompiled --pack nvidia-modules-4.18.0-builtin --description 4.18.0-372.32.1.el8_6.x86_64 --proc-mount-point /lib/modules/4.18.0-372.32.1.el8_6.x86_64/proc --driver-version 515.65.01 --kernel-interface nv-linux.o --linked-module-name nvidia.ko --core-object-name nvidia/nv-kernel.o_binary --target-directory . --kernel-interface nv-modeset-linux.o --linked-module-name nvidia-modeset.ko --core-object-name nvidia-modeset/nv-modeset-kernel.o_binary --target-directory . --kernel-module nvidia-uvm.ko --target-directory .
+ mkdir -p precompiled
+ mv nvidia-modules-4.18.0-builtin precompiled
++ make -s -j SYSSRC=/lib/modules/4.18.0-372.32.1.el8_6.x86_64/build clean
+ _cleanup_package_cache
+ '[' builtin '!=' builtin ']'
Installing NVIDIA driver kernel modules...
+ _install_driver
+ install_args=()
+ local install_args
+ echo 'Installing NVIDIA driver kernel modules...'
+ cd /usr/src/nvidia-515.65.01
+ '[' '' = yes ']'
+ IGNORE_CC_MISMATCH=1
+ nvidia-installer --kernel-module-only --no-drm --ui=none --no-nouveau-check

WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.

ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)

Welcome to the NVIDIA Software Installer for Unix/Linux

Detected 4 CPUs online; setting concurrency level to 4.
Installing NVIDIA driver version 515.65.01.
A precompiled kernel interface for kernel '4.18.0-372.32.1.el8_6.x86_64' has been found here: ./kernel/precompiled/nvidia-modules-4.18.0-builtin.
Required file 'nvidia-peermem.ko' not found in package './kernel/precompiled/nvidia-modules-4.18.0-builtin'
Performing CC sanity check with CC="/usr/bin/cc".
Performing CC check.
Kernel source path: '/lib/modules/4.18.0-372.32.1.el8_6.x86_64/source'

Kernel output path: '/lib/modules/4.18.0-372.32.1.el8_6.x86_64/build'

Performing Compiler check.
Performing Dom0 check.
Performing Xen check.
Performing PREEMPT_RT check.
Performing vgpu_kvm check.
Cleaning kernel module build directory.
Building kernel modules
  : [##############################] 100%

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.

ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

2. Steps to reproduce the issue

Install OpenShift 4.11.13
Workers with NVIDIA A100 HGX passthrough
Install Node Feature Discovery Operator v4.11.0-202210262118
Create NodeFeatureDiscovery to label nodes
Install NVIDIA GPU Operator v22.9.0
Create ClusterPolicy with default values
The nvidia-driver-daemonset pods keep restarting after the driver install fails

shivamerla commented 2 years ago

@mbio16 Can you attach the output of "journalctl -xb" from this node? It might show the actual error when we attempt to load the nvidia modules. Also, can you share more details on the system: which hypervisor and version, and is it EFI boot?
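
For anyone collecting this output, a minimal Python sketch of narrowing the journal down to driver-related lines might look like the following. It is an illustration only: the function names are hypothetical, the choice of `journalctl -k -b` (kernel messages for the current boot) instead of `-xb` is an assumption to reduce noise, and the match list (`nvidia`, `NVRM`, `nouveau`) is a guess at what is relevant.

```python
import re
import subprocess

# Case-insensitive markers assumed to be relevant: "NVRM" is the prefix the
# NVIDIA kernel module logs with; "nouveau" is the conflicting open driver.
PATTERN = re.compile(r"nvidia|NVRM|nouveau", re.IGNORECASE)


def filter_driver_lines(log_text):
    """Return the lines of log_text that mention the NVIDIA driver or nouveau."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


def read_kernel_journal():
    """Read this boot's kernel messages; must be run on the affected node."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the node, `print("\n".join(filter_driver_lines(read_kernel_journal())))` would print only the matching lines; the full `journalctl -xb` output is still the better thing to attach to the issue.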

mbio16 commented 2 years ago

I will.

system info:

3 worker nodes (2 of them have GPUs)
3 control plane nodes
GPU: A100 HGX 80 GB
Hypervisor: VMware ESXi 7.0.3
Boot: BIOS

shivamerla commented 2 years ago

The VM has to be configured with EFI boot, and the following PCI params are required in the VM config. The above error during driver load is seen without these settings.

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB=128
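
As a rough way to verify these settings, the Python sketch below checks the text of a VM's .vmx file for the two PCI params and for EFI firmware. It rests on assumptions not stated in this thread: that the settings live in a .vmx file as `key = "value"` lines, and that EFI boot shows up there as `firmware = "efi"`. The function names are hypothetical, and the value 128 is simply the one quoted above.

```python
# Expected settings, with keys and values lowercased for comparison.
# "firmware" is an assumed .vmx key for EFI boot; the two pciPassthru
# entries are the params from the comment above.
REQUIRED = {
    "firmware": "efi",
    "pcipassthru.use64bitmmio": "true",
    "pcipassthru.64bitmmiosizegb": "128",
}


def parse_vmx(text):
    """Parse `key = "value"` lines into a dict with lowercased keys."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def missing_settings(text, required=REQUIRED):
    """Return {key: (expected, found)} for settings that are absent or differ."""
    found = parse_vmx(text)
    problems = {}
    for key, expected in required.items():
        actual = found.get(key)
        if actual is None or actual.lower() != expected:
            problems[key] = (expected, actual)
    return problems
```

An empty dict from `missing_settings(open("worker.vmx").read())` would mean all three settings match; `worker.vmx` is a placeholder path.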

mbio16 commented 2 years ago

@shivamerla The PCI params have been set for these VMs with BIOS boot. I tried creating a new VM node with EFI boot, but the VM starts and shuts down immediately. I tried with and without the secure boot option; both give the same result. With BIOS, CoreOS boots normally. The boot ISO is the one generated by the OpenShift page "Install OpenShift with the Assisted Installer".

mbio16 commented 1 year ago

Hi,

the solution is to run EFI without secure boot. BIOS mode caused the kernel compile error. EFI boot with secure boot caused an error related to the driver signature. EFI boot without the secure boot option gives a valid worker install that works with the NVIDIA GPU Operator.
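
To confirm which of the three modes a worker actually booted in, a small Python sketch like the one below can be run on the node. It assumes the standard Linux sysfs layout: `/sys/firmware/efi` exists only on EFI boots, and the `SecureBoot` EFI variable is exposed through efivarfs with four attribute bytes before its one-byte value. The function name and the `sys_root` parameter are hypothetical and exist only to make the check testable.

```python
from pathlib import Path

# GUID of the EFI global variable namespace that holds "SecureBoot".
EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"


def boot_mode(sys_root="/sys"):
    """Return 'BIOS', 'EFI', or 'EFI+SecureBoot' based on sysfs contents."""
    efi_dir = Path(sys_root) / "firmware" / "efi"
    if not efi_dir.is_dir():
        return "BIOS"
    var = efi_dir / "efivars" / f"SecureBoot-{EFI_GLOBAL_GUID}"
    try:
        data = var.read_bytes()
    except OSError:
        # Variable missing or unreadable: treat secure boot as not enabled.
        return "EFI"
    # efivarfs stores 4 attribute bytes first; the fifth byte is the value.
    if len(data) >= 5 and data[4] == 1:
        return "EFI+SecureBoot"
    return "EFI"
```

Per the findings above, `boot_mode()` should return "EFI" on a worker that is expected to work with the GPU Operator.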

Hope this comment saves other admins from struggling.
