NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Failing to install nvidia drivers on a new GPU node on a fresh LTS Ubuntu 22.04 #504

Open Bec-k opened 1 year ago

Bec-k commented 1 year ago

Installing the NVIDIA drivers fails on a new GPU node running a fresh Ubuntu 22.04 LTS install.

The logs below are taken from the NVIDIA driver installation daemonset's pod nvidia-driver-daemonset-srf9k:

nvidia-driver-daemonset-47m44 nvidia-driver-ctr DRIVER_ARCH is x86_64
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Creating directory NVIDIA-Linux-x86_64-515.65.01
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Verifying archive integrity... OK
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 515.65.01...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr ========== NVIDIA Software Installer ==========
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Starting installation of NVIDIA driver version 515.65.01 for Linux kernel version 5.15.0-58-generic
nvidia-driver-daemonset-47m44 nvidia-driver-ctr
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Checking NVIDIA driver packages...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Updating the package cache...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Resolving Linux kernel version...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Proceeding with Linux kernel version 5.15.0-58-generic
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Installing Linux kernel headers...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Installing Linux kernel module files...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Generating Linux kernel version string...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Compiling NVIDIA driver kernel modules...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   You are using:           cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-i2c.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** Waiting for unfinished jobs....
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-pci.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-dmabuf.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-acpi.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-mmap.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-dma.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-p2p.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-procfs.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-cray.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/os-interface.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-pat.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-usermap.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/os-pci.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-vtophys.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/os-usermap.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-modeset-interface.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/os-mlock.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/os-registry.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-memdbg.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-report-err.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-ibmnpu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-caps.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-frontend.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nvlink_linux.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/linux_nvswitch.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_common.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv_uvm_interface.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-msi.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/nv-rsync.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/procfs_nvswitch.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_global.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/nv-kthread-q-selftest.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia/i2c_nvswitch.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_linux.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_procfs.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_gpu_isr.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_gpu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_va_space.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:298: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_tools.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_va_space_mm.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_hal.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_lock.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_rm_mem.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_gpu_semaphore.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_mem.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_channel.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_range_tree.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_gpu_non_replayable_faults.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_va_policy.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_range_allocator.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_va_range.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_gpu_replayable_faults.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_va_block.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_range_group.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_gpu_access_counters.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pte_batch.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_maxwell.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_tlb_batch.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_thread_context.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_perf_module.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_perf_events.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_maxwell_host.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pushbuffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_tracker.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_mmu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_push.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_maxwell_access_counter_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pascal.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_maxwell_fault_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_maxwell_ce.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pascal_host.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_maxwell_mmu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pascal_mmu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:298: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_volta.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pascal_ce.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pascal_fault_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_volta_mmu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_volta_fault_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_turing_mmu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_turing_fault_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_turing.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   445 |     int status = 0;
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |     ^~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_turing_access_counter_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_volta_host.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_volta_access_counter_buffer.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_ampere_host.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_ampere_ce.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable 'overlay_event' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr    98 |     bool overlay_event = false;
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |          ^~~~~~~~~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable 'primary_event' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr    97 |     bool primary_event = false;
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |          ^~~~~~~~~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr    96 |     struct drm_plane *primary_plane = crtc->primary;
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |                       ^~~~~~~~~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_sysmem.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'cursor_plane_req_config_update':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:88:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr    88 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |                                ^~~~~~~~~~~~~~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:87:27: warning: unused variable 'nv_dev' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr    87 |     struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |                           ^~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'plane_req_config_update':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:189:9: warning: unused variable 'ret' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   189 |     int ret = 0;
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |         ^~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_plane_atomic_set_property':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:504:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   504 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |                                ^~~~~~~~~~~~~~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_enumerate_crtcs_and_planes':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:1148:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr  1148 |             struct drm_plane *overlay_plane =
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |             ^~~~~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_turing_host.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_policy.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.c: In function 'uvm_pmm_gpu_alloc_kernel':
nvidia-driver-daemonset-47m44 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.c:645:16: warning: unused variable 'gpu' [-Wunused-variable]
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   645 |     uvm_gpu_t *gpu = uvm_pmm_to_gpu(pmm);
nvidia-driver-daemonset-47m44 nvidia-driver-ctr       |                ^~~
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[1]: *** [Makefile:1902: /usr/src/nvidia-515.65.01/kernel] Error 2
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make: *** [Makefile:82: modules] Error 2
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-47m44 nvidia-driver-ctr Unmounting NVIDIA driver rootfs...

I'm more concerned about this:

nvidia-driver-daemonset-47m44 nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   You are using:           cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0

I checked: gcc is installed on that machine, and it is exactly gcc 11.3.0:

root@vultr:~# apt-cache policy gcc-11
gcc-11:
  Installed: 11.3.0-1ubuntu1~22.04
  Candidate: 11.3.0-1ubuntu1~22.04
  Version table:
 *** 11.3.0-1ubuntu1~22.04 500
        500 http://us.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        500 http://us.archive.ubuntu.com/ubuntu jammy-security/main amd64 Packages
        100 /var/lib/dpkg/status
     11.2.0-19ubuntu1 500
        500 http://us.archive.ubuntu.com/ubuntu jammy/main amd64 Packages

and:

root@vultr:~# gcc --version
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
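
A side note on the check above: the module build runs inside the `nvidia-driver-ctr` container, so the `You are using` line presumably reports the compiler shipped in the driver image rather than the host's gcc. That would explain why the host shows 11.3.0 while the warning shows 11.2.0, but it is an assumption, not something the log states. A minimal Python sketch to pull both versions out of the warning and compare them (the function name `compiler_versions` is hypothetical; the sample lines are copied from the log above):

```python
import re

# Sample lines copied from the driver container log above.
LOG = """\
nvidia-driver-daemonset-47m44 nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-47m44 nvidia-driver-ctr   You are using:           cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0
"""

# Matches e.g. "The kernel was built by: gcc (Ubuntu ...) 11.3.0"
PATTERN = re.compile(
    r"(The kernel was built by|You are using):\s+\S+ \(([^)]*)\) (\d+(?:\.\d+)+)"
)


def compiler_versions(log_text):
    """Return {'kernel': (package, version), 'build': (package, version)}."""
    found = {}
    for label, package, version in PATTERN.findall(log_text):
        key = "kernel" if label.startswith("The kernel") else "build"
        found[key] = (package, version)
    return found


versions = compiler_versions(LOG)
# True when the kernel's compiler and the build compiler differ.
mismatch = versions["kernel"] != versions["build"]
print(versions, "mismatch:", mismatch)
```

The comparison ignores the binary name (`gcc` vs `cc`) and looks only at the package string and version number.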

Bec-k commented 1 year ago

Using docker image: nvcr.io/nvidia/driver:515.65.01-ubuntu22.04

Bec-k commented 1 year ago

And nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.4.2

Bec-k commented 1 year ago

Using operator gpu-operator-v22.9.0

Bec-k commented 1 year ago

Upgraded gpu-operator to the latest version, v22.9.2, and now waiting for the driver to compile.

nvidia-driver-daemonset-tgncp nvidia-driver-ctr ========== NVIDIA Software Installer ==========
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Checking NVIDIA driver packages...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Updating the package cache...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Resolving Linux kernel version...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Proceeding with Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel headers...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel module files...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Generating Linux kernel version string...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Compiling NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Shouldn't it be installed without compiling?

Bec-k commented 1 year ago

Compilation is kinda overwhelming for smaller GPU nodes (screenshot of node resource usage omitted).

Bec-k commented 1 year ago

Swap is being used, and around 12 GB of memory is used to compile. That's a lot; should it be like that? I know this is not gpu-operator project code, just asking around. Looks like something is off there.
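
For what it's worth, `cc: fatal error: Killed signal terminated program cc1` usually means the compiler process received SIGKILL, and on a node that is short on memory the kernel OOM killer is a common sender. That is an assumption here, not something the log proves. A rough Python sketch to count how many compile jobs died this way and which objects failed (the helper name `count_killed_compiles` is made up; the sample lines are copied from the log above):

```python
import re

# A few lines copied from the failing driver build log above.
LOG = """\
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_volta_mmu.o] Error 1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-47m44 nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-47m44 nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_turing_mmu.o] Error 1
"""

KILLED = "Killed signal terminated program cc1"
# Captures the object file path from lines like
# "make[2]: *** [scripts/Makefile.build:297: /path/to/file.o] Error 1"
FAILED_OBJECT = re.compile(r"make\[\d+\]: \*\*\* \[[^\]]*?: (\S+\.o)\] Error \d+")


def count_killed_compiles(log_text):
    """Return (number of killed cc1 processes, list of failed object names)."""
    killed = sum(1 for line in log_text.splitlines() if KILLED in line)
    objects = [
        m.group(1).rsplit("/", 1)[-1] for m in FAILED_OBJECT.finditer(log_text)
    ]
    return killed, objects


killed, objects = count_killed_compiles(LOG)
print(killed, objects)
```

If the count is high, checking `dmesg` on the node for out-of-memory messages would confirm or rule out the OOM killer.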

Bec-k commented 1 year ago

It is failing as well. Here is the full log of the driver manager and the driver installer:

+ nvidia-driver-daemonset-tgncp › nvidia-driver-ctr
+ nvidia-driver-daemonset-tgncp › k8s-driver-manager
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.mig-manager='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nodeType='
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of AUTO_UPGRADE_POLICY_ENABLED='
nvidia-driver-daemonset-tgncp k8s-driver-manager Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 labeled
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager pod/nvidia-operator-validator-jwq5h condition met
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for the container-toolkit to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager pod/nvidia-container-toolkit-daemonset-g4hd6 condition met
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for the device-plugin to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for gpu-feature-discovery to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for dcgm-exporter to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for dcgm to shutdown
nvidia-driver-daemonset-tgncp nvidia-driver-ctr DRIVER_ARCH is x86_64
nvidia-driver-daemonset-tgncp k8s-driver-manager Auto upgrade policy of the GPU driver on the node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 is disabled
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Creating directory NVIDIA-Linux-x86_64-525.60.13
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Verifying archive integrity... OK
nvidia-driver-daemonset-tgncp k8s-driver-manager Cordoning node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7...
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 cordoned
nvidia-driver-daemonset-tgncp k8s-driver-manager Draining node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 of any GPU pods...
nvidia-driver-daemonset-tgncp k8s-driver-manager W0321 11:51:52.622197   20244 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
nvidia-driver-daemonset-tgncp k8s-driver-manager time="2023-03-21T11:51:52Z" level=info msg="Identifying GPU pods to delete"
nvidia-driver-daemonset-tgncp k8s-driver-manager time="2023-03-21T11:51:52Z" level=info msg="No GPU pods to delete. Exiting."
nvidia-driver-daemonset-tgncp k8s-driver-manager unbinding device 0000:04:00.0
nvidia-driver-daemonset-tgncp k8s-driver-manager Auto upgrade policy of the GPU driver on the node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 is disabled
nvidia-driver-daemonset-tgncp k8s-driver-manager Uncordoning node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7...
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 uncordoned
nvidia-driver-daemonset-tgncp k8s-driver-manager Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 labeled
- nvidia-driver-daemonset-tgncp › k8s-driver-manager
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr ========== NVIDIA Software Installer ==========
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Checking NVIDIA driver packages...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Updating the package cache...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Resolving Linux kernel version...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Proceeding with Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel headers...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel module files...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Generating Linux kernel version string...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Compiling NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   445 |     int status = 0;
nvidia-driver-daemonset-tgncp nvidia-driver-ctr       |     ^~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia/nv.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** Waiting for unfinished jobs....
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hopper_mmu.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hopper_host.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia/os-registry.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia/nv-mmap.c: In function 'nv_encode_caching':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia/nv-mmap.c:353:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   353 |             if (NV_ALLOW_CACHING(memory_type))
nvidia-driver-daemonset-tgncp nvidia-driver-ctr       |                ^
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia/nv-mmap.c:356:9: note: here
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   356 |         default:
nvidia-driver-daemonset-tgncp nvidia-driver-ctr       |         ^~~~~~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia/nv-caps.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_tools.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_procfs.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hopper_ce.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:298: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hal.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_va_range.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_va_space.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-crtc.c:678:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   678 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-daemonset-tgncp nvidia-driver-ctr       |     ^~~~~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c: In function 'test_unexpected_completed_values':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c:156:15: warning: unused variable 'status' [-Wunused-variable]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr   156 |     NV_STATUS status;
nvidia-driver-daemonset-tgncp nvidia-driver-ctr       |               ^~~~~~

nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[1]: *** [Makefile:1906: /usr/src/nvidia-525.60.13/kernel] Error 2
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make: *** [Makefile:82: modules] Error 2
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unmounting NVIDIA driver rootfs...

It keeps retrying the installation and always fails.

I will try installing it on the latest non-LTS Ubuntu.

Bec-k commented 1 year ago

OK, interesting: with Ubuntu 22.10 it worked:

nvidia-driver-daemonset-x48wf k8s-driver-manager Tue Mar 21 12:53:42 2023       
nvidia-driver-daemonset-x48wf k8s-driver-manager +-----------------------------------------------------------------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager | NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
nvidia-driver-daemonset-x48wf k8s-driver-manager |-------------------------------+----------------------+----------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
nvidia-driver-daemonset-x48wf k8s-driver-manager | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
nvidia-driver-daemonset-x48wf k8s-driver-manager |                               |                      |               MIG M. |
nvidia-driver-daemonset-x48wf k8s-driver-manager |===============================+======================+======================|
nvidia-driver-daemonset-x48wf k8s-driver-manager |   0  GRID A100D-4C       On   | 00000000:04:00.0 Off |                    0 |
nvidia-driver-daemonset-x48wf k8s-driver-manager | N/A   N/A    P0    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
nvidia-driver-daemonset-x48wf k8s-driver-manager |                               |                      |             Disabled |
nvidia-driver-daemonset-x48wf k8s-driver-manager +-------------------------------+----------------------+----------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager                                                                                
nvidia-driver-daemonset-x48wf k8s-driver-manager +-----------------------------------------------------------------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager | Processes:                                                                  |
nvidia-driver-daemonset-x48wf k8s-driver-manager |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
nvidia-driver-daemonset-x48wf k8s-driver-manager |        ID   ID                                                   Usage      |
nvidia-driver-daemonset-x48wf k8s-driver-manager |=============================================================================|
nvidia-driver-daemonset-x48wf k8s-driver-manager |  No running processes found                                                 |
nvidia-driver-daemonset-x48wf k8s-driver-manager +-----------------------------------------------------------------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager NVIDIA GPU driver is already pre-installed on the node, disabling the containerized driver on the node
nvidia-driver-daemonset-x48wf k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-67942f labeled

Even without the driver installation, the driver was already there. I checked the Vultr logs from when the node was being created: they pre-install the drivers before handing the node over. It seems there was, or still is, some problem with other Ubuntu releases and driver versions.
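
To compare what is pre-installed against what the operator tried to build, the driver version can be read from the `nvidia-smi` banner in the log. A small Python sketch, assuming the banner format shown above (the function name `driver_version_from_smi` is hypothetical):

```python
import re

# Banner line copied from the k8s-driver-manager log above.
LOG_LINE = (
    "nvidia-driver-daemonset-x48wf k8s-driver-manager "
    "| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |"
)

BANNER = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")


def driver_version_from_smi(text):
    """Return (driver_version, cuda_version), or None if no banner is found."""
    m = BANNER.search(text)
    return (m.group(1), m.group(2)) if m else None


preinstalled = driver_version_from_smi(LOG_LINE)
print(preinstalled)
```

Here the pre-installed driver is 525.85.05, while the containerized install on 22.04 was attempting 525.60.13.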

Bec-k commented 1 year ago

There is now another problem, this time with the toolkit:

+ nvidia-container-toolkit-daemonset-zk8cn › nvidia-container-toolkit-ctr
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Starting nvidia-toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Parsing arguments"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Verifying Flags"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg=Initializing
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Removing existing NVIDIA container toolkit installation"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Finding library libnvidia-ml.so (root=/)"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-ml.so': error resolving link '/usr/lib64/libnvidia-ml.so': lstat /usr/lib64/libnvidia-ml.so: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so': error resolving link '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so': lstat /usr/lib/x86_64-linux-gnu/libnvidia-ml.so: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/aarch64-linux-gnu/libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib/aarch64-linux-gnu/libnvidia-ml.so': error resolving link '/usr/lib/aarch64-linux-gnu/libnvidia-ml.so': lstat /usr/lib/aarch64-linux-gnu: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=warning msg="Error finding library path for root /: error locating NVIDIA management library: error locating library 'libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Using library root "
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr Using config:
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr accept-nvidia-visible-devices-as-volume-mounts = false
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr accept-nvidia-visible-devices-envvar-when-unprivileged = true
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr disable-require = false
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr 
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr [nvidia-container-cli]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   environment = []
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   ldconfig = "@/sbin/ldconfig.real"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   load-kmods = true
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   root = "/"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr 
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr [nvidia-container-runtime]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   log-level = "info"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   mode = "auto"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   runtimes = ["docker-runc", "runc"]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr 
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr   [nvidia-container-runtime.modes]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr 
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr     [nvidia-container-runtime.modes.csv]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr       mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Setting up runtime"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Starting 'setup' for containerd"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully parsed arguments"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully loaded config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Config version: 2"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Updating config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully updated config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Flushing config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully flushed config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Shutting Down"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: error running containerd command: signal: hangup"
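
In a log this long the only line that matters is easy to miss. As a minimal sketch (not part of the toolkit; `parse_logfmt` and `problems` are hypothetical helper names), the `key=value` lines above can be filtered down to warnings and errors like this:

```python
import re

# A key=value pair whose value is either a double-quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return the key/value fields found in one logfmt-style line."""
    fields = {}
    for key, value in PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def problems(lines, levels=("warning", "error")):
    """Yield (level, msg) for every line whose level is in `levels`."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") in levels:
            yield fields["level"], fields.get("msg", "")


# Two lines copied from the toolkit log above (pod/container prefix omitted).
sample = [
    'time="2023-03-21T12:57:11Z" level=info msg="Sending SIGHUP signal to containerd"',
    'time="2023-03-21T12:57:11Z" level=error msg="error running nvidia-toolkit: '
    'unable to setup runtime: error running containerd command: signal: hangup"',
]

for level, msg in problems(sample):
    print(level, msg)
```

Run against the full log, this leaves the `libnvidia-ml.so` warning and the final `signal: hangup` error, which is the line that makes the pod fail.
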
Bec-k commented 1 year ago

Logs from gpu-operator pod:

gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7050278,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7056942,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yona-pool-sad-lovela-67942f","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7057347,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"scw-k8s-suspicious-yona-pool-sad-lovela-67942f"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.705751,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yona-pool-sad-lovela-e8d8cd","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.705805,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yonath-default-d556e605e276","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7060485,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yon-pool-quizzical-j-fe9338","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7060602,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7066238,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7068782,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RuntimeClass":"nvidia"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7171223,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.717339,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7325332,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7460692,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7744899,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8070672,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8314767,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8554163,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.870571,"logger":"controllers.ClusterPolicy","msg":"5.19.0-29-generic","Request.Namespace":"default","Request.Name":"Node"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8736198,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-driver-daemonset","Namespace":"gpu-operator","name":"nvidia-driver-daemonset"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.874162,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8898356,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-container-toolkit","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.908736,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-container-toolkit","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9388857,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-container-toolkit","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9490485,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-container-toolkit-daemonset","Namespace":"gpu-operator","name":"nvidia-container-toolkit-daemonset"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9498417,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9706805,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.991417,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.017446,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.0417135,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.0707805,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.0964117,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"gpu-operator","name":"nvidia-operator-validator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.096793,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1137688,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1315594,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1568701,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1907294,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2151012,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2266555,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"gpu-operator","name":"nvidia-device-plugin-daemonset"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2272947,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2803714,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.295747,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3126926,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3409662,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3533444,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.365767,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"gpu-operator","name":"nvidia-dcgm-exporter"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3658545,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm-exporter","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3879695,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.4111402,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.4461775,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.4819763,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.512175,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.525641,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"gpu-operator","name":"gpu-feature-discovery"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.52591,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.5471108,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.5630667,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.5940511,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.61711,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.6435907,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.673655,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-mig-parted-config","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.7027156,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.7207994,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"gpu-operator","name":"nvidia-mig-manager"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.720926,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.790162,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.8474965,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-manager","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.905581,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.9758897,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403564.0551643,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403564.111743,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403564.1118217,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy isn't ready","states not ready":["state-container-toolkit","state-operator-validation","state-device-plugin","state-dcgm-exporter","gpu-feature-discovery"]}
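
The operator log is one JSON object per line behind a pod/container prefix, so the state summary can be pulled out mechanically. A rough sketch, assuming that line layout (`step_statuses` is a hypothetical helper, not an operator API):

```python
import json


def step_statuses(lines):
    """Map each ClusterPolicy state name to the last status logged for it."""
    statuses = {}
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # no JSON payload on this line
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # truncated or non-JSON line
        if record.get("msg") == "ClusterPolicy step completed":
            # The key in the log above really is "state:" with a trailing colon.
            statuses[record.get("state:")] = record.get("status")
    return statuses


# Two lines copied from the operator log above.
sample = [
    'gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.874162,'
    '"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed",'
    '"state:":"state-driver","status":"ready"}',
    'gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9498417,'
    '"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed",'
    '"state:":"state-container-toolkit","status":"notReady"}',
]

statuses = step_statuses(sample)
not_ready = [name for name, status in statuses.items() if status == "notReady"]
print(not_ready)
```

In the log above, `state-container-toolkit` is the first state to report `notReady`, and the other `notReady` states (validator, device plugin, dcgm-exporter, gpu-feature-discovery) all come after it.
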
Bec-k commented 1 year ago

All other pods are failing to start due to this error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

They are just stuck in the PodInitializing state:

Failed to load logs: container "nvidia-device-plugin" in pod "nvidia-device-plugin-daemonset-hdrrv" is waiting to start: PodInitializing
Reason: BadRequest (400)
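
The `no runtime for "nvidia" is configured` message means containerd has no `nvidia` runtime entry in its loaded CRI configuration. A quick way to check whether the toolkit pod actually managed to write one is to scan the containerd config for the runtime table. This is only a sketch: it is a plain text scan rather than a TOML parser, `has_runtime` is a hypothetical helper, and the sample config text below is illustrative, not copied from the node.

```python
import re

# A TOML table header such as [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
HEADER = re.compile(r'^\s*\[([^\[\]]+)\]\s*$')


def has_runtime(config_text, name="nvidia"):
    """Return True if a containerd runtime table called `name` is declared."""
    suffix = ".containerd.runtimes." + name
    for line in config_text.splitlines():
        match = HEADER.match(line)
        if not match:
            continue
        header = match.group(1).strip()
        # Accept the table itself or any sub-table such as "...nvidia.options".
        if header.endswith(suffix) or (suffix + ".") in header:
            return True
    return False


# Illustrative config contents (assumed, not taken from the thread).
without_nvidia = '''
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
'''

with_nvidia = without_nvidia + '''
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
'''

print(has_runtime(without_nvidia), has_runtime(with_nvidia))
```

On a real node the text would come from the containerd config file on the host (commonly `/etc/containerd/config.toml`, but that path is an assumption here; the toolkit log only shows the in-container path `/runtime/config-dir/config.toml`). Even if the entry is present, containerd still has to reload the config before the runtime is usable.
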
Bec-k commented 1 year ago

It seems this is because the runtime toolkit is not supported on Ubuntu 22.10, only on Ubuntu 22.04. But where is the error that states this and causes the runtime installation to fail?

Bec-k commented 1 year ago

I reverted to Ubuntu 22.04. Here is the node creation; they provision it with NVIDIA drivers: [image]

So there shouldn't be any problems with gpu-operator... But gpu-operator is installing another driver for some reason. Here is the nvidia-smi output:

root@vultr:~# nvidia-smi
Tue Mar 21 13:36:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-4C       On   | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Bec-k commented 1 year ago

Ok, so one problem was found: the docker image was trying to call /usr/bin/nvidia-toolkit instead of /usr/bin/nvidia-container-toolkit. Creating a link helped to solve that problem: ln -s /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-toolkit

After all validations had passed and all pods turned green, I decided to restart all pods, and it failed again on the toolkit pod:

time="2023-03-21T14:45:44Z" level=info msg="Updating config"
time="2023-03-21T14:45:44Z" level=info msg="Successfully updated config"
time="2023-03-21T14:45:44Z" level=info msg="Flushing config"
time="2023-03-21T14:45:44Z" level=info msg="Successfully flushed config"
time="2023-03-21T14:45:44Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:45:44Z" level=info msg="Shutting Down"
time="2023-03-21T14:45:44Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: error running containerd command: signal: hangup"
Bec-k commented 1 year ago

After a few containerd restarts and killing the toolkit pod, I managed to make it work... Very strange behavior...

time="2023-03-21T14:48:21Z" level=info msg="Setting up runtime"
time="2023-03-21T14:48:21Z" level=info msg="Starting 'setup' for containerd"
time="2023-03-21T14:48:21Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2023-03-21T14:48:21Z" level=info msg="Successfully parsed arguments"
time="2023-03-21T14:48:21Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
time="2023-03-21T14:48:21Z" level=info msg="Successfully loaded config"
time="2023-03-21T14:48:21Z" level=info msg="Config version: 2"
time="2023-03-21T14:48:21Z" level=info msg="Updating config"
time="2023-03-21T14:48:21Z" level=info msg="Successfully updated config"
time="2023-03-21T14:48:21Z" level=info msg="Flushing config"
time="2023-03-21T14:48:21Z" level=info msg="Successfully flushed config"
time="2023-03-21T14:48:21Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:48:21Z" level=info msg="Successfully signaled containerd"
time="2023-03-21T14:48:21Z" level=info msg="Completed 'setup' for containerd"
time="2023-03-21T14:48:21Z" level=info msg="Waiting for signal"
Bec-k commented 1 year ago

Why is it shutting down before these lines?

time="2023-03-21T14:48:21Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:48:21Z" level=info msg="Successfully signaled containerd"
time="2023-03-21T14:48:21Z" level=info msg="Completed 'setup' for containerd"
time="2023-03-21T14:48:21Z" level=info msg="Waiting for signal"

like here:

time="2023-03-21T14:45:44Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:45:44Z" level=info msg="Shutting Down"
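
The difference between the two runs can be spotted mechanically from the log text alone. Below is a rough sketch that relies only on the message strings quoted above; the function name toolkit_setup_status and its three result words are invented for illustration.

```shell
# Classify one toolkit container run from its log, read on stdin.
# Prints "ok" if setup completed, "failed" if the process shut down after
# sending SIGHUP without ever confirming the signal, "unknown" otherwise.
toolkit_setup_status() {
    awk '
        /Sending SIGHUP signal to containerd/ { sent = 1 }
        /Successfully signaled containerd/    { confirmed = 1 }
        /Completed .setup. for containerd/    { completed = 1 }
        /Shutting Down/                       { shutdown = 1 }
        END {
            if (completed) print "ok"
            else if (sent && !confirmed && shutdown) print "failed"
            else print "unknown"
        }'
}
```

Piping the toolkit pod's output into it, for example `kubectl logs <toolkit-pod> | toolkit_setup_status` (add the namespace flag your install uses), separates the 14:45:44 run ("failed") from the 14:48:21 run ("ok").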
Bec-k commented 1 year ago

Did it fail to restart containerd? Why is there no error then? How is "Successfully signaled containerd" verified?

Bec-k commented 1 year ago

Ok, so the linking issue is unrelated; it just can't restart containerd by sending a SIGHUP signal to it.

time="2023-03-21T14:45:44Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: error running containerd command: signal: hangup"
Bec-k commented 1 year ago

Node is using: containerd containerd.io 1.6.18 2456e983eb9e37e47538f59ea18f2043c9a73640

Bec-k commented 1 year ago

Ok, after digging into it, I found it in the sources: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/nvidia-toolkit/run.go#L250 and then the actual restart attempt here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L321

It seems that it is not selecting the systemd switch case, but is trying to signal containerd as if it were running as a standalone daemon, without an init system wrapper.

Bec-k commented 1 year ago

Because that is the default set here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L49

Bec-k commented 1 year ago

I don't see any setting in the Helm chart options to specify the method for restarting containerd.

Bec-k commented 1 year ago

I see that there is an env variable for that option, called CONTAINERD_RESTART_MODE, but I don't see it in the running container's env. I will try to modify the daemonset and see whether it is forwarded to the toolkit. https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L161

Bec-k commented 1 year ago

Yeah... When I added this env to the daemonset, it started working properly and without errors. It would be worth adding it to the Helm chart with a default value that matches the source code.

        - name: CONTAINERD_RESTART_MODE
          value: systemd
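
The same change can be made from the command line instead of editing the daemonset by hand. This is only a sketch: the namespace gpu-operator and the daemonset name nvidia-container-toolkit-daemonset are assumptions (neither appears in this thread), and the function name set_toolkit_restart_mode is invented. `kubectl set env` itself is a standard kubectl subcommand.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Set CONTAINERD_RESTART_MODE on the toolkit daemonset.
# Namespace and daemonset name are assumed defaults; confirm them with
# `kubectl get daemonsets -A` before running this for real.
set_toolkit_restart_mode() {
    mode="${1:-systemd}"
    namespace="${2:-gpu-operator}"
    daemonset="${3:-nvidia-container-toolkit-daemonset}"
    $KUBECTL set env "daemonset/$daemonset" \
        "CONTAINERD_RESTART_MODE=$mode" --namespace "$namespace"
}
```

Calling `set_toolkit_restart_mode systemd` applies the value shown above. If the chart version in use exposes a generic toolkit.env list in its values, setting the variable there is probably more durable than patching the daemonset.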

It would be much better if you tried both variants or identified how containerd is started on that node. That shouldn't be hard: query systemd for that service and its status. If both exist, use systemd; otherwise assume there is no systemd or no containerd systemd service, and restart it as a standalone daemon.
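
As a rough illustration of that detection idea (not how the toolkit actually behaves), the check could look like the sketch below when run on the node. The unit name "containerd", the mode name "signal" for the non-systemd path, and the function name detect_containerd_restart_mode are all assumptions made for this example.

```shell
# SYSTEMCTL can be overridden for testing or for a non-default path.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print "systemd" when systemctl exists and reports an active containerd
# unit; otherwise print "signal" (assumed name for the standalone path).
detect_containerd_restart_mode() {
    unit="${1:-containerd}"
    if command -v "$SYSTEMCTL" >/dev/null 2>&1 &&
       "$SYSTEMCTL" is-active --quiet "$unit" 2>/dev/null; then
        echo systemd
    else
        echo signal
    fi
}
```

The result could then be fed into CONTAINERD_RESTART_MODE instead of hard-coding a value per cluster.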

Hope this helps anyone else who runs into the same problem in the future. I wasted around 5 hours identifying this problem...

shivamerla commented 1 year ago

@denissabramovs even when run as a systemd service, by default the toolkit container will kill the main containerd process for least disruption to shim processes. The issue you were seeing was known with containerd v1.6.9+ and handled in the operator version v22.9.2. I think you had created that issue as well :) Regarding driver installs, we install them using runfiles, which compile and load the modules. We are working on adding support for pre-compiled images, but those will not be available for all kernel variants, only for -generic.
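
To see whether a node falls into that containerd v1.6.9+ range, a small version comparison is enough. This sketch assumes the version is the third field of the `containerd --version` line, as in the output quoted earlier in this thread, and it relies on `sort -V` from GNU coreutils. Both function names are made up for the example.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Uses `sort -V` (version sort) to order the two values.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Read a `containerd --version` style line on stdin and print its third
# field, dropping a leading "v" if present.
containerd_version_field() {
    awk '{ sub(/^v/, "", $3); print $3 }'
}
```

For example, `version_ge "$(containerd --version | containerd_version_field)" 1.6.9 && echo affected-range` flags nodes that need operator v22.9.2 or later.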

sacashgit commented 1 year ago

@denissabramovs - Did you manage to install the driver using the operator successfully, or did you rely on the pre-installed driver on the node by disabling the operator's driver? I think you disabled the driver via the operator, but I wanted to double check as I am facing a similar issue. (Thank you for the detailed updates, they are helping for sure.)

wjentner commented 11 months ago

I just ran into a similar problem. For me, the driver was not installed at all. Checking the labels, there was one that said: nvidia.com/gpu.deploy.driver=pre-installed. After removing this label, the driver installation started and completed successfully.

gpu-operator v23.6.1
Ubuntu 22.04.3 LTS
containerd.io 1.6.22 8165feabfdfe38c65b599c4993d227328c231fca
Kubernetes v1.25.13
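
For reference, inspecting and removing that label could look like the sketch below. The label key is the one quoted above; the function names are made up for the example, and KUBECTL can be set to echo to preview the commands before running them.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Show the driver-deploy label for every node as an extra column.
show_driver_deploy_label() {
    $KUBECTL get nodes -L nvidia.com/gpu.deploy.driver
}

# Remove the label from one node (a trailing "-" deletes a label).
clear_driver_deploy_label() {
    if [ -z "$1" ]; then
        echo "usage: clear_driver_deploy_label <node-name>" >&2
        return 1
    fi
    $KUBECTL label node "$1" nvidia.com/gpu.deploy.driver-
}
```

Run `show_driver_deploy_label` first to confirm which nodes carry the pre-installed value before clearing anything.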