NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia-driver-daemonset restart continuously #781

Open Hokwang opened 3 months ago

Hokwang commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

The nvidia-driver-daemonset pod does not work: on one node it never becomes Ready and is terminated and recreated continuously.

3. Steps to reproduce the issue

One day a node hung because of a memory problem, so I rebooted that server. Since the reboot, the nvidia-driver-daemonset pod on that node never settles into Running status.

4. Information to attach (optional if deemed irrelevant)

# kubectl -n gpu-operator get pod -l app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=gn10
NAME                            READY   STATUS    RESTARTS   AGE
nvidia-driver-daemonset-s4r9f   0/1     Running   0          85s

# kubectl -n gpu-operator get pod -l app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=gn10
NAME                            READY   STATUS        RESTARTS   AGE
nvidia-driver-daemonset-8d64z   0/1     Terminating   0          0s

While the pod is in the 0/1 Running state, its log is:

# kubectl -n gpu-operator logs nvidia-driver-daemonset-4ggkq -f
nvidia driver modules are not yet loaded, invoking runc directly
+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=535.129.03
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ OPEN_KERNEL_MODULES_ENABLED=false
+ [[ false == \t\r\u\e ]]
+ KERNEL_TYPE=kernel
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
DRIVER_ARCH is x86_64
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=init
+ shift
+ case "${command}" in
++ getopt -l accept-license -o a --
+ options=' --'
+ '[' 0 -ne 0 ']'
+ eval set -- ' --'
++ set -- --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-553.5.1.el8_10.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ for opt in ${options}
+ case "$opt" in
+ shift
+ break
+ '[' 0 -ne 0 ']'
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
Resolving RHEL version...
+ echo 'Resolving RHEL version...'
+ local version=
++ cat /host-etc/os-release
++ grep '^ID='
++ awk -F= '{print $2}'
++ sed -e 's/^"//' -e 's/"$//'
+ local id=rhel
+ '[' rhel = rhcos ']'
+ '[' rhel = rhel ']'
++ cat /host-etc/os-release
++ grep VERSION_ID
++ awk -F= '{print $2}'
++ sed -e 's/^"//' -e 's/"$//'
+ version=8.10
+ '[' -z 8.10 ']'
+ RHEL_VERSION=8.10
+ echo 'Proceeding with RHEL version 8.10'
+ [[ -z '' ]]
+ DNF_RELEASEVER=8.10
+ return 0
Proceeding with RHEL version 8.10
+ init
+ _prepare_exclusive
+ _prepare
+ '[' passthrough = vgpu ']'
+ sh NVIDIA-Linux-x86_64-535.129.03.run -x
Creating directory NVIDIA-Linux-x86_64-535.129.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.129.03...
+ cd NVIDIA-Linux-x86_64-535.129.03
+ sh /tmp/install.sh nvinstall
DRIVER_ARCH is x86_64

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.

WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.

WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.

WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.

+ mkdir -p /usr/src/nvidia-535.129.03
+ mv LICENSE mkprecompiled kernel /usr/src/nvidia-535.129.03
+ sed '9,${/^\(kernel\|LICENSE\)/!d}' .manifest
+ echo -e '\n========== NVIDIA Software Installer ==========\n'
+ echo -e 'Starting installation of NVIDIA driver version 535.129.03 for Linux kernel version 4.18.0-553.5.1.el8_10.x86_64\n'
+ exec
+ flock -n 3

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 535.129.03 for Linux kernel version 4.18.0-553.5.1.el8_10.x86_64

+ echo 581773
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
Stopping NVIDIA persistence daemon...
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ _build
+ _kernel_requires_package
+ local proc_mount_arg=
+ echo 'Checking NVIDIA driver packages...'
+ [[ ! -d /usr/src/nvidia-535.129.03/kernel ]]
+ cd /usr/src/nvidia-535.129.03/kernel
Checking NVIDIA driver packages...
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-553.5.1.el8_10.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
+ yum -q makecache
Updating the package cache...
+ _install_prerequisites
++ mktemp -d
+ local tmp_dir=/tmp/tmp.FvAUfm7jGQ
+ trap 'rm -rf /tmp/tmp.FvAUfm7jGQ' EXIT
+ cd /tmp/tmp.FvAUfm7jGQ
+ echo 'Installing elfutils...'
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
Installing elfutils...

Upgraded:
  elfutils-libelf-0.190-2.el8.x86_64      elfutils-libs-0.190-2.el8.x86_64
Installed:
  elfutils-debuginfod-client-0.190-2.el8.x86_64
  elfutils-libelf-devel-0.190-2.el8.x86_64
  libzstd-devel-1.4.4-1.el8.x86_64
  zlib-devel-1.2.11-21.el8_7.x86_64

+ rm -rf /lib/modules/4.18.0-553.5.1.el8_10.x86_64
+ mkdir -p /lib/modules/4.18.0-553.5.1.el8_10.x86_64/proc
Enabling RHOCP and EUS RPM repos...
+ echo 'Enabling RHOCP and EUS RPM repos...'
+ '[' -n '' ']'
+ dnf config-manager --set-enabled rhel-8-for-x86_64-baseos-eus-rpms
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
+ dnf makecache --releasever=8.10
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Red Hat Enterprise Linux 8 for x86_64 - BaseOS  2.0  B/s |  10  B     00:05
Errors during downloading metadata for repository 'rhel-8-for-x86_64-baseos-eus-rpms':
  - Status code: 404 for https://cdn.redhat.com/content/eus/rhel8/8.10/x86_64/baseos/os/repodata/repomd.xml (IP: 114.108.188.251)
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-eus-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
+ dnf config-manager --set-disabled rhel-8-for-x86_64-baseos-eus-rpms
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Installing Linux kernel headers...
+ echo 'Installing Linux kernel headers...'
+ dnf -q -y --releasever=8.10 install kernel-headers-4.18.0-553.5.1.el8_10.x86_64 kernel-devel-4.18.0-553.5.1.el8_10.x86_64

However, the log differs slightly from pod to pod. Sometimes a pod shows:

<snip>
+ cd /usr/src/nvidia-535.129.03/kernel
+ proc_mount_arg='--proc-mount-point /lib/modules/4.18.0-553.5.1.el8_10.x86_64/proc'
++ ls -d -1 'precompiled/**'
+ return 0
+ _update_package_cache
+ '[' '' '!=' builtin ']'
+ echo 'Updating the package cache...'
+ yum -q makecache
Updating the package cache...
++ echo 'Caught signal'
++ exit 1
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
+ local nvidia_peermem_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
Caught signal
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ '[' -f /var/run/nvidia-fabricmanager/nv-fabricmanager.pid ']'
Stopping NVIDIA persistence daemon...
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ '[' -f /sys/module/nvidia/refcnt ']'
+ '[' -f /sys/module/nvidia_peermem/refcnt ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ '[' 0 -gt 0 ']'
+ return 0
+ _unmount_rootfs
+ echo 'Unmounting NVIDIA driver rootfs...'
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
+ findmnt -r -o TARGET
+ grep /run/nvidia/driver
+ rm -f /run/nvidia/nvidia-driver.pid /run/kernel/postinst.d/update-nvidia-driver
+ return 0

Sometimes the Caught signal message appears at a different point, after the output of other commands.
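
The Caught signal line is printed by the trap that the driver script installs for HUP, INT, QUIT, PIPE and TERM (visible in the trace above), so it marks the moment the container was signaled rather than a failure of the step that was running. The log does not say which signal arrived; SIGTERM sent during pod termination is an assumption. To compare pods, it may help to see which traced command was the last one before the trap fired. The following is a minimal sketch, not part of the GPU Operator: the function name `last_step_before_signal` is hypothetical, and it assumes the log is bash xtrace output in which trace lines start with one or more `+` characters.

```python
import re

# The trap body as it appears in the xtrace output when the trap actually fires.
TRAP_FIRED = re.compile(r"^\++ echo 'Caught signal'$")
# Any xtrace line: one or more '+' characters, a space, then the command.
TRACE_LINE = re.compile(r"^\++ (.*)$")


def last_step_before_signal(log_text):
    """Return the last traced command before the trap fired, or None if it never fired."""
    last_step = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if TRAP_FIRED.match(stripped):
            return last_step
        match = TRACE_LINE.match(stripped)
        if match:
            last_step = match.group(1)
    return None


# Excerpt copied from the second log in this report.
sample = """\
+ echo 'Updating the package cache...'
+ yum -q makecache
Updating the package cache...
++ echo 'Caught signal'
++ exit 1
+ _shutdown
"""

print(last_step_before_signal(sample))
```

Saving each pod's log to a file (for example with `kubectl logs ... > pod.log`) and passing the file contents to this function would show whether the interruption always hits the same step or lands at a different step each time, as the logs above suggest.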

Hokwang commented 3 months ago

I have been watching this daemonset for a long time:

# kubectl -n gpu-operator get pod -l app.kubernetes.io/component=nvidia-driver --field-selector spec.nodeName=gn10 -w
NAME                            READY   STATUS            RESTARTS   AGE
nvidia-driver-daemonset-jgnzk   0/1     Init:0/1          0          14s
nvidia-driver-daemonset-jgnzk   0/1     PodInitializing   0          35s
nvidia-driver-daemonset-jgnzk   0/1     Running           0          36s
nvidia-driver-daemonset-jgnzk   0/1     Terminating       0          81s
nvidia-driver-daemonset-jgnzk   0/1     Terminating       0          91s
nvidia-driver-daemonset-bd5g7   0/1     Pending           0          0s
nvidia-driver-daemonset-bd5g7   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-jgnzk   0/1     Terminating       0          92s
nvidia-driver-daemonset-jgnzk   0/1     Terminating       0          92s
nvidia-driver-daemonset-jgnzk   0/1     Terminating       0          92s
nvidia-driver-daemonset-bd5g7   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-bd5g7   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-bd5g7   0/1     Running           0          35s
nvidia-driver-daemonset-bd5g7   0/1     Terminating       0          86s
nvidia-driver-daemonset-bd5g7   0/1     Terminating       0          92s
nvidia-driver-daemonset-7hnkw   0/1     Pending           0          0s
nvidia-driver-daemonset-7hnkw   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-bd5g7   0/1     Terminating       0          93s
nvidia-driver-daemonset-bd5g7   0/1     Terminating       0          93s
nvidia-driver-daemonset-bd5g7   0/1     Terminating       0          93s
nvidia-driver-daemonset-7hnkw   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-7hnkw   0/1     Terminating       0          22s
nvidia-driver-daemonset-7hnkw   0/1     Terminating       0          22s
nvidia-driver-daemonset-7hnkw   0/1     Terminating       0          23s
nvidia-driver-daemonset-7hnkw   0/1     Terminating       0          23s
nvidia-driver-daemonset-7hnkw   0/1     Terminating       0          23s
nvidia-driver-daemonset-7hnkw   0/1     Terminating       0          23s
nvidia-driver-daemonset-nn2bd   0/1     Pending           0          0s
nvidia-driver-daemonset-nn2bd   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-nn2bd   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-nn2bd   0/1     PodInitializing   0          11s
nvidia-driver-daemonset-nn2bd   0/1     Running           0          12s
nvidia-driver-daemonset-nn2bd   0/1     Terminating       0          33s
nvidia-driver-daemonset-nn2bd   0/1     Terminating       0          63s
nvidia-driver-daemonset-whthp   0/1     Pending           0          0s
nvidia-driver-daemonset-whthp   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-nn2bd   0/1     Terminating       0          64s
nvidia-driver-daemonset-nn2bd   0/1     Terminating       0          64s
nvidia-driver-daemonset-nn2bd   0/1     Terminating       0          64s
nvidia-driver-daemonset-whthp   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-whthp   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-whthp   0/1     Running           0          35s
nvidia-driver-daemonset-whthp   0/1     Terminating       0          36s
nvidia-driver-daemonset-whthp   0/1     Terminating       0          37s
nvidia-driver-daemonset-xwpjv   0/1     Pending           0          0s
nvidia-driver-daemonset-xwpjv   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-whthp   0/1     Terminating       0          37s
nvidia-driver-daemonset-whthp   0/1     Terminating       0          37s
nvidia-driver-daemonset-whthp   0/1     Terminating       0          37s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          1s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          1s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          2s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          2s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          2s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          2s
nvidia-driver-daemonset-xwpjv   0/1     Terminating       0          2s
nvidia-driver-daemonset-qtvvz   0/1     Pending           0          0s
nvidia-driver-daemonset-qtvvz   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-qtvvz   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-qtvvz   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-qtvvz   0/1     Running           0          35s
nvidia-driver-daemonset-qtvvz   0/1     Terminating       0          98s
nvidia-driver-daemonset-qtvvz   0/1     Terminating       0          2m9s
nvidia-driver-daemonset-2wgn4   0/1     Pending           0          0s
nvidia-driver-daemonset-2wgn4   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-qtvvz   0/1     Terminating       0          2m9s
nvidia-driver-daemonset-qtvvz   0/1     Terminating       0          2m9s
nvidia-driver-daemonset-qtvvz   0/1     Terminating       0          2m9s
nvidia-driver-daemonset-2wgn4   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-2wgn4   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-2wgn4   0/1     Running           0          35s
nvidia-driver-daemonset-2wgn4   0/1     Terminating       0          37s
nvidia-driver-daemonset-2wgn4   0/1     Terminating       0          38s
nvidia-driver-daemonset-zw46q   0/1     Pending           0          0s
nvidia-driver-daemonset-zw46q   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-2wgn4   0/1     Terminating       0          39s
nvidia-driver-daemonset-2wgn4   0/1     Terminating       0          39s
nvidia-driver-daemonset-2wgn4   0/1     Terminating       0          39s
nvidia-driver-daemonset-zw46q   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-zw46q   0/1     Terminating       0          26s
nvidia-driver-daemonset-zw46q   0/1     Terminating       0          27s
nvidia-driver-daemonset-zw46q   0/1     Terminating       0          27s
nvidia-driver-daemonset-zw46q   0/1     Terminating       0          27s
nvidia-driver-daemonset-zw46q   0/1     Terminating       0          27s
nvidia-driver-daemonset-zw46q   0/1     Terminating       0          27s
nvidia-driver-daemonset-j5l4x   0/1     Pending           0          0s
nvidia-driver-daemonset-j5l4x   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          0s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          2s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          2s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          3s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          3s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          3s
nvidia-driver-daemonset-j5l4x   0/1     Terminating       0          3s
nvidia-driver-daemonset-l4qfh   0/1     Pending           0          0s
nvidia-driver-daemonset-l4qfh   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-l4qfh   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-l4qfh   0/1     PodInitializing   0          4s
nvidia-driver-daemonset-l4qfh   0/1     Running           0          5s
nvidia-driver-daemonset-l4qfh   0/1     Terminating       0          11s
nvidia-driver-daemonset-l4qfh   0/1     Terminating       0          19s
nvidia-driver-daemonset-jcgcr   0/1     Pending           0          0s
nvidia-driver-daemonset-jcgcr   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-l4qfh   0/1     Terminating       0          20s
nvidia-driver-daemonset-l4qfh   0/1     Terminating       0          20s
nvidia-driver-daemonset-l4qfh   0/1     Terminating       0          20s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          1s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          2s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          2s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          3s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          3s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          3s
nvidia-driver-daemonset-jcgcr   0/1     Terminating       0          3s
nvidia-driver-daemonset-mnst5   0/1     Pending           0          0s
nvidia-driver-daemonset-mnst5   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          0s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          1s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          1s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          2s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          2s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          2s
nvidia-driver-daemonset-mnst5   0/1     Terminating       0          2s
nvidia-driver-daemonset-4k7wb   0/1     Pending           0          0s
nvidia-driver-daemonset-4k7wb   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4k7wb   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-4k7wb   0/1     Terminating       0          16s
nvidia-driver-daemonset-4k7wb   0/1     Terminating       0          16s
nvidia-driver-daemonset-4k7wb   0/1     Terminating       0          17s
nvidia-driver-daemonset-4k7wb   0/1     Terminating       0          17s
nvidia-driver-daemonset-4k7wb   0/1     Terminating       0          17s
nvidia-driver-daemonset-4k7wb   0/1     Terminating       0          17s
nvidia-driver-daemonset-4rmhj   0/1     Pending           0          0s
nvidia-driver-daemonset-4rmhj   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4rmhj   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-4rmhj   0/1     PodInitializing   0          17s
nvidia-driver-daemonset-4rmhj   0/1     Terminating       0          17s
nvidia-driver-daemonset-4rmhj   0/1     Terminating       0          18s
nvidia-driver-daemonset-4rmhj   0/1     Terminating       0          18s
nvidia-driver-daemonset-s7j8x   0/1     Pending           0          0s
nvidia-driver-daemonset-s7j8x   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4rmhj   0/1     Terminating       0          19s
nvidia-driver-daemonset-4rmhj   0/1     Terminating       0          19s
nvidia-driver-daemonset-4rmhj   0/1     Terminating       0          19s
nvidia-driver-daemonset-s7j8x   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-s7j8x   0/1     Terminating       0          4s
nvidia-driver-daemonset-s7j8x   0/1     Terminating       0          4s
nvidia-driver-daemonset-s7j8x   0/1     Terminating       0          5s
nvidia-driver-daemonset-s7j8x   0/1     Terminating       0          5s
nvidia-driver-daemonset-s7j8x   0/1     Terminating       0          5s
nvidia-driver-daemonset-s7j8x   0/1     Terminating       0          5s
nvidia-driver-daemonset-5hrpq   0/1     Pending           0          0s
nvidia-driver-daemonset-5hrpq   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-5hrpq   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-5hrpq   0/1     PodInitializing   0          30s
nvidia-driver-daemonset-5hrpq   0/1     Running           0          31s
nvidia-driver-daemonset-5hrpq   0/1     Terminating       0          105s
nvidia-driver-daemonset-5hrpq   0/1     Terminating       0          2m16s
nvidia-driver-daemonset-5xmh8   0/1     Pending           0          0s
nvidia-driver-daemonset-5xmh8   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-5hrpq   0/1     Terminating       0          2m16s
nvidia-driver-daemonset-5hrpq   0/1     Terminating       0          2m16s
nvidia-driver-daemonset-5hrpq   0/1     Terminating       0          2m16s
nvidia-driver-daemonset-5xmh8   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-5xmh8   0/1     Terminating       0          25s
nvidia-driver-daemonset-5xmh8   0/1     Terminating       0          25s
nvidia-driver-daemonset-5xmh8   0/1     Terminating       0          25s
nvidia-driver-daemonset-5xmh8   0/1     Terminating       0          26s
nvidia-driver-daemonset-5xmh8   0/1     Terminating       0          26s
nvidia-driver-daemonset-5xmh8   0/1     Terminating       0          26s
nvidia-driver-daemonset-4z46z   0/1     Pending           0          0s
nvidia-driver-daemonset-4z46z   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4z46z   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-4z46z   0/1     Terminating       0          7s
nvidia-driver-daemonset-4z46z   0/1     Terminating       0          7s
nvidia-driver-daemonset-4z46z   0/1     Terminating       0          7s
nvidia-driver-daemonset-4z46z   0/1     Terminating       0          7s
nvidia-driver-daemonset-4z46z   0/1     Terminating       0          7s
nvidia-driver-daemonset-4z46z   0/1     Terminating       0          7s
nvidia-driver-daemonset-hmvhx   0/1     Pending           0          0s
nvidia-driver-daemonset-hmvhx   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-hmvhx   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-hmvhx   0/1     PodInitializing   0          3s
nvidia-driver-daemonset-hmvhx   0/1     Running           0          4s
nvidia-driver-daemonset-hmvhx   0/1     Terminating       0          93s
nvidia-driver-daemonset-hmvhx   0/1     Terminating       0          2m5s
nvidia-driver-daemonset-m5jt6   0/1     Pending           0          0s
nvidia-driver-daemonset-m5jt6   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-hmvhx   0/1     Terminating       0          2m6s
nvidia-driver-daemonset-hmvhx   0/1     Terminating       0          2m6s
nvidia-driver-daemonset-hmvhx   0/1     Terminating       0          2m6s
nvidia-driver-daemonset-m5jt6   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-m5jt6   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-m5jt6   0/1     Running           0          35s
nvidia-driver-daemonset-m5jt6   0/1     Terminating       0          58s
nvidia-driver-daemonset-m5jt6   0/1     Terminating       0          90s
nvidia-driver-daemonset-5z7qw   0/1     Pending           0          0s
nvidia-driver-daemonset-5z7qw   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-m5jt6   0/1     Terminating       0          90s
nvidia-driver-daemonset-m5jt6   0/1     Terminating       0          90s
nvidia-driver-daemonset-m5jt6   0/1     Terminating       0          90s
nvidia-driver-daemonset-5z7qw   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-5z7qw   0/1     Terminating       0          14s
nvidia-driver-daemonset-5z7qw   0/1     Terminating       0          14s
nvidia-driver-daemonset-5z7qw   0/1     Terminating       0          15s
nvidia-driver-daemonset-5z7qw   0/1     Terminating       0          15s
nvidia-driver-daemonset-5z7qw   0/1     Terminating       0          15s
nvidia-driver-daemonset-5z7qw   0/1     Terminating       0          15s
nvidia-driver-daemonset-z4756   0/1     Pending           0          0s
nvidia-driver-daemonset-z4756   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-z4756   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-z4756   0/1     PodInitializing   0          19s
nvidia-driver-daemonset-z4756   0/1     Running           0          20s
nvidia-driver-daemonset-z4756   0/1     Terminating       0          2m25s
nvidia-driver-daemonset-z4756   0/1     Terminating       0          2m55s
nvidia-driver-daemonset-5789r   0/1     Pending           0          0s
nvidia-driver-daemonset-5789r   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-z4756   0/1     Terminating       0          2m56s
nvidia-driver-daemonset-z4756   0/1     Terminating       0          2m56s
nvidia-driver-daemonset-z4756   0/1     Terminating       0          2m56s
nvidia-driver-daemonset-5789r   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-5789r   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-5789r   0/1     Running           0          35s
nvidia-driver-daemonset-5789r   0/1     Terminating       0          93s
nvidia-driver-daemonset-5789r   0/1     Terminating       0          2m4s
nvidia-driver-daemonset-sxwhc   0/1     Pending           0          0s
nvidia-driver-daemonset-sxwhc   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-5789r   0/1     Terminating       0          2m5s
nvidia-driver-daemonset-5789r   0/1     Terminating       0          2m5s
nvidia-driver-daemonset-5789r   0/1     Terminating       0          2m5s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          1s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          2s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          2s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          3s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          3s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          3s
nvidia-driver-daemonset-sxwhc   0/1     Terminating       0          3s
nvidia-driver-daemonset-6z9lz   0/1     Pending           0          0s
nvidia-driver-daemonset-6z9lz   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-6z9lz   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-6z9lz   0/1     Terminating       0          2s
nvidia-driver-daemonset-6z9lz   0/1     Terminating       0          4s
nvidia-driver-daemonset-6z9lz   0/1     Terminating       0          5s
nvidia-driver-daemonset-6z9lz   0/1     Terminating       0          5s
nvidia-driver-daemonset-6z9lz   0/1     Terminating       0          5s
nvidia-driver-daemonset-6z9lz   0/1     Terminating       0          5s
nvidia-driver-daemonset-btppt   0/1     Pending           0          0s
nvidia-driver-daemonset-btppt   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-btppt   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-btppt   0/1     PodInitializing   0          29s
nvidia-driver-daemonset-btppt   0/1     Running           0          30s
nvidia-driver-daemonset-btppt   0/1     Terminating       0          2m43s
nvidia-driver-daemonset-btppt   0/1     Terminating       0          3m15s
nvidia-driver-daemonset-s7z9t   0/1     Pending           0          0s
nvidia-driver-daemonset-s7z9t   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-btppt   0/1     Terminating       0          3m16s
nvidia-driver-daemonset-btppt   0/1     Terminating       0          3m16s
nvidia-driver-daemonset-btppt   0/1     Terminating       0          3m16s
nvidia-driver-daemonset-s7z9t   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-s7z9t   0/1     Terminating       0          33s
nvidia-driver-daemonset-s7z9t   0/1     Terminating       0          34s
nvidia-driver-daemonset-s7z9t   0/1     Terminating       0          34s
nvidia-driver-daemonset-s7z9t   0/1     Terminating       0          34s
nvidia-driver-daemonset-s7z9t   0/1     Terminating       0          34s
nvidia-driver-daemonset-s7z9t   0/1     Terminating       0          34s
nvidia-driver-daemonset-5sccm   0/1     Pending           0          0s
nvidia-driver-daemonset-5sccm   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-5sccm   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-5sccm   0/1     PodInitializing   0          3s
nvidia-driver-daemonset-5sccm   0/1     Terminating       0          4s
nvidia-driver-daemonset-5sccm   0/1     Terminating       0          4s
nvidia-driver-daemonset-5sccm   0/1     Terminating       0          5s
nvidia-driver-daemonset-sdc7k   0/1     Pending           0          0s
nvidia-driver-daemonset-sdc7k   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-5sccm   0/1     Terminating       0          6s
nvidia-driver-daemonset-5sccm   0/1     Terminating       0          6s
nvidia-driver-daemonset-5sccm   0/1     Terminating       0          6s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          1s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          2s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          2s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          3s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          3s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          3s
nvidia-driver-daemonset-sdc7k   0/1     Terminating       0          3s
nvidia-driver-daemonset-fp5l2   0/1     Pending           0          0s
nvidia-driver-daemonset-fp5l2   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-fp5l2   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-fp5l2   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-fp5l2   0/1     Running           0          35s
nvidia-driver-daemonset-fp5l2   0/1     Terminating       0          70s
nvidia-driver-daemonset-fp5l2   0/1     Terminating       0          90s
nvidia-driver-daemonset-dwcn5   0/1     Pending           0          0s
nvidia-driver-daemonset-dwcn5   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-fp5l2   0/1     Terminating       0          90s
nvidia-driver-daemonset-fp5l2   0/1     Terminating       0          90s
nvidia-driver-daemonset-fp5l2   0/1     Terminating       0          90s
nvidia-driver-daemonset-dwcn5   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-dwcn5   0/1     Terminating       0          3s
nvidia-driver-daemonset-dwcn5   0/1     Terminating       0          3s
nvidia-driver-daemonset-dwcn5   0/1     Terminating       0          4s
nvidia-driver-daemonset-dwcn5   0/1     Terminating       0          4s
nvidia-driver-daemonset-dwcn5   0/1     Terminating       0          4s
nvidia-driver-daemonset-dwcn5   0/1     Terminating       0          4s
nvidia-driver-daemonset-x9wr6   0/1     Pending           0          0s
nvidia-driver-daemonset-x9wr6   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-x9wr6   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-x9wr6   0/1     PodInitializing   0          29s
nvidia-driver-daemonset-x9wr6   0/1     Running           0          30s
nvidia-driver-daemonset-x9wr6   0/1     Terminating       0          3m4s
nvidia-driver-daemonset-x9wr6   0/1     Terminating       0          3m30s
nvidia-driver-daemonset-b44rq   0/1     Pending           0          0s
nvidia-driver-daemonset-b44rq   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-x9wr6   0/1     Terminating       0          3m30s
nvidia-driver-daemonset-x9wr6   0/1     Terminating       0          3m30s
nvidia-driver-daemonset-x9wr6   0/1     Terminating       0          3m30s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          0s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          1s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          2s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          2s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          2s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          2s
nvidia-driver-daemonset-b44rq   0/1     Terminating       0          2s
nvidia-driver-daemonset-9sgg2   0/1     Pending           0          0s
nvidia-driver-daemonset-9sgg2   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-9sgg2   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-9sgg2   0/1     PodInitializing   0          34s
nvidia-driver-daemonset-9sgg2   0/1     Running           0          35s
nvidia-driver-daemonset-9sgg2   0/1     Terminating       0          84s
nvidia-driver-daemonset-9sgg2   0/1     Terminating       0          90s
nvidia-driver-daemonset-d476x   0/1     Pending           0          0s
nvidia-driver-daemonset-d476x   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-9sgg2   0/1     Terminating       0          91s
nvidia-driver-daemonset-9sgg2   0/1     Terminating       0          91s
nvidia-driver-daemonset-9sgg2   0/1     Terminating       0          91s
nvidia-driver-daemonset-d476x   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-d476x   0/1     Terminating       0          11s
nvidia-driver-daemonset-d476x   0/1     Terminating       0          11s
nvidia-driver-daemonset-d476x   0/1     Terminating       0          11s
nvidia-driver-daemonset-d476x   0/1     Terminating       0          11s
nvidia-driver-daemonset-d476x   0/1     Terminating       0          11s
nvidia-driver-daemonset-d476x   0/1     Terminating       0          11s
nvidia-driver-daemonset-4kwk4   0/1     Pending           0          0s
nvidia-driver-daemonset-4kwk4   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          0s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          1s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          1s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          2s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          2s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          2s
nvidia-driver-daemonset-4kwk4   0/1     Terminating       0          2s
nvidia-driver-daemonset-lrv77   0/1     Pending           0          0s
nvidia-driver-daemonset-lrv77   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-lrv77   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-lrv77   0/1     PodInitializing   0          21s
nvidia-driver-daemonset-lrv77   0/1     Running           0          22s
nvidia-driver-daemonset-lrv77   0/1     Terminating       0          54s
nvidia-driver-daemonset-lrv77   0/1     Terminating       0          78s
nvidia-driver-daemonset-klrb7   0/1     Pending           0          0s
nvidia-driver-daemonset-klrb7   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-lrv77   0/1     Terminating       0          78s
nvidia-driver-daemonset-lrv77   0/1     Terminating       0          78s
nvidia-driver-daemonset-lrv77   0/1     Terminating       0          78s
nvidia-driver-daemonset-klrb7   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-klrb7   0/1     Terminating       0          4s
nvidia-driver-daemonset-klrb7   0/1     Terminating       0          4s
nvidia-driver-daemonset-klrb7   0/1     Terminating       0          4s
nvidia-driver-daemonset-klrb7   0/1     Terminating       0          5s
nvidia-driver-daemonset-klrb7   0/1     Terminating       0          5s
nvidia-driver-daemonset-klrb7   0/1     Terminating       0          5s
nvidia-driver-daemonset-sc7td   0/1     Pending           0          0s
nvidia-driver-daemonset-sc7td   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          0s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          1s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          2s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          2s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          2s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          2s
nvidia-driver-daemonset-sc7td   0/1     Terminating       0          2s
nvidia-driver-daemonset-gzqdr   0/1     Pending           0          0s
nvidia-driver-daemonset-gzqdr   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-gzqdr   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-gzqdr   0/1     Terminating       0          24s
nvidia-driver-daemonset-gzqdr   0/1     Terminating       0          25s
nvidia-driver-daemonset-gzqdr   0/1     Terminating       0          25s
nvidia-driver-daemonset-gzqdr   0/1     Terminating       0          25s
nvidia-driver-daemonset-gzqdr   0/1     Terminating       0          25s
nvidia-driver-daemonset-gzqdr   0/1     Terminating       0          25s
nvidia-driver-daemonset-sjbjk   0/1     Pending           0          0s
nvidia-driver-daemonset-sjbjk   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-sjbjk   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-sjbjk   0/1     PodInitializing   0          3s
nvidia-driver-daemonset-sjbjk   0/1     Running           0          4s
nvidia-driver-daemonset-sjbjk   0/1     Terminating       0          33s
nvidia-driver-daemonset-sjbjk   0/1     Terminating       0          59s
nvidia-driver-daemonset-g9jgm   0/1     Pending           0          0s
nvidia-driver-daemonset-g9jgm   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-sjbjk   0/1     Terminating       0          59s
nvidia-driver-daemonset-sjbjk   0/1     Terminating       0          59s
nvidia-driver-daemonset-sjbjk   0/1     Terminating       0          59s
nvidia-driver-daemonset-g9jgm   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-g9jgm   0/1     Terminating       0          34s
nvidia-driver-daemonset-g9jgm   0/1     Terminating       0          35s
nvidia-driver-daemonset-g9jgm   0/1     Terminating       0          35s
nvidia-driver-daemonset-g9jgm   0/1     Terminating       0          35s
nvidia-driver-daemonset-g9jgm   0/1     Terminating       0          36s
nvidia-driver-daemonset-g9jgm   0/1     Terminating       0          36s
nvidia-driver-daemonset-bh4rj   0/1     Pending           0          0s
nvidia-driver-daemonset-bh4rj   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-bh4rj   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-bh4rj   0/1     Terminating       0          2s
nvidia-driver-daemonset-bh4rj   0/1     Terminating       0          4s
nvidia-driver-daemonset-bh4rj   0/1     Terminating       0          4s
nvidia-driver-daemonset-bh4rj   0/1     Terminating       0          5s
nvidia-driver-daemonset-bh4rj   0/1     Terminating       0          5s
nvidia-driver-daemonset-bh4rj   0/1     Terminating       0          5s
nvidia-driver-daemonset-4p8zh   0/1     Pending           0          0s
nvidia-driver-daemonset-4p8zh   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4p8zh   0/1     Init:0/1          0          1s
nvidia-driver-daemonset-4p8zh   0/1     Terminating       0          18s
nvidia-driver-daemonset-4p8zh   0/1     Terminating       0          18s
nvidia-driver-daemonset-4p8zh   0/1     Terminating       0          19s
nvidia-driver-daemonset-4p8zh   0/1     Terminating       0          19s
nvidia-driver-daemonset-4p8zh   0/1     Terminating       0          19s
nvidia-driver-daemonset-4p8zh   0/1     Terminating       0          19s
nvidia-driver-daemonset-4sxd5   0/1     Pending           0          0s
nvidia-driver-daemonset-4sxd5   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4sxd5   0/1     Init:0/1          0          2s
nvidia-driver-daemonset-4sxd5   0/1     PodInitializing   0          10s
nvidia-driver-daemonset-4sxd5   0/1     Running           0          11s
nvidia-driver-daemonset-4sxd5   0/1     Terminating       0          110s
nvidia-driver-daemonset-4sxd5   0/1     Terminating       0          2m23s
nvidia-driver-daemonset-4sxd5   0/1     Terminating       0          2m23s
nvidia-driver-daemonset-k69dh   0/1     Pending           0          0s
nvidia-driver-daemonset-k69dh   0/1     Init:0/1          0          0s
nvidia-driver-daemonset-4sxd5   0/1     Terminating       0          2m23s
nvidia-driver-daemonset-4sxd5   0/1     Terminating       0          2m23s
nvidia-driver-daemonset-k69dh   0/1     Init:0/1          0          2s

cdesiniotis commented 2 months ago

From the driver container logs, we see the following error message:

Error: Failed to download metadata for repo 'rhel-8-for-x86_64-baseos-eus-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

Can you confirm your cluster networking is healthy?

Hokwang commented 2 months ago

@cdesiniotis Hi, yes. Everything is fine except for that *-eus-* rpm repo. Is it needed? I also don't know why the server cannot download metadata for the -eus- repo only; the URL is the same as for the other rpms.

aplufr commented 2 months ago

Hello,

One of our nodes ran into the same issue during a driver upgrade. The upgrade went well for 5 servers, but on one server the nvidia-driver-daemonset pods are restarting too fast for me to catch the logs.
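When the pods are replaced this quickly, one option is to poll for whatever driver pod currently exists on the node and save its logs to a file. The following is only a sketch: it assumes the `gpu-operator` namespace and the `app.kubernetes.io/component=nvidia-driver` selector used earlier in this thread, and the function name `capture_driver_logs` and the `./driver-logs` directory are made up for illustration.

```shell
# Save the logs of every driver pod currently scheduled on one node.
# Namespace and label selector are taken from the commands earlier in this
# thread; the function name and output directory are arbitrary choices.
capture_driver_logs() {
  node="$1"
  outdir="${2:-./driver-logs}"
  mkdir -p "$outdir"
  for pod in $(kubectl -n gpu-operator get pod \
      -l app.kubernetes.io/component=nvidia-driver \
      --field-selector "spec.nodeName=$node" -o name); do
    file="$outdir/$(basename "$pod").log"
    # Keep the file only if kubectl could actually read the logs.
    if kubectl -n gpu-operator logs "$pod" --all-containers --timestamps \
        > "$file.tmp" 2>/dev/null; then
      mv "$file.tmp" "$file"
    else
      rm -f "$file.tmp"
    fi
  done
}
```

Running it in a loop, for example `while true; do capture_driver_logs gn10; sleep 1; done`, keeps the most recent readable log per pod name, so the output of short-lived pods is not lost when they are deleted.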

Judging by the node labels described here https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/24.3.0/_images/upgrade-controller-state-machine.png, it looks like we are stuck in "pod-restart-required". All nodes have the correct version of the driver deployed, but they are still showing "upgrade-required", and the node currently being processed by the driver ds restarter is flagged as "pod-restart-required".
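For anyone who wants to check the same thing on their cluster, the upgrade state can be listed per node. This is a sketch that assumes the state is stored in the `nvidia.com/gpu-driver-upgrade-state` node label, as in the GPU Operator driver upgrade documentation; the label key is not quoted in this thread, and the function name is hypothetical.

```shell
# Print every node with its driver upgrade state as an extra column.
# Assumption: the upgrade controller records the state in the node label
# nvidia.com/gpu-driver-upgrade-state (per the GPU Operator docs).
show_driver_upgrade_state() {
  kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state
}
```

A node stuck in the loop described above would be expected to show `pod-restart-required` in that column, while the others show `upgrade-required`.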

We didn't have any issue with "535.161.08", but "535.183.06" does not work as expected (operator 24.3.0).

I've disabled the auto upgrade flags in the helm config. The driver is still deployed, but at least the nodes are running okay (another way is to label the node with nvidia.com/gpu-driver-upgrade.skip=true).
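The label-based workaround could look roughly like this. It is a minimal sketch: the label key comes from the comment above, while the function names and the node name `gn10` (borrowed from the original report) are only for illustration.

```shell
# Exclude one node from the driver upgrade controller by setting the
# skip label mentioned above.
skip_driver_upgrade() {
  kubectl label node "$1" nvidia.com/gpu-driver-upgrade.skip=true --overwrite
}

# Remove the label again (a trailing "-" deletes a label in kubectl).
unskip_driver_upgrade() {
  kubectl label node "$1" nvidia.com/gpu-driver-upgrade.skip-
}
```

For example, `skip_driver_upgrade gn10` marks the node to be skipped, and `unskip_driver_upgrade gn10` hands it back to the upgrade controller once the underlying problem is fixed.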