Bec-k opened this issue 1 year ago
Using docker image: nvcr.io/nvidia/driver:515.65.01-ubuntu22.04
And nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.4.2
Using operator gpu-operator-v22.9.0
Upgraded gpu-operator to the latest version v22.9.2 and am waiting for the driver to compile.
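For reference, the upgrade was done with Helm along these lines (release name, repo alias, and namespace here are assumptions; adjust to your install):
helm repo update
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version=v22.9.2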
nvidia-driver-daemonset-tgncp nvidia-driver-ctr ========== NVIDIA Software Installer ==========
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Checking NVIDIA driver packages...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Updating the package cache...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Resolving Linux kernel version...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Proceeding with Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel headers...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel module files...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Generating Linux kernel version string...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Compiling NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-tgncp nvidia-driver-ctr The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-tgncp nvidia-driver-ctr You are using: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Shouldn't it be installable without compiling?
Compilation is kind of overwhelming for smaller GPU nodes:
Swap is used, and around 12 GB of memory is consumed during the compile. That's a lot; should it be like that? I know this isn't gpu-operator project code, just asking around, but it looks like something is off there.
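A minimal way to watch memory and swap pressure on the node while the driver kernel modules are compiling (node name is a placeholder; kubectl top needs metrics-server):
# on the GPU node, while nvidia-driver-ctr is compiling
watch -n 5 free -h
# or from the cluster
kubectl top node <gpu-node-name>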
It is failing as well; here is the full log of the driver manager and driver installer:
+ nvidia-driver-daemonset-tgncp › nvidia-driver-ctr
+ nvidia-driver-daemonset-tgncp › k8s-driver-manager
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm=true'
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.mig-manager='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
nvidia-driver-daemonset-tgncp k8s-driver-manager Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of 'nodeType='
nvidia-driver-daemonset-tgncp k8s-driver-manager Current value of AUTO_UPGRADE_POLICY_ENABLED='
nvidia-driver-daemonset-tgncp k8s-driver-manager Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 labeled
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager pod/nvidia-operator-validator-jwq5h condition met
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for the container-toolkit to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager pod/nvidia-container-toolkit-daemonset-g4hd6 condition met
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for the device-plugin to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for gpu-feature-discovery to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for dcgm-exporter to shutdown
nvidia-driver-daemonset-tgncp k8s-driver-manager Waiting for dcgm to shutdown
nvidia-driver-daemonset-tgncp nvidia-driver-ctr DRIVER_ARCH is x86_64
nvidia-driver-daemonset-tgncp k8s-driver-manager Auto upgrade policy of the GPU driver on the node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 is disabled
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Creating directory NVIDIA-Linux-x86_64-525.60.13
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Verifying archive integrity... OK
nvidia-driver-daemonset-tgncp k8s-driver-manager Cordoning node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7...
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 cordoned
nvidia-driver-daemonset-tgncp k8s-driver-manager Draining node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 of any GPU pods...
nvidia-driver-daemonset-tgncp k8s-driver-manager W0321 11:51:52.622197 20244 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
nvidia-driver-daemonset-tgncp k8s-driver-manager time="2023-03-21T11:51:52Z" level=info msg="Identifying GPU pods to delete"
nvidia-driver-daemonset-tgncp k8s-driver-manager time="2023-03-21T11:51:52Z" level=info msg="No GPU pods to delete. Exiting."
nvidia-driver-daemonset-tgncp k8s-driver-manager unbinding device 0000:04:00.0
nvidia-driver-daemonset-tgncp k8s-driver-manager Auto upgrade policy of the GPU driver on the node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 is disabled
nvidia-driver-daemonset-tgncp k8s-driver-manager Uncordoning node scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7...
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 uncordoned
nvidia-driver-daemonset-tgncp k8s-driver-manager Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
nvidia-driver-daemonset-tgncp k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-2ffab7 labeled
- nvidia-driver-daemonset-tgncp › k8s-driver-manager
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.60.13...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation. Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr ========== NVIDIA Software Installer ==========
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Starting installation of NVIDIA driver version 525.60.13 for Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Checking NVIDIA driver packages...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Updating the package cache...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Resolving Linux kernel version...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Proceeding with Linux kernel version 5.15.0-67-generic
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel headers...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Installing Linux kernel module files...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Generating Linux kernel version string...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Compiling NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr warning: the compiler differs from the one used to build the kernel
nvidia-driver-daemonset-tgncp nvidia-driver-ctr The kernel was built by: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-tgncp nvidia-driver-ctr You are using: cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 445 | int status = 0;
nvidia-driver-daemonset-tgncp nvidia-driver-ctr | ^~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia/nv.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** Waiting for unfinished jobs....
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hopper_mmu.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hopper_host.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia/os-registry.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia/nv-mmap.c: In function 'nv_encode_caching':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia/nv-mmap.c:353:16: warning: this statement may fall through [-Wimplicit-fallthrough=]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 353 | if (NV_ALLOW_CACHING(memory_type))
nvidia-driver-daemonset-tgncp nvidia-driver-ctr | ^
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia/nv-mmap.c:356:9: note: here
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 356 | default:
nvidia-driver-daemonset-tgncp nvidia-driver-ctr | ^~~~~~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia/nv-caps.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_tools.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_procfs.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hopper_ce.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:298: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_hal.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_va_range.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_va_space.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr cc: fatal error: Killed signal terminated program cc1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr compilation terminated.
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[2]: *** [scripts/Makefile.build:297: /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel.o] Error 1
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-crtc.c: In function '__nv_drm_plane_atomic_destroy_state':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-drm/nvidia-drm-crtc.c:678:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 678 | struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-daemonset-tgncp nvidia-driver-ctr | ^~~~~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c: In function 'test_unexpected_completed_values':
nvidia-driver-daemonset-tgncp nvidia-driver-ctr /usr/src/nvidia-525.60.13/kernel/nvidia-uvm/uvm_channel_test.c:156:15: warning: unused variable 'status' [-Wunused-variable]
nvidia-driver-daemonset-tgncp nvidia-driver-ctr 156 | NV_STATUS status;
nvidia-driver-daemonset-tgncp nvidia-driver-ctr | ^~~~~~
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make[1]: *** [Makefile:1906: /usr/src/nvidia-525.60.13/kernel] Error 2
nvidia-driver-daemonset-tgncp nvidia-driver-ctr make: *** [Makefile:82: modules] Error 2
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-tgncp nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
It keeps retrying the installation and always fails.
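The repeated "cc: fatal error: Killed signal terminated program cc1" lines usually mean the compiler processes were OOM-killed. A quick, gpu-operator-independent check on the node would be:
# look for OOM kills in the kernel log on the GPU node
dmesg -T | grep -iE 'out of memory|oom-kill|killed process'
# or via journald
journalctl -k | grep -i oom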
I will try to install it on the latest non-LTS Ubuntu.
OK, interesting: with Ubuntu 22.10 it worked:
nvidia-driver-daemonset-x48wf k8s-driver-manager Tue Mar 21 12:53:42 2023
nvidia-driver-daemonset-x48wf k8s-driver-manager +-----------------------------------------------------------------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager | NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
nvidia-driver-daemonset-x48wf k8s-driver-manager |-------------------------------+----------------------+----------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
nvidia-driver-daemonset-x48wf k8s-driver-manager | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
nvidia-driver-daemonset-x48wf k8s-driver-manager | | | MIG M. |
nvidia-driver-daemonset-x48wf k8s-driver-manager |===============================+======================+======================|
nvidia-driver-daemonset-x48wf k8s-driver-manager | 0 GRID A100D-4C On | 00000000:04:00.0 Off | 0 |
nvidia-driver-daemonset-x48wf k8s-driver-manager | N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
nvidia-driver-daemonset-x48wf k8s-driver-manager | | | Disabled |
nvidia-driver-daemonset-x48wf k8s-driver-manager +-------------------------------+----------------------+----------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager
nvidia-driver-daemonset-x48wf k8s-driver-manager +-----------------------------------------------------------------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager | Processes: |
nvidia-driver-daemonset-x48wf k8s-driver-manager | GPU GI CI PID Type Process name GPU Memory |
nvidia-driver-daemonset-x48wf k8s-driver-manager | ID ID Usage |
nvidia-driver-daemonset-x48wf k8s-driver-manager |=============================================================================|
nvidia-driver-daemonset-x48wf k8s-driver-manager | No running processes found |
nvidia-driver-daemonset-x48wf k8s-driver-manager +-----------------------------------------------------------------------------+
nvidia-driver-daemonset-x48wf k8s-driver-manager NVIDIA GPU driver is already pre-installed on the node, disabling the containerized driver on the node
nvidia-driver-daemonset-x48wf k8s-driver-manager node/scw-k8s-suspicious-yona-pool-sad-lovela-67942f labeled
Even without driver installation, it was already there. I checked the Vultr logs from when the node was being created; they pre-install the drivers before handing the node over. It seems there was, or still is, some problem with other Ubuntu releases and driver versions.
There is now another problem, this time with the toolkit:
+ nvidia-container-toolkit-daemonset-zk8cn › nvidia-container-toolkit-ctr
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Starting nvidia-toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Parsing arguments"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Verifying Flags"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg=Initializing
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Removing existing NVIDIA container toolkit installation"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container.so.1': error resolving link '/usr/lib64/libnvidia-container.so.1': lstat /usr/lib64/libnvidia-container.so.1: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-container-go.so.1': error resolving link '/usr/lib64/libnvidia-container-go.so.1': lstat /usr/lib64/libnvidia-container-go.so.1: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Resolved link: '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' => '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.11.0'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Finding library libnvidia-ml.so (root=/)"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib64/libnvidia-ml.so': error resolving link '/usr/lib64/libnvidia-ml.so': lstat /usr/lib64/libnvidia-ml.so: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so': error resolving link '/usr/lib/x86_64-linux-gnu/libnvidia-ml.so': lstat /usr/lib/x86_64-linux-gnu/libnvidia-ml.so: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Checking library candidate '/usr/lib/aarch64-linux-gnu/libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Skipping library candidate '/usr/lib/aarch64-linux-gnu/libnvidia-ml.so': error resolving link '/usr/lib/aarch64-linux-gnu/libnvidia-ml.so': lstat /usr/lib/aarch64-linux-gnu: no such file or directory"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=warning msg="Error finding library path for root /: error locating NVIDIA management library: error locating library 'libnvidia-ml.so'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Using library root "
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr Using config:
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr accept-nvidia-visible-devices-as-volume-mounts = false
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr accept-nvidia-visible-devices-envvar-when-unprivileged = true
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr disable-require = false
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr [nvidia-container-cli]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr environment = []
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr ldconfig = "@/sbin/ldconfig.real"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr load-kmods = true
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr root = "/"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr [nvidia-container-runtime]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr log-level = "info"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr mode = "auto"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr runtimes = ["docker-runc", "runc"]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr [nvidia-container-runtime.modes]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr [nvidia-container-runtime.modes.csv]
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Setting up runtime"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Starting 'setup' for containerd"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully parsed arguments"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully loaded config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Config version: 2"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Updating config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully updated config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Flushing config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Successfully flushed config"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=info msg="Shutting Down"
nvidia-container-toolkit-daemonset-zk8cn nvidia-container-toolkit-ctr time="2023-03-21T12:57:11Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: error running containerd command: signal: hangup"
Logs from gpu-operator pod:
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7050278,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7056942,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yona-pool-sad-lovela-67942f","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7057347,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"scw-k8s-suspicious-yona-pool-sad-lovela-67942f"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.705751,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yona-pool-sad-lovela-e8d8cd","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.705805,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yonath-default-d556e605e276","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7060485,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"scw-k8s-suspicious-yon-pool-quizzical-j-fe9338","GpuWorkloadConfig":"container"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7060602,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7066238,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7068782,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RuntimeClass":"nvidia"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7171223,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.717339,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7325332,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7460692,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.7744899,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8070672,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8314767,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8554163,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-driver","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.870571,"logger":"controllers.ClusterPolicy","msg":"5.19.0-29-generic","Request.Namespace":"default","Request.Name":"Node"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8736198,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-driver-daemonset","Namespace":"gpu-operator","name":"nvidia-driver-daemonset"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.874162,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.8898356,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-container-toolkit","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.908736,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-container-toolkit","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9388857,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-container-toolkit","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9490485,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-container-toolkit-daemonset","Namespace":"gpu-operator","name":"nvidia-container-toolkit-daemonset"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9498417,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.9706805,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403562.991417,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.017446,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.0417135,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.0707805,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.0964117,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"gpu-operator","name":"nvidia-operator-validator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.096793,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1137688,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1315594,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1568701,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.1907294,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2151012,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2266555,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"gpu-operator","name":"nvidia-device-plugin-daemonset"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2272947,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.2803714,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.295747,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3126926,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3409662,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3533444,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.365767,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"gpu-operator","name":"nvidia-dcgm-exporter"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3658545,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm-exporter","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.3879695,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.4111402,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.4461775,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.4819763,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.512175,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.525641,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"gpu-operator","name":"gpu-feature-discovery"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.52591,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.5471108,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.5630667,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.5940511,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.61711,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.6435907,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.673655,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-mig-parted-config","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.7027156,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"gpu-operator"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.7207994,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"gpu-operator","name":"nvidia-mig-manager"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.720926,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"ready"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.790162,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.8474965,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-manager","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.905581,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403563.9758897,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403564.0551643,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403564.111743,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
gpu-operator-856dd458c4-zmsdk gpu-operator {"level":"info","ts":1679403564.1118217,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy isn't ready","states not ready":["state-container-toolkit","state-operator-validation","state-device-plugin","state-dcgm-exporter","gpu-feature-discovery"]}
All other pods are failing to start due to this error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
They are just stuck in the PodInitializing state:
Failed to load logs: container "nvidia-device-plugin" in pod "nvidia-device-plugin-daemonset-hdrrv" is waiting to start: PodInitializing
Reason: BadRequest (400)
It seems that this is because the runtime toolkit is unsupported on Ubuntu 22.10 and only supports Ubuntu 22.04. But where is the error that states this and makes the runtime installation fail?
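To narrow down where the "no runtime for nvidia is configured" error comes from, one can check whether the toolkit actually registered the runtime with containerd and whether the RuntimeClass exists. A rough sketch (paths assume a default containerd setup):
# on the node: was the nvidia runtime added to the containerd config?
grep -A3 'containerd.runtimes.nvidia' /etc/containerd/config.toml
# does the runtime binary the config points at exist?
ls -l /usr/local/nvidia/toolkit/nvidia-container-runtime
# from the cluster: is the RuntimeClass present?
kubectl get runtimeclass nvidia -o yaml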
Reverting back to Ubuntu 22.04. Here is the node creation; they provision it with NVIDIA drivers:
So there shouldn't be problems with gpu-operator... But gpu-operator is installing another driver for some reason. Here is nvidia-smi:
root@vultr:~# nvidia-smi
Tue Mar 21 13:36:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GRID A100D-4C On | 00000000:04:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
OK, so one problem was found: the docker image was trying to call /usr/bin/nvidia-toolkit instead of /usr/bin/nvidia-container-toolkit, so creating a link:
ln -s /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-toolkit
helped to solve that one. After that, all validations passed and all pods became green. Then I decided to restart all pods, and it failed again on the toolkit pod:
time="2023-03-21T14:45:44Z" level=info msg="Updating config"
time="2023-03-21T14:45:44Z" level=info msg="Successfully updated config"
time="2023-03-21T14:45:44Z" level=info msg="Flushing config"
time="2023-03-21T14:45:44Z" level=info msg="Successfully flushed config"
time="2023-03-21T14:45:44Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:45:44Z" level=info msg="Shutting Down"
time="2023-03-21T14:45:44Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: error running containerd command: signal: hangup"
After a few containerd restarts and killing the toolkit pod, I managed to make it work... Very strange behavior...
time="2023-03-21T14:48:21Z" level=info msg="Setting up runtime"
time="2023-03-21T14:48:21Z" level=info msg="Starting 'setup' for containerd"
time="2023-03-21T14:48:21Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2023-03-21T14:48:21Z" level=info msg="Successfully parsed arguments"
time="2023-03-21T14:48:21Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
time="2023-03-21T14:48:21Z" level=info msg="Successfully loaded config"
time="2023-03-21T14:48:21Z" level=info msg="Config version: 2"
time="2023-03-21T14:48:21Z" level=info msg="Updating config"
time="2023-03-21T14:48:21Z" level=info msg="Successfully updated config"
time="2023-03-21T14:48:21Z" level=info msg="Flushing config"
time="2023-03-21T14:48:21Z" level=info msg="Successfully flushed config"
time="2023-03-21T14:48:21Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:48:21Z" level=info msg="Successfully signaled containerd"
time="2023-03-21T14:48:21Z" level=info msg="Completed 'setup' for containerd"
time="2023-03-21T14:48:21Z" level=info msg="Waiting for signal"
Why is it shutting down before these lines?
time="2023-03-21T14:48:21Z" level=info msg="Sending SIGHUP signal to containerd
time="2023-03-21T14:48:21Z" level=info msg="Successfully signaled containerd"
time="2023-03-21T14:48:21Z" level=info msg="Completed 'setup' for containerd"
time="2023-03-21T14:48:21Z" level=info msg="Waiting for signal"
like here:
time="2023-03-21T14:45:44Z" level=info msg="Sending SIGHUP signal to containerd"
time="2023-03-21T14:45:44Z" level=info msg="Shutting Down"
Did it fail to restart containerd? Why is there no error then? And how is "Successfully signaled containerd" verified?
OK, so the linking issue is unrelated; it just can't restart containerd by sending a SIGHUP signal to it.
time="2023-03-21T14:45:44Z" level=error msg="error running nvidia-toolkit: unable to setup runtime: error running containerd command: signal: hangup"
Node is using:
containerd containerd.io 1.6.18 2456e983eb9e37e47538f59ea18f2043c9a73640
OK, after digging into it, I found the relevant code in the sources: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/nvidia-toolkit/run.go#L250 and the actual restart attempt here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L321
It seems that it is not selecting the systemd switch case, but is trying to signal containerd as if it were running as a standalone daemon without an init system wrapper.
That is because it is the default here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L49
I don't see any setting in the helm chart options to specify the method for restarting containerd.
I see that there is an env variable for that option, called CONTAINERD_RESTART_MODE, but I don't see it in the running container's env. I will try to modify the daemonset and see whether it is forwarded into the toolkit.
https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L161
Yeah... When I added this env to the daemonset, it started working properly and without errors. It's worth adding that to the helm chart and setting its default to a value that matches the source code.
- name: CONTAINERD_RESTART_MODE
value: systemd
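If the chart version in use exposes toolkit.env (recent gpu-operator charts do), the same variable can be set through Helm instead of hand-editing the daemonset; a sketch, with release name and namespace as assumptions:
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set "toolkit.env[0].name=CONTAINERD_RESTART_MODE" \
  --set "toolkit.env[0].value=systemd"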
It would be much better if you tried both variants, or identified how containerd is started on that node. It shouldn't be that hard to identify: just query systemd for that service and its status; if both exist, use systemd, otherwise assume there is no systemd (or no containerd systemd unit) and restart it as a standalone daemon.
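A rough sketch of the detection suggested above (illustrative only, not the toolkit's actual code), run on the node:
# prefer systemd if it manages containerd, otherwise signal the standalone daemon
if systemctl is-active --quiet containerd; then
  systemctl restart containerd
else
  pkill -HUP -x containerd
fi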
Hope this helps everyone else who runs into the same problem in the future. I wasted around 5 hours identifying it...
@denissabramovs even when run as a systemd service, by default the toolkit container will kill the main containerd process for least disruption to shim processes. The issue you were seeing was known with containerd v1.6.9+ and handled in operator version v22.9.2. I think you had created that issue as well :)
Regarding driver installs, we install them using runfiles, which compile and load the kernel modules. We are working on adding support for pre-compiled images, but that will not be available for all kernel variants, only for -generic.
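For readers on later operator releases where that precompiled support has landed, enabling it looks roughly like this (flag names such as driver.usePrecompiled are taken from the newer chart; verify against the chart version you actually run):
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set driver.usePrecompiled=true \
  --set driver.version=525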
@denissabramovs - Did you manage to install the driver successfully using the operator, or did you rely on the pre-installed driver on the node by disabling it? I think you disabled the operator-managed driver, but I wanted to double check, as I am facing a similar issue. (Thank you for the detailed updates, they are definitely helping.)
I just ran into a similar problem. For me, the driver was not installed at all. Checking the labels, there was one that said nvidia.com/gpu.deploy.driver=pre-installed. After removing this label, the driver installation started and completed successfully.
gpu-operator v23.6.1
Ubuntu 22.04.3 LTS
containerd.io 1.6.22 8165feabfdfe38c65b599c4993d227328c231fca
Kubernetes v1.25.13
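A minimal sketch for checking and clearing the label mentioned above (node name is a placeholder):
# see what the operator decided for the driver on each node
kubectl get nodes -L nvidia.com/gpu.deploy.driver
# drop the "pre-installed" marker so the containerized driver gets deployed
kubectl label node <gpu-node-name> nvidia.com/gpu.deploy.driver-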
Failing to install NVIDIA drivers on a new GPU node with a fresh Ubuntu 22.04 LTS.
Logs are taken from the NVIDIA driver installation daemonset's pod nvidia-driver-daemonset-srf9k:
I'm more concerned about this:
I have checked; we have gcc installed on that machine, and it is exactly gcc 11.3.0:
and: