NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

After the GPU node is restarted, an error occurs when the nvidia-driver-daemonset pod is started in the offline environment #703

Open sunwuyan opened 2 months ago

sunwuyan commented 2 months ago

After successfully setting up the GPUs with gpu-operator, is it possible to avoid reinstalling the driver when a GPU node restarts? My K8s cluster normally cannot access the public network, but every time the nvidia-driver-daemonset pod restarts it needs network access to finish starting up. Otherwise it reports the following error:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 5.15.0-67-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-security InRelease' is not signed.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
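For anyone triaging a similar log, a minimal sketch of how the unreachable package hosts could be pulled out of the installer output. It assumes the apt error lines have the form "E: Failed to fetch <url> ..." as in the log above; the function name failed_fetch_hosts is made up for illustration and is not part of gpu-operator.

```python
import re
from urllib.parse import urlparse

# Assumption: apt reports fetch failures as "E: Failed to fetch <url> ...",
# matching the lines in the log above.
FAILED_FETCH = re.compile(r"E: Failed to fetch (\S+)")


def failed_fetch_hosts(log_text: str) -> list[str]:
    """Return the distinct hosts apt could not fetch from, in first-seen order.

    Hypothetical helper for illustration only.
    """
    hosts: list[str] = []
    for url in FAILED_FETCH.findall(log_text):
        host = urlparse(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts
```

If this returns only public mirrors such as archive.ubuntu.com, the failure is consistent with the pod having no route to the public network rather than with a driver problem.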

I tried setting driver.upgradePolicy.autoUpgrade to false, and that didn't work either.

cdesiniotis commented 2 months ago

@sunwuyan the driver will always be reinstalled after a reboot; this is a current limitation. Please see this comment: https://github.com/NVIDIA/gpu-operator/issues/705#issuecomment-2077761858

sunwuyan commented 2 months ago

Thanks. I looked at the code, and it seems that if the driver.usePrecompiled property is set to true, it shouldn't repeat the network update. I haven't tried it yet, though; my operating system is Ubuntu 20.04.

cdesiniotis commented 2 months ago

Correct. If precompiled drivers are used, then we do not need network connectivity to update the package cache.

However, we do not have precompiled driver images published for Ubuntu 20.04. We only have tags for Ubuntu 22.04, see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#limitations-and-restrictions
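To make the constraint concrete, here is a rough sketch of a pre-flight check that only emits the precompiled-driver Helm flags when the node OS has published images. It is an illustration under assumptions, not the operator's own logic: the OS identifier strings and the function name driver_helm_flags are hypothetical, the supported set reflects only what is stated in this thread, and passing the driver branch (for example "535") as driver.version follows my reading of the linked precompiled-drivers documentation.

```python
# Assumption: per this thread, precompiled images are published only for
# Ubuntu 22.04. The identifier strings are hypothetical labels for this sketch.
PRECOMPILED_OS = {"ubuntu22.04"}


def driver_helm_flags(os_id: str, driver_branch: str) -> list[str]:
    """Build `helm install` flags that enable precompiled drivers.

    Hypothetical helper: raises ValueError when no precompiled image is
    known for the given OS, so an offline cluster fails fast instead of
    falling back to a network-dependent driver build.
    """
    if os_id not in PRECOMPILED_OS:
        raise ValueError(f"no precompiled driver images known for {os_id}")
    return [
        "--set", "driver.usePrecompiled=true",
        # Assumption: with precompiled drivers, driver.version names the
        # driver branch rather than a full version string.
        "--set", f"driver.version={driver_branch}",
    ]
```

Calling driver_helm_flags("ubuntu20.04", "550") raises ValueError, which mirrors the situation in this issue: the flag exists, but there is no matching image for the reporter's OS.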

sunwuyan commented 1 month ago

Thanks.