NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

AUTO_UPGRADE_POLICY_ENABLED set to true, but eviction and drain are "disabled by the upgrade policy" #901

Open futurwasfree opened 3 months ago

futurwasfree commented 3 months ago

See the log below, specifically the lines containing "Current value of AUTO_UPGRADE_POLICY_ENABLED=true", "Auto eviction of GPU pods ...", and "Auto drain ...":

Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label                                                                                                                             
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'                                                                                                                                               
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label                                                                                                                              
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'                                                                                                                                                
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label                                                                                                                                  
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label                                                                                                                          
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'                                                                                                                                            
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label                                                                                                                                  
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label                                                                                                                                           
Current value of 'nvidia.com/gpu.deploy.dcgm=true'                                                                                                                                                             
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label                                                                                                                                    
Current value of 'nvidia.com/gpu.deploy.mig-manager='                                                                                                                                                          
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label                                                                                                                                           
Current value of 'nvidia.com/gpu.deploy.nvsm='                                                                                                                                                                 
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label                                                                                                                              
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='                                                                                                                                                    
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label                                                                                                                          
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='                                                                                                                                                
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label                                                                                                                            
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='                                                                                                                                                  
Current value of AUTO_UPGRADE_POLICY_ENABLED=true'                                                                                                                                                             
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels                                                                                                    
node/aks-gputest-50947407-vmss000001 labeled                                                                                                                                                                   
Waiting for the operator-validator to shutdown                                                                                                                                                                 
pod/nvidia-operator-validator-kqw2v condition met                                                                                                                                                              
Waiting for the container-toolkit to shutdown                                                                                                                                                                  
Waiting for the device-plugin to shutdown                                                                                                                                                                      
Waiting for gpu-feature-discovery to shutdown                                                                                                                                                                  
Waiting for dcgm-exporter to shutdown                                                                                                                                                                          
Waiting for dcgm to shutdown                                                                                                                                                                                   
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                            
Unloading NVIDIA driver kernel modules...                                                                                                                                                                      
nvidia_modeset       1306624  0                                                                                                                                                                                
nvidia_uvm           1527808  4                                                                                                                                                                                
nvidia              56717312  143 nvidia_uvm,nvidia_modeset                                                                                                                                                    
drm                   622592  3 drm_kms_helper,nvidia,hyperv_drm                                                                                                                                               
i2c_core               90112  3 drm_kms_helper,nvidia,drm                                                                                                                                                      
Could not unload NVIDIA driver kernel modules, driver is in use                                                                                                                                                
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                                       
Failed to uninstall nvidia driver components                                                                                                                                                                   
Auto eviction of GPU pods on node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                            
Auto drain of the node aks-gputest-50947407-vmss000001 is disabled by the upgrade policy                                                                                                                       
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels                                                                                                      
node/aks-gputest-50947407-vmss000001 labeled             

Just by reading the logic in https://github.com/NVIDIA/k8s-driver-manager/blob/master/driver-manager (_is_driver_auto_upgrade_policy_enabled in particular), I believe this should not happen.
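One way to double-check which policy flags the k8s-driver-manager init container actually receives (the namespace and daemonset name below assume a default ClusterPolicy-managed install; adjust to your deployment):

kubectl -n gpu-operator get daemonset nvidia-driver-daemonset -o yaml \
  | grep -A1 -E 'AUTO_UPGRADE_POLICY_ENABLED|ENABLE_GPU_POD_EVICTION|ENABLE_AUTO_DRAIN'
# each matching "- name: ..." line is followed by its "value: ..." line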

1. Quick Debug Information

2. Issue or feature description

The auto eviction and auto drain checks are evaluated further down the script and are not stopped by _is_driver_auto_upgrade_policy_enabled.

3. Steps to reproduce the issue

Default install with the Upgrade Controller enabled. Kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU-enabled pods already running.
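A rough sketch of that reproduction, assuming a default install in the gpu-operator namespace (the pod name suffix is illustrative):

kubectl -n gpu-operator get pods -o wide | grep nvidia-driver-daemonset
kubectl -n gpu-operator delete pod nvidia-driver-daemonset-xxxxx
# the DaemonSet recreates the pod and its k8s-driver-manager init container re-runs,
# producing output like the log above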

4. Information to attach (optional if deemed irrelevant)

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

cdesiniotis commented 3 months ago

Kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU enabled pods already running.

@futurwasfree is there a particular reason why you are directly killing the nvidia-driver-daemonset pod? If you need to perform a driver upgrade, you should edit driver.version in the ClusterPolicy, and the driver upgrade controller will take care of facilitating the upgrade, as documented here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html
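A hedged example of that flow (the CR name assumes the default cluster-policy created by the chart; the version value is only a placeholder):

kubectl patch clusterpolicy/cluster-policy --type='json' \
  -p='[{"op": "replace", "path": "/spec/driver/version", "value": "<new-driver-version>"}]'
# with driver.upgradePolicy.autoUpgrade=true, the upgrade controller then handles
# eviction and drain node by node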

futurwasfree commented 3 months ago

@cdesiniotis this way I'm mimicking an actual node restart, e.g. after an upgrade or for some other reason. The behaviour is the same as described in my initial post.

SeanEsper commented 3 weeks ago

Hi @futurwasfree, I get the same errors even though auto drain and auto eviction are allowed. Do you know what the reason could be?

SeanEsper commented 3 weeks ago

I get the following errors:

Auto eviction of GPU pods on node ... is disabled by the upgrade policy
Auto drain of the node ... is disabled by the upgrade policy

futurwasfree commented 3 weeks ago

This sounds exactly like my issue. Some configuration checks in k8s-driver-manager are broken in my opinion.

Btw, I'm currently using a slightly different configuration, but it also has problems and requires manual attention from time to time (on node restart): https://github.com/NVIDIA/gpu-operator/issues/902

SeanEsper commented 3 weeks ago

Do you know how / where I can configure the k8s-driver-manager settings?

I mean these settings: the ones under manager: env:.

And where can I apply this --set driver.upgradePolicy.autoUpgrade=false instruction?

Thanks for your efforts!

futurwasfree commented 3 weeks ago

Basically there are two ways of doing the upgrade: either with the Upgrade Controller or with k8s-driver-manager (I'm using the latter at the moment). You can read more about them on this page: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html

Re: env settings like ENABLE_GPU_POD_EVICTION: they are at the driver level, so they belong in your NVIDIADriver entry, defined either in the ClusterPolicy CRD or the NVIDIADriver CRD: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html

Re: --set driver.upgradePolicy.autoUpgrade=false: I set it as an override when I do helm upgrade:

helm upgrade --wait gpu-operator \
    -n gpu-operator --create-namespace nvidia/gpu-operator \
    --set operator.defaultRuntime="containerd" \
    --set driver.nvidiaDriverCRD.enabled=true \
    --set driver.nvidiaDriverCRD.deployDefaultCR=false \
    --set driver.upgradePolicy.autoUpgrade=false \
 ...
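For reference, a sketch of how those manager env settings could be passed through the chart, assuming it exposes them under driver.manager.env (check the values.yaml of your chart version; the env names follow the k8s-driver-manager documentation):

# write a small values override with the driver-manager env vars
cat > driver-manager-values.yaml <<'EOF'
driver:
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "true"
EOF

helm upgrade --wait gpu-operator \
    -n gpu-operator nvidia/gpu-operator \
    -f driver-manager-values.yaml

Note that Helm replaces list-typed values rather than merging them, so the env list above overrides the chart's default driver-manager env entries; include every entry you need.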