Open futurwasfree opened 2 months ago
Kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU-enabled pods already running.
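For reference, a minimal sketch of that step, assuming the operator runs in the gpu-operator namespace and the driver pods carry the usual app=nvidia-driver-daemonset label (adjust both to your install):

# delete the driver pod(s); the DaemonSet recreates them and the driver reinstall kicks in
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset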
@futurwasfree is there a particular reason why you are directly killing the nvidia-driver-daemonset pod? If you need to perform a driver upgrade, you should be editing driver.version in clusterpolicy, and the driver upgrade-controller would take care of facilitating the upgrade, as documented here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html
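For completeness, a sketch of that flow, assuming the default ClusterPolicy instance is named cluster-policy and using 550.90.07 purely as a placeholder version:

# bump the driver version in ClusterPolicy; the upgrade controller then rolls the nodes
kubectl patch clusterpolicy/cluster-policy --type merge \
    -p '{"spec": {"driver": {"version": "550.90.07"}}}'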
@cdesiniotis This way I'm mimicking an actual node restart, e.g. after an upgrade or for some other reason. The behaviour is the same as I described in the initial post.
Hi @futurwasfree, I get the same errors even though auto drain and auto eviction are allowed. Do you know what could be the reason?
I get the following errors:
Auto eviction of GPU pods on node ... is disabled by the upgrade policy
Auto drain of the node ... is disabled by the upgrade policy
This sounds exactly like my issue. Some configuration checks in k8s-driver-manager are broken, in my opinion.
Btw, currently I'm using a slightly different configuration, but it also has problems and requires manual attention from time to time (on node restart): https://github.com/NVIDIA/gpu-operator/issues/902
Do you know how / where I can configure the k8s-driver-manager settings?
These settings:
manager:
  env:
And where can I apply this --set driver.upgradePolicy.autoUpgrade=false instruction?
Thanks for your efforts!
Basically there are two ways of doing upgrades: either with the Upgrade Controller or with k8s-driver-manager (I'm using the latter at the moment). You can read more about them on this page: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html
Re: env settings like ENABLE_GPU_POD_EVICTION: they live at the driver level, so they go on your driver entry, defined either in the ClusterPolicy CRD or the NVIDIADriver CRD: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-configuration.html
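As an illustration (a sketch, assuming the ClusterPolicy-managed driver with the default instance name cluster-policy, and that ENABLE_GPU_POD_EVICTION / ENABLE_AUTO_DRAIN are the k8s-driver-manager variables you want to flip), something like this should work:

# note: a merge patch replaces the whole manager.env list, so include every variable you need
kubectl patch clusterpolicy/cluster-policy --type merge \
    -p '{"spec": {"driver": {"manager": {"env": [
        {"name": "ENABLE_GPU_POD_EVICTION", "value": "true"},
        {"name": "ENABLE_AUTO_DRAIN", "value": "true"}]}}}}'

If you prefer to set it at install time, driver.manager.env can also be overridden through helm values (mind that the env values have to stay strings).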
Re: --set driver.upgradePolicy.autoUpgrade=false: I set it as an override when I do helm upgrade:
helm upgrade --wait gpu-operator \
-n gpu-operator --create-namespace nvidia/gpu-operator \
--set operator.defaultRuntime="containerd" \
--set driver.nvidiaDriverCRD.enabled=true \
--set driver.nvidiaDriverCRD.deployDefaultCR=false \
--set driver.upgradePolicy.autoUpgrade=false \
...
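(A quick sanity check that's easy to forget: assuming the release is called gpu-operator in the gpu-operator namespace, helm can echo back the overrides that were actually applied.)

# show the user-supplied values for the release, including the upgradePolicy override
helm get values gpu-operator -n gpu-operator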
See the log below, specifically the lines with Current value of AUTO_UPGRADE_POLICY_ENABLED=true, Auto eviction of GPU pods .. and Auto drain ...: just by reading the logic in https://github.com/NVIDIA/k8s-driver-manager/blob/master/driver-manager (_is_driver_auto_upgrade_policy_enabled in particular), this should not happen, I believe.

1. Quick Debug Information
2. Issue or feature description
Auto eviction and auto drain checks are still evaluated; they are not short-circuited by _is_driver_auto_upgrade_policy_enabled (see the sketch below).
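To make the expectation explicit, here is a paraphrased sketch of the guard I would expect, not the actual driver-manager code (the helper names besides _is_driver_auto_upgrade_policy_enabled are hypothetical):

# paraphrased expectation only; _maybe_evict_gpu_pods / _maybe_drain_node are hypothetical names
if _is_driver_auto_upgrade_policy_enabled; then
    echo "Auto upgrade policy is enabled, deferring eviction and drain to the upgrade controller"
else
    _maybe_evict_gpu_pods
    _maybe_drain_node
fi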
3. Steps to reproduce the issue
Default install with the Upgrade Controller enabled. Kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU-enabled pods already running.

4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com