alloydm opened this issue 1 year ago (status: Open)
@alloydm thanks for reporting this. When the NVIDIA driver modules are not loaded (the reboot scenario), we can avoid evicting user GPU pods. Will address this in the next patch release.
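For illustration only, the check being described might look something like this on the node; this is a minimal sketch of the idea, not the actual k8s-driver-manager implementation:

# Hypothetical pre-eviction check: if no nvidia kernel modules are loaded
# (e.g. right after a reboot), there is nothing to unload, so GPU pod
# eviction could be skipped entirely.
if ! lsmod | grep -q "^nvidia"; then
  echo "nvidia modules not loaded; skipping GPU pod eviction"
fi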
@alloydm there are a couple of ways this can be mitigated:
1. Set driver.upgradePolicy.autoUpgrade=true. In that case the initContainer will not handle GPU pod eviction; the upgrade controller within the operator will. Note that this path is only triggered on driver daemonset spec updates, not on host reboot.
2. Disable ENABLE_GPU_POD_EVICTION with the driver manager.
We will add a fix to avoid nvdrain during the node reboot case.
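For reference, enabling the first option via Helm would look roughly like the following, assuming the installed chart version exposes driver.upgradePolicy.autoUpgrade (release and namespace names below are examples):

# Let the operator's upgrade controller manage driver upgrades instead of
# the k8s-driver-manager initContainer.
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set driver.upgradePolicy.autoUpgrade=true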
@shivamerla we don't want to set autoUpgrade to true. Disabling ENABLE_GPU_POD_EVICTION with the driver manager - we tried this, but since the notebook is a StatefulSet-controlled pod, it goes into the Terminating state when the node goes down and then stays in Terminating forever.
I am attaching the Kubernetes doc that explains why this happens: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/
We are not hitting this issue during upgrades, as there is an option in the driver env to forcefully delete the user GPU pod. Can we have that force-delete env for the reboot case here too?
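For context, the ENABLE_GPU_POD_EVICTION toggle is passed to the k8s-driver-manager through the chart's driver.manager.env list. A rough sketch of that override is below; exact env names and defaults may vary by chart version, and DRAIN_USE_FORCE is included only as an assumed example of an existing force-style option that applies to the drain path, not to the reboot scenario being discussed:

# Example values file overriding the driver manager env (sketch only;
# note that supplying driver.manager.env replaces the chart's default list).
cat > driver-manager-values.yaml <<'EOF'
driver:
  manager:
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "false"
      # Assumed example of an existing force-style option; it applies to
      # node drain, not to the reboot case described in this issue.
      - name: DRAIN_USE_FORCE
        value: "true"
EOF
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values -f driver-manager-values.yaml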
1. Quick Debug Information
2. Issue or feature description
I have a Kubernetes cluster with the GPU Operator (23.3.2) installed on a Tesla P4 GPU node, and I am running a Kubeflow-based Jupyter notebook that consumes the GPU node. This notebook pod (with a StatefulSet as its replication controller) also has Persistent Volume Claims attached to it. Whenever the GPU node is rebooted, the driver daemonset pod gets stuck in the init stage: the k8s-driver-manager (init container) is stuck evicting the Kubeflow Jupyter notebook pod, and only when we forcefully delete the notebook pod does the driver daemonset go ahead with execution:
kubectl delete pod juypter-nb --force --grace-period=0
I have attached the k8s-driver-manager container's environment variables that I have set.
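Those values can be dumped straight from the init container spec, for example (the pod name is from this cluster and will differ elsewhere):

kubectl get pod nvidia-driver-daemonset-stmk7 -n gpu-operator \
  -o jsonpath='{.spec.initContainers[?(@.name=="k8s-driver-manager")].env}'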
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
kubectl get po -n gpu-operator
kubectl logs nvidia-driver-daemonset-stmk7 -n gpu-operator -f -c k8s-driver-manager
kubectl get pod -n admin