NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

gpu driver is in init state after rebooting the gpu node #566

Open · alloydm opened this issue 1 year ago

alloydm commented 1 year ago

1. Quick Debug Information

2. Issue or feature description

I have a Kubernetes cluster with the GPU Operator (23.3.2) installed on a Tesla P4 GPU node, and I am running a Kubeflow-based Jupyter notebook that consumes that GPU node. The notebook pod (controlled by a StatefulSet) also has PersistentVolumeClaims attached to it. Whenever the GPU node is rebooted, the driver daemonset pod gets stuck in the Init stage: the k8s-driver-manager (init container) is stuck evicting the Kubeflow Jupyter notebook pod. Only when we forcefully delete the notebook pod does the driver daemonset proceed: `kubectl delete pod jupyter-nb --force --grace-period=0`
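For reference, a minimal sketch of how this state shows up and of the manual workaround described above; the `gpu-operator` and `kubeflow` namespaces, the daemonset label, and the pod names are assumptions and will differ per cluster:

```sh
# Driver daemonset pod stuck in Init after the node reboot (operand namespace assumed to be gpu-operator)
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide

# k8s-driver-manager init container logs show it waiting on eviction of the notebook pod
kubectl logs -n gpu-operator <nvidia-driver-daemonset-pod> -c k8s-driver-manager

# Manual workaround: force-delete the notebook pod so the driver daemonset can proceed
kubectl delete pod jupyter-nb -n kubeflow --force --grace-period=0
```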

(Screenshots attached: 2023-08-09, 11:52 AM and 11:54 AM.)

I have attached a screenshot of the k8s-driver-manager container's environment variables that I have set:

(Screenshot attached: 2023-08-09, 12:39 PM.)
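Since the screenshot isn't reproduced here, a quick way to dump the same information from the cluster; the namespace and the daemonset/init-container names are assumptions based on a default install:

```sh
# Print the env configured on the k8s-driver-manager init container of the driver daemonset
kubectl get daemonset nvidia-driver-daemonset -n gpu-operator \
  -o jsonpath='{.spec.template.spec.initContainers[?(@.name=="k8s-driver-manager")].env}'
```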

3. Steps to reproduce the issue

  1. Create a Kubernetes cluster with RHEL 8.8 OS and deploy the GPU Operator (23.3.2) using Helm (see the sketch after this list).
  2. Create a Kubeflow-based Jupyter notebook StatefulSet on the GPU node, consuming a PersistentVolumeClaim and utilising the GPU.
  3. Once the notebook pod is up and running on the GPU node, reboot that GPU node.
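A sketch of step 1, assuming the standard NVIDIA Helm repository and a `gpu-operator` release name and namespace (both are assumptions):

```sh
# Add the NVIDIA Helm repository and install GPU Operator 23.3.2
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.3.2
```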

4. Information to attach (optional if deemed irrelevant)

shivamerla commented 1 year ago

@alloydm thanks for reporting this. When the NVIDIA driver modules are not loaded (as in the reboot scenario), we can avoid evicting user GPU pods. We will address this in the next patch release.

shivamerla commented 1 year ago

@alloydm there are a couple of ways this can be mitigated (a sketch of both follows the list).

  1. Enable the driver upgrade controller with driver.upgradePolicy.autoUpgrade=true. In that case the initContainer will not handle GPU pod eviction; the upgrade controller within the operator will. This is triggered only during driver daemonset spec updates, not on host reboot.
  2. Disable "ENABLE_GPU_POD_EVICTION" in the driver manager. With this disabled, on node reboot, since no driver is loaded, we do not attempt GPU pod eviction or node drain. But in cases where the driver container restarts abruptly, it will not evict GPU pods and will get stuck in a crash loop.
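A hedged sketch of how the two mitigations above map onto Helm values; the release/namespace names and the `driver.manager.env` list indexing are assumptions about this particular install:

```sh
# Option 1: let the operator's upgrade controller own eviction/drain
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set driver.upgradePolicy.autoUpgrade=true

# Option 2: disable GPU pod eviction in the k8s-driver-manager init container
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
  --set 'driver.manager.env[0].name=ENABLE_GPU_POD_EVICTION' \
  --set-string 'driver.manager.env[0].value=false'
```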

We will add a fix to avoid nvdrain in the node-reboot case.

alloydm commented 12 months ago

@shivamerla we don't want to set autoUpgrade to true. Disabling "ENABLE_GPU_POD_EVICTION" in the driver manager: we tried this, but since the pod is controlled by a StatefulSet, it goes to the Terminating state when the node goes down and then stays in Terminating forever.

I am attaching the Kubernetes doc explaining why this happens: https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

We are not hitting this issue during upgrades, as there is an option in the driver env to forcefully delete user GPU pods. Can we have that forceful user-GPU-pod deletion env for the reboot case too?
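Which force-deletion knobs exist depends on the chart version: the upgrade controller's `driver.upgradePolicy` section and the driver manager's `DRAIN_USE_FORCE` env are the likely candidates, but treat those key names as assumptions and verify against your chart. A quick way to check what the installed chart version actually exposes:

```sh
# Dump the chart defaults for 23.3.2 and look for force/drain related settings
helm show values nvidia/gpu-operator --version v23.3.2 | grep -i -B 2 -A 4 -E 'force|drain'
```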