NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

GPU pods end up in CrashLoopBackoff state after eviction #902

Open futurwasfree opened 3 months ago

futurwasfree commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

GPU pods end up in an endless CrashLoopBackOff state due to a missing driver (and nvidia-smi), and manual pod termination (kill) is required. After the pod is terminated, everything runs fine.

3. Steps to reproduce the issue

  1. Configuration: in this setup we pass --set driver.upgradePolicy.autoUpgrade=false and let k8s-driver-manager handle the update (see the sketch after this list). The NVIDIADriver resource has the following configuration associated with it:

    manager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: "false"
        - name: ENABLE_AUTO_DRAIN
          value: "true"
        - name: DRAIN_USE_FORCE
          value: "false"
        - name: DRAIN_POD_SELECTOR_LABEL
          value: ""
        - name: DRAIN_TIMEOUT_SECONDS
          value: "0s"
        - name: DRAIN_DELETE_EMPTYDIR_DATA
          value: "false"
  2. Repro: kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU-enabled pods already running. GPU pods get evicted and re-scheduled. After re-scheduling they end up in CrashLoopBackOff state (a repro sketch follows below).
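For context, a hedged sketch of how the configuration in step 1 is typically applied; the chart reference and namespace are assumptions, not taken from the issue:

# Pass the flag at install/upgrade time; chart name and namespace are placeholders.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.upgradePolicy.autoUpgrade=false

And a minimal sketch of the repro in step 2, assuming the operator components run in the gpu-operator namespace and the driver pods carry the app=nvidia-driver-daemonset label; node and pod names are placeholders, adjust for your cluster:

# Delete the driver daemonset pod on the target node to trigger a reinstall.
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=<gpu-node>

# Watch the driver pod come back while GPU application pods get evicted
# and re-created by their controllers.
kubectl get pods -n gpu-operator -o wide -w

# After the driver is reinstalled, the re-scheduled application pod keeps
# crashing because the driver libraries were not injected into it.
kubectl get pods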

4. Information to attach (optional if deemed irrelevant)

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

cdesiniotis commented 3 months ago

GPU pods get evicted and re-scheduled again. After re-scheduling they end up in CrashLoopBackOff state.

Which pods end up in CrashLoopBackOff? Application pods or pods managed by the GPU Operator, like device-plugin, gpu-feature-discovery, etc.?
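A quick way to check this cluster-wide, assuming access to the cluster (namespace and pod names below are placeholders):

# List crash-looping pods across all namespaces to see which component they belong to.
kubectl get pods -A | grep CrashLoopBackOff

# Inspect a crashing pod's previous container logs.
kubectl logs -n <namespace> <pod-name> --previous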

futurwasfree commented 3 months ago

@cdesiniotis It's the application pods that exit with a non-zero code when they fail to communicate with CUDA (no driver mounted into the container).

It sounds a bit like a race condition, in the sense that by the time the pod is re-scheduled, everything related to the upgrade may or may not have finished and the nvidia driver may or may not be available. But most of the time the result is an application pod without any driver mounted (after the eviction and upgrade procedure).

A manual kill signal sent to the pod fixes it (every time).
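For completeness, a sketch of the check and the manual workaround described above, assuming a stuck application pod named cuda-app in the default namespace (both names are placeholders, not from the issue):

# Verify the driver is not visible inside the stuck container.
kubectl exec cuda-app -- nvidia-smi || echo "no driver mounted in this container"

# Deleting the pod forces its controller to re-create it; the fresh container
# then gets the driver injected and starts normally.
kubectl delete pod cuda-app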