NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

GPU pods end up in CrashLoopBackoff state after eviction #902

Open futurwasfree opened 3 months ago

futurwasfree commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

GPU pods end up in an endless CrashLoopBackOff state due to a missing driver (and nvidia-smi), and manual pod termination (kill) is required. After the pod is terminated, everything runs fine.

3. Steps to reproduce the issue

  1. Configuration: in this setup we pass --set driver.upgradePolicy.autoUpgrade=false and let k8s-driver-manager handle the update (see the sketch after this list). The NVIDIADriver resource has the following configuration associated with it:

    manager:
      env:
        - name: ENABLE_GPU_POD_EVICTION
          value: "false"
        - name: ENABLE_AUTO_DRAIN
          value: "true"
        - name: DRAIN_USE_FORCE
          value: "false"
        - name: DRAIN_POD_SELECTOR_LABEL
          value: ""
        - name: DRAIN_TIMEOUT_SECONDS
          value: "0s"
        - name: DRAIN_DELETE_EMPTYDIR_DATA
          value: "false"
  2. Repro: kill the nvidia-driver-daemonset pod and trigger the driver reinstall on a node with GPU-enabled pods already running. GPU pods get evicted and re-scheduled. After re-scheduling they end up in CrashLoopBackOff state (a repro sketch follows below).
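For context, a hedged sketch of how the configuration in step 1 is typically applied; the chart reference and namespace are assumptions, not taken from the issue:

# Pass the flag at install/upgrade time; chart name and namespace are placeholders.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.upgradePolicy.autoUpgrade=false

And a minimal sketch of the repro in step 2, assuming the operator components run in the gpu-operator namespace and the driver pods carry the app=nvidia-driver-daemonset label; node and pod names are placeholders, adjust for your cluster:

# Delete the driver daemonset pod on the target node to trigger a reinstall.
kubectl delete pod -n gpu-operator -l app=nvidia-driver-daemonset \
  --field-selector spec.nodeName=<gpu-node>

# Watch the driver pod come back while GPU application pods get evicted
# and re-created by their controllers.
kubectl get pods -n gpu-operator -o wide -w

# After the driver is reinstalled, the re-scheduled application pod keeps
# crashing because the driver libraries were not injected into it.
kubectl get pods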

4. Information to attach (optional if deemed irrelevant)

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

cdesiniotis commented 3 months ago

GPU pods get evicted and re-scheduled again. After re-scheduling they end up in CrashLoopBackOff state.

Which pods end up in CrashLoopBackOff? Application pods or pods managed by the GPU Operator, like device-plugin, gpu-feature-discovery, etc.?
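A quick way to check this cluster-wide, assuming access to the cluster (namespace and pod names below are placeholders):

# List crash-looping pods across all namespaces to see which component they belong to.
kubectl get pods -A | grep CrashLoopBackOff

# Inspect a crashing pod's previous container logs.
kubectl logs -n <namespace> <pod-name> --previous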

futurwasfree commented 3 months ago

@cdesiniotis It's the application pods that exit with a non-zero code when they fail to communicate with CUDA (no driver mounted into the container).

It sounds a bit like a race condition, in the sense that by the time the pod is re-scheduled, everything related to the upgrade may or may not have finished and the nvidia driver may or may not be available. But most of the time the result is an application pod without any driver mounted (after the eviction and upgrade procedure).

A manual kill signal sent to the pod fixes it (every time).
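For completeness, a sketch of the check and the manual workaround described above, assuming a stuck application pod named cuda-app in the default namespace (both names are placeholders, not from the issue):

# Verify the driver is not visible inside the stuck container.
kubectl exec cuda-app -- nvidia-smi || echo "no driver mounted in this container"

# Deleting the pod forces its controller to re-create it; the fresh container
# then gets the driver injected and starts normally.
kubectl delete pod cuda-app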