NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Pods stuck in Terminating after upgrade to v1.11.1 #399

Open neggert opened 2 years ago

neggert commented 2 years ago

1. Issue or feature description

After upgrading from gpu-operator v1.10.0 to v1.11.1, the stack does not seem to come up cleanly without manual intervention. I end up with the gpu-feature-discovery, nvidia-dcgm-exporter and nvidia-device-plugin-daemonset pods stuck in Terminating. Manually restarting the container toolkit by either deleting the nvidia-container-toolkit-daemonset pod or doing kubectl rollout restart daemonset nvidia-container-toolkit-daemonset seems to resolve the problem, but I shouldn't need to manually intervene.
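
Concretely, that manual intervention is either of these (the pod name suffix is just a placeholder; yours will differ):

kubectl -n nvidia-gpu-operator delete pod nvidia-container-toolkit-daemonset-<suffix>
kubectl -n nvidia-gpu-operator rollout restart daemonset nvidia-container-toolkit-daemonset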

2. Steps to reproduce the issue

Install GPU operator from the helm chart (via Argo CD) using these values:

nfd:
  enabled: true
mig:
  strategy: mixed
driver:
  version: "515.48.07"
  rdma:
    enabled: false
  manager:
    env:
      - name: ENABLE_AUTO_DRAIN
        value: "true"
      - name: DRAIN_USE_FORCE
        value: "true"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "true"
toolkit:
  enabled: true
  version: "v1.10.0-centos7"
dcgmExporter:
  version: "2.4.5-2.6.7-ubuntu20.04"
migManager:
  enabled: true
  config:
    name: mig-parted-config
vgpuManager:
  enabled: false
vgpuDeviceManager:
  enabled: false 
vfioManager:
  enabled: false
sandboxDevicePlugin:
  enabled: false
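
For reference, outside of Argo CD the equivalent plain Helm install would look roughly like this (release name assumed, values above saved to values.yaml):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator --create-namespace \
  --version v1.11.1 -f values.yaml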

Remove and re-add the operands from a node:

kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-

After waiting a few minutes for the driver and toolkit pods to become ready, several pods are stuck in Terminating and the cuda-validator pod is in Init:Error.

kubectl get pods -n nvidia-gpu-operator
NAME                                                              READY   STATUS        RESTARTS   AGE
gpu-feature-discovery-tz7kr                                       0/1     Terminating   0          3m58s
gpu-operator-84d9f557c8-2jtdp                                     1/1     Running       0          125m
nvidia-container-toolkit-daemonset-d6mzq                          1/1     Running       0          3m58s
nvidia-cuda-validator-47lz5                                       0/1     Init:Error    4          103s
nvidia-dcgm-exporter-zhwr6                                        0/1     Terminating   0          3m58s
nvidia-device-plugin-daemonset-r4gfd                              0/1     Terminating   1          3m58s
nvidia-device-plugin-validator-qkq9d                              0/1     Completed     0          99m
nvidia-driver-daemonset-78dn5                                     1/1     Running       0          3m58s
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4jdtgj   1/1     Running       0          125m
nvidia-gpu-operator-node-feature-discovery-worker-j6zwm           1/1     Running       2          124m
nvidia-gpu-operator-node-feature-discovery-worker-mk9qd           1/1     Running       0          124m
nvidia-gpu-operator-node-feature-discovery-worker-xxvgd           1/1     Running       0          124m
nvidia-mig-manager-hhdq6                                          1/1     Running       0          87s
nvidia-operator-validator-p4h7q                                   0/1     Init:2/4      0          3m49s

I don't see anything unusual in either the driver or container toolkit pod logs.

If I manually restart the container toolkit and wait a few minutes, everything comes up as expected.

kubectl rollout restart daemonset nvidia-container-toolkit-daemonset -n nvidia-gpu-operator

kubectl get pods -n nvidia-gpu-operator
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rml5z                                       1/1     Running     0          114s
gpu-operator-84d9f557c8-2jtdp                                     1/1     Running     0          134m
nvidia-container-toolkit-daemonset-9k8mg                          1/1     Running     0          67s
nvidia-cuda-validator-gzjrj                                       0/1     Completed   0          37s
nvidia-dcgm-exporter-wmzcz                                        1/1     Running     0          114s
nvidia-device-plugin-daemonset-mpmdz                              1/1     Running     0          114s
nvidia-device-plugin-validator-425b8                              0/1     Completed   0          30s
nvidia-driver-daemonset-78dn5                                     1/1     Running     0          12m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4jdtgj   1/1     Running     0          134m
nvidia-gpu-operator-node-feature-discovery-worker-j6zwm           1/1     Running     2          133m
nvidia-gpu-operator-node-feature-discovery-worker-mk9qd           1/1     Running     0          133m
nvidia-gpu-operator-node-feature-discovery-worker-xxvgd           1/1     Running     0          133m
nvidia-mig-manager-hhdq6                                          1/1     Running     0          10m
nvidia-operator-validator-ztbjv                                   1/1     Running     0          77s

3. Information to attach (optional if deemed irrelevant)

Other Info:
- Kubernetes 1.21.10
- containerd 1.6.1 (config attached: containerd-config.toml.txt)
- CentOS 7.9.2009

shivamerla commented 2 years ago

@neggert Thanks for reporting this, I will try to reproduce it. One question: why was this step done after the upgrade to v1.11.1?

kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands=false
kubectl label node dev-worker-gpu-0 nvidia.com/gpu.deploy.operands-
neggert commented 2 years ago

I wanted to check to see if the issue was a result of upgrading existing nodes in place. In an attempt to rule that out, I used the label to completely remove the GPU operator from the node, then re-deploy it. In the past, I've found that this is a good way to "reset" anything related to the GPU operator that gets into a weird state.

I get the same result whether I include that step or not, so I don't think the issue is related to the upgrade process.

neggert commented 1 year ago

@shivamerla Any luck in reproducing this? Happy to provide more info if you let me know what you need.

shivamerla commented 1 year ago

@neggert Can you attach /var/log/messages or logs from journalctl -xb > journal.log? This might help us understand whether containerd is reloaded correctly after the toolkit upgrade. If it got into an error state after the first reset, that might explain why containers were not reaped correctly.
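
For example, on the affected node (the unit filter is optional, just to narrow things down to containerd):

journalctl -xb > journal.log && gzip journal.log
journalctl -b -u containerd > containerd-journal.log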

toolkit log:

time="2022-08-26T18:19:31Z" level=info msg="Successfully loaded config"
time="2022-08-26T18:19:31Z" level=info msg="Config version: 2"
time="2022-08-26T18:19:31Z" level=info msg="Updating config"
time="2022-08-26T18:19:31Z" level=info msg="Successfully updated config"
time="2022-08-26T18:19:31Z" level=info msg="Flushing config"
time="2022-08-26T18:19:31Z" level=info msg="Successfully flushed config"
time="2022-08-26T18:19:31Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-08-26T18:19:31Z" level=info msg="Successfully signaled containerd"
time="2022-08-26T18:19:31Z" level=info msg="Completed 'setup' for containerd"
time="2022-08-26T18:19:31Z" level=info msg="Waiting for signal"

But the device plugin is stuck in Terminating because containerd is not ready yet:

  Warning  FailedCreatePodSandBox  4m57s                  kubelet            Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused"
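
A quick way to confirm on the node whether containerd actually came back after the reload (socket path taken from the error above; crictl assumed to be available):

# check that containerd is running and its CRI endpoint answers again
systemctl status containerd
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock info > /dev/null && echo "CRI responding"
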
neggert commented 1 year ago

I got slightly different behavior when I went to reproduce this again. Now things are stuck in a crash loop:

gpu-feature-discovery-6rb27                                       0/1     Init:CrashLoopBackOff   11         37m
gpu-operator-84d9f557c8-gp9p4                                     1/1     Running                 0          37m
nvidia-container-toolkit-daemonset-blbt4                          1/1     Running                 0          37m
nvidia-dcgm-exporter-cwnhb                                        0/1     Init:CrashLoopBackOff   11         37m
nvidia-device-plugin-daemonset-gsbdr                              0/1     Init:CrashLoopBackOff   11         37m
nvidia-driver-daemonset-7hq8v                                     1/1     Running                 0          37m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4tsqxv   1/1     Running                 0          37m
nvidia-gpu-operator-node-feature-discovery-worker-kkdcp           1/1     Running                 0          37m
nvidia-gpu-operator-node-feature-discovery-worker-m9qpd           1/1     Running                 0          37m
nvidia-gpu-operator-node-feature-discovery-worker-tqhj2           1/1     Running                 0          37m
nvidia-mig-manager-m6nlc                                          0/1     Init:CrashLoopBackOff   11         37m
nvidia-operator-validator-j24lb                                   0/1     Init:CrashLoopBackOff   11         36m

Events look like this:

Events:
  Type     Reason                  Age                     From               Message
  ----     ------                  ----                    ----               -------
  Normal   Scheduled               5m54s                   default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-device-plugin-daemonset-gsbdr to dev-worker-gpu-0
  Warning  FailedCreatePodSandBox  3m48s (x11 over 5m54s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
  Warning  FailedCreatePodSandBox  3m33s                   kubelet            Failed to create pod sandbox: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused"
  Normal   Pulled                  3m2s (x3 over 3m20s)    kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.11.1" already present on machine
  Normal   Created                 3m2s (x3 over 3m20s)    kubelet            Created container toolkit-validation
  Warning  Failed                  3m1s (x3 over 3m20s)    kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: mount error: stat failed: /run/nvidia/driver/proc/driver/nvidia/gpus/0000:13:00.0: no such file or directory: unknown
  Warning  BackOff  48s (x13 over 3m19s)  kubelet  Back-off restarting failed container

As before, restarting the container-toolkit pod resolves the problem.

journal logs are attached. journal.log.gz

neggert commented 1 year ago

The above problem seems to happen because the driver container is not populating /run/nvidia/driver/proc/driver/nvidia. Rebooting the node seems to resolve that. Not ideal, but I think it might be a separate issue.
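
One way to see that directly is to check the path from the mount error on the node; if it is empty or missing, the driver container's root has not been populated:

# should list the driver's proc entries (gpus/, version, ...) when the driver container is healthy
ls /run/nvidia/driver/proc/driver/nvidia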

I see pods stuck in Terminating when the node comes up after reboot, so it's actually a great way to get some nice clean logs that demonstrate the problem.

Pods:

gpu-feature-discovery-mcfqc                                       0/1     Terminating             0          25m
gpu-operator-84d9f557c8-fw8sv                                     1/1     Running                 0          25m
nvidia-container-toolkit-daemonset-f24pl                          1/1     Running                 1          24m
nvidia-cuda-validator-l4gqp                                       0/1     Init:CrashLoopBackOff   2          40s
nvidia-dcgm-exporter-ms8td                                        0/1     Terminating             5          25m
nvidia-device-plugin-daemonset-jmkl5                              0/1     Terminating             1          25m
nvidia-driver-daemonset-dmgfx                                     1/1     Running                 1          24m
nvidia-gpu-operator-node-feature-discovery-master-79bb9ff4p6dmn   1/1     Running                 0          25m
nvidia-gpu-operator-node-feature-discovery-worker-2jpxg           1/1     Running                 0          24m
nvidia-gpu-operator-node-feature-discovery-worker-cq2pz           1/1     Running                 1          25m
nvidia-gpu-operator-node-feature-discovery-worker-hhwz6           1/1     Running                 1          25m
nvidia-mig-manager-xtvlb                                          1/1     Running                 0          25m
nvidia-operator-validator-9cgqc                                   0/1     Init:2/4                1          8m35s

Logs: journal.log.gz

shivamerla commented 1 year ago

@neggert Can you set env for toolkit with --set toolkit.env[0].name=CONTAINERD_RESTART_MODE --set toolkit.env[0].value=none

This will avoid containerd reloads in your case. For upgrades we don't really need a reload, since the nvidia-container-runtime binary path stays the same. We will continue to look into the root cause of this.
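
Since the install here goes through Argo CD with a values file, the same workaround expressed in the values would be roughly (merged into the existing toolkit block):

toolkit:
  enabled: true
  version: "v1.10.0-centos7"
  env:
    - name: CONTAINERD_RESTART_MODE
      value: "none"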

neggert commented 1 year ago

@shivamerla That does solve the problem with the upgrade, but it means that we need to manually log into the node to restart containerd when adding, removing, or reconfiguring the container toolkit. I'd rather restart the daemonset :P.
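
i.e. with CONTAINERD_RESTART_MODE=none, any change to the toolkit config means doing something like this on every GPU node:

# log in to the node and restart containerd by hand so it picks up the new config
ssh dev-worker-gpu-0
sudo systemctl restart containerd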

shivamerla commented 1 year ago

@neggert Agreed, this is just a workaround until we figure out why containerd reloads are causing issues in your case. Currently we don't modify the runtime config over operator/toolkit upgrades; it remains the same, but we still do a runtime reload. The above workaround fixes that issue. That said, the toolkit config might change in future versions, so a runtime reload is still required in those cases.

alloydm commented 1 year ago

@neggert Can you set env for toolkit with --set toolkit.env[0].name=CONTAINERD_RESTART_MODE --set toolkit.env[0].value=none

This will avoid containerd reloads in your case. For upgrades we don't really need a reload, since the nvidia-container-runtime binary path stays the same. We will continue to look into the root cause of this.

I set this env var for the toolkit. The toolkit came up, but the dcgm-exporter and device plugin failed with this error (from kubectl describe po nvidia-dcgm-exporter-bhl8f -n gpu):

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

If I unset the CONTAINERD_RESTART_MODE env var for the toolkit, all the GPU-related pods start running.

shivamerla commented 1 year ago

@alloydm that workaround was mentioned specifically for the case where containerd was not handling restarts properly. Were you seeing the same behavior that made you apply it? By default we want the container-toolkit to be able to reload containerd after applying the config for nvidia-container-runtime; without that, none of the other operator pods would come up.
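
That is what the "no runtime for nvidia" error above reflects: the toolkit still writes the runtime entry into the containerd config, but without the reload the running containerd never picks it up. On the node this would look something like (default config path assumed):

# the entry is present in the file the toolkit wrote ...
grep -i -A2 'runtimes.nvidia' /etc/containerd/config.toml
# ... but the running containerd does not know about it until it reloads
crictl info | grep -i nvidia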

alloydm commented 1 year ago

@shivamerla Oh okay, thank you for the clarification. I thought that with the latest change there was no need to restart containerd. Thanks for the quick update.

alloydm commented 1 year ago

@shivamerla Is there any way (or planned feature) to skip the containerd reload/restart and handle the nvidia-container-runtime config some other way?