NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

nvidia driver daemonset pod is recreated whenever there is an NFD restart #782

Closed. charanteja333 closed this issue 2 months ago.

charanteja333 commented 3 months ago

1. Quick Debug Information

2. Issue or feature description

We are trying to separate NFD out of the GPU Operator namespace and deploy it separately. We installed the GPU Operator with precompiled set to false, and whenever the NFD pod restarts, the driver DaemonSet is terminated and recreated. When this happens the node label nvidia.com/gpu-driver-upgrade-state remains at upgrade-done; because of this, the pods are not evicted from the node on which the driver must be installed, and the driver stays in an init CrashLoopBackOff waiting for pods to be evicted.

I tried setting various env parameters like ENABLE_AUTO_DRAIN and DRAIN_USE_FORCE on k8s-driver-manager, but no luck.
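For context, those driver-manager variables are set under driver.manager.env in the ClusterPolicy. A minimal sketch of toggling them on an existing install might look like the following; the resource name cluster-policy is the chart default and is an assumption here, not something confirmed in this report:

```sh
# Sketch only: enable automatic drain/eviction behaviour in k8s-driver-manager.
# Assumes the ClusterPolicy resource is named "cluster-policy" (the Helm chart default).
# Note: a merge patch replaces the whole env list, so include any other entries you rely on.
kubectl patch clusterpolicy cluster-policy --type merge -p \
  '{"spec":{"driver":{"manager":{"env":[{"name":"ENABLE_AUTO_DRAIN","value":"true"},{"name":"DRAIN_USE_FORCE","value":"true"}]}}}}'
```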

NFD version: 0.15.4
Driver version: 535.183.01

3. Steps to reproduce the issue

  1. Install the GPU Operator with usePrecompiled set to false.
  2. Restart the nfd-worker pod on a node (see the example commands after this list).
  3. The driver DaemonSet pod is stuck in init CrashLoopBackOff.
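For reference, a rough sketch of steps 1 and 2 as commands; the repo alias, release name, and namespaces below are assumptions, not taken from the report:

```sh
# 1. Install the GPU Operator with precompiled drivers disabled.
#    Assumes the NVIDIA Helm repo was added as "nvidia".
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.usePrecompiled=false

# 2. Simulate an NFD restart by deleting the nfd-worker pod running on a GPU node
#    (namespace and pod name depend on how NFD was deployed).
kubectl -n <nfd-namespace> delete pod <nfd-worker-pod-on-the-gpu-node>
```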
cdesiniotis commented 3 months ago

@charanteja333 can you clarify what you mean by "restart nfd of a node"? Do you mean that the nfd-worker pod is restarted on a GPU node?

cdesiniotis commented 3 months ago

> When this happens the node label nvidia.com/gpu-driver-upgrade-state remains at upgrade-done; because of this, the pods are not evicted from the node on which the driver must be installed, and the driver stays in an init CrashLoopBackOff waiting for pods to be evicted.

As a manual workaround, can you try labeling the node with nvidia.com/gpu-driver-upgrade-state=upgrade-required? This should trigger our upgrade controller to evict all the GPU pods on the node and allow the driver to come back to a running state.
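For anyone else hitting this, the manual relabeling suggested above would be something like the following (the node name is a placeholder):

```sh
# Manually mark the node as requiring a driver upgrade so the upgrade
# controller evicts GPU pods and lets the driver pod reach a running state.
kubectl label node <gpu-node-name> \
  nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite
```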

charanteja333 commented 3 months ago

@cdesiniotis Yes, the nfd-worker pod is restarted on the node. The problem is that this can happen frequently because the lifecycles of NFD and the driver are separate now, and we have to work with the sysadmin team to edit the labels, which might cause downtime.

age9990 commented 3 months ago

Maybe a related issue: NFD will remove and re-add node labels if the nfd-worker pod is deleted (and re-created by the nfd-worker DaemonSet).

charanteja333 commented 2 months ago

@cdesiniotis When NFD removes the labels, will the GPU Operator recreate the driver DaemonSet? Because NFD still marks the node as upgrade-done (same values as before), the driver is unable to install, since GPU pods are present and running on the node.

cdesiniotis commented 2 months ago

Thanks @age9990 for providing the relevant issue.

@charanteja333 Until a fix is available in NFD, I would advise downgrading NFD to a version not affected by this bug, e.g. <= v0.14.6, or disabling the garbage collector in the NFD helm values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/deployment/helm/node-feature-discovery/values.yaml#L514
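For the second option, a sketch of disabling the NFD garbage collector in a standalone NFD install, assuming the gc.enable value at the linked location controls it and assuming the repo alias, release name, and namespace:

```sh
# Sketch: turn off the NFD garbage collector on an existing standalone NFD release.
# Assumes the NFD chart repo was added as "nfd".
helm upgrade node-feature-discovery nfd/node-feature-discovery \
  -n node-feature-discovery \
  --reuse-values \
  --set gc.enable=false
```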

cdesiniotis commented 2 months ago

NFD v0.16.2 has been released which addresses this issue: https://github.com/kubernetes-sigs/node-feature-discovery/releases/tag/v0.16.2
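Since NFD is deployed separately in this setup, upgrading the standalone NFD release to the fixed version might look like this (repo alias, release name, namespace, and chart version string are assumptions):

```sh
# Sketch: upgrade a standalone NFD install to the release containing the fix.
helm repo update
helm upgrade node-feature-discovery nfd/node-feature-discovery \
  -n node-feature-discovery \
  --reuse-values \
  --version 0.16.2
```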

cdesiniotis commented 2 months ago

We have upgraded to NFD v0.16.2 in main: https://github.com/NVIDIA/gpu-operator/pull/735

cdesiniotis commented 2 months ago

Closing this issue, as GPU Operator 24.6.0 has been released and uses NFD v0.16.3, which does not contain this bug. Please re-open if you are still encountering issues.