Open: Levi080513 opened this issue 1 month ago
@Levi080513 thanks for the detailed issue! I think our logic which detects stale DaemonSets and cleans them up can be improved to avoid the behavior you are experiencing.
Is there any recent progress on this issue?
@Levi080513 This change was merged into master and should fix the issue you reported: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1085. It will be included in the next release. If you are willing to try out these changes before then and confirm they resolve your issue, that would be helpful as well.
I cherry-picked this MR onto version 23.6.2 and it works well. Thx!
Hi @cdesiniotis
We are facing a similar issue, but without a precompiled driver. Whenever NFD restarts, the driver daemonset restarts as well. When the driver restarts, it gets stuck, leaving us to either restart the GPU pods or drain the node. We are using chart 24.3.0. Will the fix be cherry-picked to 24.3.0 as well?
@charanteja333 what you described is different than what is being reported in this issue. Can you create a new issue with more details on the behavior you are observing?
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
When using a precompiled driver and all GPU nodes are NotReady, gpu-operator repeatedly deletes and recreates the nvidia-driver-daemonset.
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
1. Install gpu-operator with driver.usePrecompiled = true.
2. Run systemctl stop kubelet on a GPU node so that the node becomes NotReady.

nvidia-driver-daemonset will then be deleted and recreated, and this cycle continues until the GPU node is Ready again. While the node is not ready, it carries taints such as node.kubernetes.io/unreachable:NoSchedule, but the nvidia-driver-daemonset pod tolerations do not include that taint (representative snippets are shown below). Because node.kubernetes.io/unreachable:NoSchedule is not tolerated, nvidia-driver-daemonset.status.desiredNumberScheduled is 0.
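The taint and toleration snippets from the original report are not preserved in this thread; the following is a representative sketch. The unreachable:NoSchedule taint is the one cited above, while the toleration list is an assumption based on typical driver DaemonSet defaults.

```yaml
# Representative taints on a node after kubelet is stopped and the node
# transitions to NotReady/Unknown (sketch, not copied from the report):
spec:
  taints:
    - key: node.kubernetes.io/unreachable
      effect: NoSchedule
    - key: node.kubernetes.io/unreachable
      effect: NoExecute
---
# Assumed default tolerations on the nvidia-driver-daemonset pods; note that
# node.kubernetes.io/unreachable:NoSchedule is not among them, so the pods
# cannot be scheduled onto the NotReady node:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```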
Following the logic of cleanupStalePrecompiledDaemonsets, nvidia-driver-daemonset will then be deleted and created again because the cluster still has GPU nodes: https://github.com/NVIDIA/gpu-operator/blob/a9e6a947216518e5940c21523c2400a2f8f4def5/controllers/object_controls.go#L3689-L3728
This does not appear to be normal behavior.
The temporary solution is to add the following configuration when installing gpu-operator:
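The configuration snippet from the original report is not reproduced here. A plausible example, assuming the chart's daemonsets.tolerations value is used to tolerate the NotReady/unreachable taints, would be:

```yaml
# Hypothetical Helm values illustrating the kind of workaround described:
# tolerating the unreachable/not-ready taints keeps the driver DaemonSet's
# desiredNumberScheduled above zero while the node is down, so the stale
# DaemonSet cleanup should not trigger. The daemonsets.tolerations key is an
# assumption about which value was actually used.
daemonsets:
  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoSchedule
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoSchedule
```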
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com