habjouqa opened this issue 4 months ago
I've just sent the must-gather to the following email: operator_feedback@nvidia.com
Thanks @habjouqa, will take a look.
Hello @shivamerla, has there been any update on this?
There's a workaround that doesn't involve reinstalling the operator. After a few minutes, the items in a "not ready" state should resolve themselves.
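The concrete workaround steps aren't captured in this comment, but as a minimal sketch of watching things recover (the `nvidia-gpu-operator` namespace is an assumption; it may differ on your cluster):

```shell
# Watch the GPU Operator operand pods until they all reach Running/Completed.
# NOTE: the namespace is an assumption -- adjust to match your install.
kubectl get pods -n nvidia-gpu-operator -w
```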
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
Installed an OCP 4.14.23 cluster where one worker node has a GPU, but the `gpu-cluster-policy` has failed and the GPU-related pods are not working.

3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
`gpu-cluster-policy` should be "State: ready", but the state is "State: not ready", causing the GPU-related pods to fail (see attached screenshot "gpu-pods.png"). I contacted IBM Support, and they referred me to the logs at `/var/log/nvidia-installer.log`; these logs show failed driver installations. I uninstalled the "Node Feature Discovery Operator" and "NVIDIA GPU Operator", then reinstalled them following the same steps and restarted the node. The drivers are now successfully installed. Attached are two `nvidia-installer` logs showing the state before and after the restart:

nvidia-installer_BeforeRestart.log
nvidia-installer_AfterRestart.log
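For reference, a quick way to confirm the ClusterPolicy state and inspect the driver installer log on the node. This is a sketch: `<gpu-node>` is a placeholder for the affected worker node's name, and the expected state strings should be verified against your operator version:

```shell
# Check the state reported by the GPU Operator's ClusterPolicy
# (a healthy install reports "ready").
kubectl get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

# Inspect the driver installer log on the GPU node via an OpenShift debug pod.
# <gpu-node> is a placeholder for the affected worker node's name.
oc debug node/<gpu-node> -- chroot /host cat /var/log/nvidia-installer.log
```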
4. Information to attach (optional if deemed irrelevant)

- `kubectl get pods -n OPERATOR_NAMESPACE`
- `kubectl get ds -n OPERATOR_NAMESPACE`
- `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
- `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
- Output of `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
- containerd logs: `journalctl -u containerd > containerd.log`
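As a convenience, most of the items above can be gathered in one pass. This is a hedged sketch, not an official script: `OPERATOR_NAMESPACE` and the driver daemonset label selector are assumptions to adjust for your cluster, and the `journalctl` output must be collected on the node itself (e.g. via `oc debug node/...`):

```shell
#!/usr/bin/env bash
# Sketch: collect the debug information listed above into ./gpu-debug/.
# OPERATOR_NAMESPACE is a placeholder -- set it to your install's namespace.
set -euo pipefail
NS="${OPERATOR_NAMESPACE:-nvidia-gpu-operator}"
OUT=gpu-debug
mkdir -p "$OUT"

kubectl get pods -n "$NS" -o wide > "$OUT/pods.txt"
kubectl get ds -n "$NS" > "$OUT/daemonsets.txt"

# Describe and fetch logs for every pod in the namespace.
for pod in $(kubectl get pods -n "$NS" -o name); do
  name="${pod#pod/}"
  kubectl describe -n "$NS" "$pod" > "$OUT/describe-$name.txt"
  kubectl logs -n "$NS" "$name" --all-containers > "$OUT/logs-$name.txt" || true
done

# nvidia-smi from the driver container (the label selector is an assumption).
driver_pod=$(kubectl get pods -n "$NS" -l app=nvidia-driver-daemonset \
  -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
if [ -n "$driver_pod" ]; then
  kubectl exec -n "$NS" "$driver_pod" -c nvidia-driver-ctr -- nvidia-smi \
    > "$OUT/nvidia-smi.txt" || true
fi
```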
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com
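A sketch of fetching and running the must-gather script from the gpu-operator repository; the exact URL and branch below are assumptions, so verify the current path against the repo:

```shell
# Download and run the GPU Operator must-gather script.
# URL/branch are assumptions -- check the gpu-operator repo for the current path.
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
```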