I found that gpu-operator-node-feature-discovery-worker has 5 pods and no "Node Selector". Why does this need to be installed on non-GPU nodes?
Node Feature Discovery labels nodes with hardware features / system configuration. The GPU Operator depends on these labels to know which worker nodes have GPU(s). If you would like to restrict which nodes NFD worker pods get scheduled to, you can configure a node selector in the NFD helm values.
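If you want the NFD workers to run only on GPU nodes, here is a minimal sketch using helm. It assumes the GPU Operator chart exposes the NFD subchart values under the node-feature-discovery key, and that your GPU nodes carry a hypothetical label such as accelerator=nvidia:

    # Restrict NFD worker pods to nodes carrying the (hypothetical) accelerator=nvidia label
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --set node-feature-discovery.worker.nodeSelector.accelerator=nvidia

After NFD has labeled the nodes, something like kubectl get nodes -l nvidia.com/gpu.present=true should show which nodes the operator treats as GPU nodes.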
I found "nvidia-device-plugin-mps-control-daemon", "nvidia-driver-daemonset", "nvidia-mig-manager " has no pods
If drivers are pre-installed on your GPU nodes, you can explicitly disable the GPU Operator-managed driver by setting driver.enabled=false in ClusterPolicy; that will prevent the nvidia-driver-daemonset from getting created. Similarly, if you don't have any MIG-capable GPUs in your cluster, you can explicitly disable the mig-manager component by setting migManager.enabled=false in ClusterPolicy; that will prevent the nvidia-mig-manager daemonset from getting created.
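As a minimal sketch (the release name, namespace, and default ClusterPolicy name cluster-policy are assumptions about your setup), you can either set these through helm values at install/upgrade time or patch the existing ClusterPolicy:

    # Option 1: disable the operator-managed driver and MIG manager via helm values
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator \
      --set driver.enabled=false \
      --set migManager.enabled=false

    # Option 2: patch the fields directly on an existing ClusterPolicy (name assumed)
    kubectl patch clusterpolicy/cluster-policy --type merge \
      -p '{"spec": {"driver": {"enabled": false}, "migManager": {"enabled": false}}}'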
1. Quick Debug Information
2. Issue or feature description
3. Steps to reproduce the issue
In my EKS cluster there are five nodes; one of them is a GPU node.
I assumed:
However,
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log