What is the version?
3.1.8-3.1.5
What happened?
I have installed the dcgm-exporter and gpu-nvidia DaemonSets in the Kubernetes cluster, and GPU metrics are now being collected. The cluster has more than 200 nodes. On some nodes the T4 GPUs are occupied by Pods, yet the pod and namespace labels in the exported metrics are empty. I compared the configuration of nodes where the labels are populated against nodes where they are empty and found no differences. The logs of the dcgm-exporter and gpu-nvidia Pods also show no differences and no errors.
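To make the symptom concrete, here is a minimal diagnostic sketch (not part of the original report) that scrapes a node's dcgm-exporter endpoint and prints GPU-utilization samples whose pod or namespace label is empty. The port (9400) and metric name (DCGM_FI_DEV_GPU_UTIL) are the exporter's defaults; the node address is a placeholder.

```python
# Hedged diagnostic sketch, assuming the dcgm-exporter defaults:
# metrics served on port 9400 and DCGM_FI_DEV_GPU_UTIL in the counter set.
import re
import urllib.request

ENDPOINT = "http://<node-ip>:9400/metrics"  # placeholder: point at an affected node


def empty_pod_samples(endpoint: str):
    """Yield GPU-utilization samples whose pod or namespace label is empty."""
    body = urllib.request.urlopen(endpoint, timeout=10).read().decode()
    for line in body.splitlines():
        if not line.startswith("DCGM_FI_DEV_GPU_UTIL{"):
            continue
        # Parse the Prometheus label set, e.g. pod="...",namespace="..."
        labels = dict(re.findall(r'(\w+)="([^"]*)"', line))
        if labels.get("pod") == "" or labels.get("namespace") == "":
            yield line


if __name__ == "__main__":
    for sample in empty_pod_samples(ENDPOINT):
        print(sample)
```

On the affected nodes this prints utilization samples even though the pod and namespace labels are blank, while on healthy nodes the same query returns samples with both labels filled in.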
What did you expect to happen?
I expected the pod and namespace labels to be populated whenever a GPU is occupied by a Pod. I want to understand why they are empty on some nodes and what to check.
What is the GPU model?
Each node has an NVIDIA T4 GPU.
What is the environment?
A Kubernetes cluster with 200+ bare-metal nodes.
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
No response
Anything else we need to know?
No response