NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

The pod and namespace labels are empty in the metrics of some GPUs that are occupied by Pods #373

Open qingfenghcy opened 3 months ago

qingfenghcy commented 3 months ago

What is the version?

3.1.8-3.1.5

What happened?

I have deployed the dcgm-exporter and gpu-nvidia DaemonSets in a k8s cluster and can now monitor GPU metrics. The cluster has more than 200 nodes. On some nodes, the T4 GPU is occupied by a Pod, but the pod and namespace labels in the exported metrics are empty. I compared the node configuration of affected and unaffected nodes and found no difference. The logs of the dcgm and gpu-nvidia pods also look the same on both kinds of node and show nothing abnormal.
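To narrow down which nodes are affected, one option is to scan the exporter's Prometheus text output for GPU samples whose pod or namespace label is empty. The following is a minimal sketch, not part of dcgm-exporter. It assumes the samples carry `gpu`, `Hostname`, `namespace`, and `pod` labels; the helper names and the sample text are hypothetical and for illustration only.

```python
import re

# One sample line in the Prometheus text format: name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
# One label pair inside the braces; the value may contain escaped characters.
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_labels(line):
    """Return (metric_name, labels) for a labelled sample line, else None."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        return None  # comment (# HELP / # TYPE), blank line, or unlabelled sample
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels"))
    }
    return match.group("name"), labels


def find_unattributed_gpus(metrics_text):
    """Return sorted (Hostname, gpu) pairs whose pod or namespace label is empty."""
    missing = set()
    for line in metrics_text.splitlines():
        parsed = parse_labels(line)
        if parsed is None:
            continue
        _, labels = parsed
        if "gpu" not in labels:
            continue  # not a per-GPU sample
        if not labels.get("pod") or not labels.get("namespace"):
            missing.add((labels.get("Hostname", ""), labels["gpu"]))
    return sorted(missing)


# Hypothetical sample text; label names are assumed, values are invented.
SAMPLE = '''# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a",namespace="ml",pod="train-0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b",namespace="",pod=""} 91
'''

if __name__ == "__main__":
    for hostname, gpu in find_unattributed_gpus(SAMPLE):
        print(f"{hostname} gpu={gpu} has no pod/namespace attribution")
```

An idle GPU also reports empty pod and namespace labels, so the result only becomes meaningful after it is cross-checked against the Pods that actually requested a GPU on each listed node.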

What did you expect to happen?

I would like to know why this happens and what to check.

What is the GPU model?

Each machine has a T4 GPU

What is the environment?

A k8s cluster with 200+ bare metal nodes.

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response