NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Why do duplicate metrics occur when a job is scheduled to this server? #174

Open WYmindsky opened 3 years ago

WYmindsky commented 3 years ago

yaml: pod-gpu-exporter-daemonset.yaml
docker image: pod-gpu-metrics-exporter:1.0.0-alpha
dcgm: dcgm-exporter:1.4.6

Duplicate metrics occurred when a job had been scheduled to this server for a long time.

JulesBelveze commented 3 years ago

Hey @WYmindsky I'm experiencing the same behaviour. Did you find out why this occurs?

WYmindsky commented 3 years ago

It's still there

nikkon-dev commented 3 years ago

Hi,

Could you provide the logs from dcgm-exporter itself? It looks like there are two dcgm-exporter instances: one that is aware of the k8s environment (it was able to connect to the pod API) and another that is not. The container_name, pod_namespace, and pod_name labels are gathered from the k8s infrastructure. If those labels are missing, the connection from dcgm-exporter to k8s failed, and that should be reflected in the dcgm-exporter logs.

WBR, Nik
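
To check whether a scrape really contains the pattern described above (the same GPU series exported once with the k8s labels and once without), something like the following Python sketch may help. It is only an illustration: the metric name, the `gpu` and `uuid` labels, and the sample values in `SAMPLE_TEXT` are hypothetical and not taken from this issue. Only the label names container_name, pod_namespace, and pod_name come from the comment above. The parser is deliberately minimal and assumes the plain Prometheus text format with labels on every sample of interest.

```python
import re
from collections import defaultdict

# Labels that, per the comment above, come from the k8s infrastructure.
POD_LABELS = {"container_name", "pod_namespace", "pod_name"}

# Minimal patterns for "name{label="value",...} value" lines.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output, for illustration only.
SAMPLE_TEXT = """
# TYPE dcgm_gpu_temp gauge
dcgm_gpu_temp{gpu="0",uuid="GPU-aaa"} 45
dcgm_gpu_temp{gpu="0",uuid="GPU-aaa",pod_name="job-1",pod_namespace="default",container_name="train"} 45
dcgm_gpu_temp{gpu="1",uuid="GPU-bbb"} 40
"""


def parse_samples(text):
    """Return (metric_name, labels_dict) for each labelled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # skip samples without a label set
        name, raw_labels, _value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels))))
    return samples


def find_split_series(text):
    """Find series that appear both with and without the k8s pod labels.

    Series are grouped by metric name plus all non-pod labels, so two
    samples that differ only in the pod labels land in the same group.
    """
    groups = defaultdict(lambda: {"with_pod": 0, "without_pod": 0})
    for name, labels in parse_samples(text):
        base = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in POD_LABELS
        ))
        kind = "with_pod" if POD_LABELS & labels.keys() else "without_pod"
        groups[(name, base)][kind] += 1
    return [
        key for key, counts in groups.items()
        if counts["with_pod"] and counts["without_pod"]
    ]


if __name__ == "__main__":
    for name, base_labels in find_split_series(SAMPLE_TEXT):
        print(name, dict(base_labels))
```

If this reports any series for a real scrape, that would be consistent with two exporters (or two code paths) feeding the same endpoint, one with a working k8s connection and one without. The dcgm-exporter logs would still be needed to confirm the cause.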