NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

dcgm-exporter counter value goes down #417

Open luccabb opened 5 days ago

luccabb commented 5 days ago

What is the version?

3.1.3-3.1.2

What happened?

I'm tracking DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, which is marked as a counter, but I'm observing the value going down between scrapes, e.g.:

timestamp t
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"...} 57

timestamp t + 1
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"...} 21

I'm also seeing this with other metrics that are reported as counters here. Is this expected behavior for counters?
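
For anyone who wants to check their own series for the same behavior, here is a minimal sketch that scans consecutive samples and reports every drop. The function name `find_decreases` is hypothetical, and the sample values are the two readings quoted above; how you collect the samples (scraping the exporter or querying Prometheus) is left out.

```python
from typing import List, Tuple


def find_decreases(values: List[float]) -> List[Tuple[int, float, float]]:
    """Return (index, previous, current) for every sample lower than its predecessor."""
    drops = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if cur < prev:
            # A true counter should never hit this branch without a restart.
            drops.append((i, prev, cur))
    return drops


# Readings for gpu="1" at timestamps t and t + 1, taken from the report above.
samples = [57.0, 21.0]
drops = find_decreases(samples)
```

A non-empty result on a series with no exporter or driver restart in between suggests the metric is not behaving as a monotonic counter.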

What did you expect to happen?

I'd expect counters to only go up.

Do not use a counter to expose a value that can decrease.

source: https://prometheus.io/docs/concepts/metric_types/#counter
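
Why this matters in practice: Prometheus functions such as `rate()` and `increase()` interpret any decrease in a counter as a counter reset and adjust for it. The sketch below is a simplified illustration of that adjustment, not Prometheus's actual implementation (it ignores timestamps and extrapolation), and the function name `reset_adjusted_increase` is hypothetical.

```python
from typing import List


def reset_adjusted_increase(values: List[float]) -> float:
    """Sum the increase over a series, treating every drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # Drop detected: assume the counter restarted from 0,
            # so the whole current value counts as new increase.
            total += cur
        else:
            total += cur - prev
    return total


# The two readings from the report: 57 at t, 21 at t + 1.
adjusted = reset_adjusted_increase([57.0, 21.0])
naive = 21.0 - 57.0
```

Under this simplified model the drop from 57 to 21 is counted as an increase of 21 rather than a change of -36, so queries built on counter semantics would report traffic that does not match what the metric actually measured.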

What is the GPU model?

A100-SXM4-80GB

What is the environment?

bare metal

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

luccabb commented 5 days ago

If this is expected behavior, should we change the metric type to gauge?

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

https://prometheus.io/docs/concepts/metric_types/#gauge
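
To see which type the exporter currently declares for each metric, the `# TYPE` lines in its Prometheus text output can be read directly. This is a rough sketch: `declared_types` is a hypothetical helper, and the sample text is made up for illustration (only the metric name and the `counter` type come from this report).

```python
from typing import Dict


def declared_types(exposition_text: str) -> Dict[str, str]:
    """Map metric name to the type declared on its '# TYPE <name> <type>' line."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


# Hypothetical excerpt of exporter output; the sample value is the one reported above.
sample_text = """\
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"} 57
"""

types = declared_types(sample_text)
```

Any metric listed as `counter` here whose values can decrease would be a candidate for the change to `gauge` proposed above.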