Open luccabb opened 5 days ago
What is the version?
3.1.3-3.1.2

What happened?
I'm tracking DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, which is marked as a counter, but I'm observing the value going down, e.g.:
timestamp t: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"...} 57
timestamp t + 1: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1"...} 21
I'm also seeing this with other metrics that are reported as counters here. Is this expected behavior for counters?

What did you expect to happen?
I'd expect counters to only go up. The Prometheus documentation says:
"Do not use a counter to expose a value that can decrease."
source: https://prometheus.io/docs/concepts/metric_types/#counter

What is the GPU model?
A100-SXM4-80GB

What is the environment?
bare metal

How did you deploy the dcgm-exporter and what is the configuration?
No response

How to reproduce the issue?
No response

Anything else we need to know?
No response

If this is expected behavior, should we change the type to gauge?
"A gauge is a metric that represents a single numerical value that can arbitrarily go up and down."
source: https://prometheus.io/docs/concepts/metric_types/#gauge
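
For context on why the declared type matters: Prometheus functions such as rate() and increase() interpret any drop in a counter as a reset to zero. The sketch below is a simplified illustration of that behavior, applied to the two sample values from this report. It ignores the extrapolation and staleness handling that Prometheus performs, and `find_decreases` and `increase_with_resets` are hypothetical helper names, not part of dcgm-exporter or Prometheus.

```python
from typing import List, Tuple

# A sample is (timestamp, value). Timestamps here are illustrative only.
Sample = Tuple[float, float]


def find_decreases(samples: List[Sample]) -> List[Tuple[Sample, Sample]]:
    """Return consecutive (previous, current) pairs where the value dropped."""
    return [
        (prev, curr)
        for prev, curr in zip(samples, samples[1:])
        if curr[1] < prev[1]
    ]


def increase_with_resets(samples: List[Sample]) -> float:
    """Simplified counter increase: any drop is assumed to be a reset to zero."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr < prev:
            # Treated as a reset: the counter is assumed to have restarted at 0.
            total += curr
        else:
            total += curr - prev
    return total


# Values taken from the report above (gpu="1").
samples = [(0.0, 57), (1.0, 21)]

naive_delta = samples[-1][1] - samples[0][1]   # -36: what a gauge-style delta shows
reset_aware = increase_with_resets(samples)    # 21.0: what counter semantics imply
drops = find_decreases(samples)                # one drop, from 57 to 21
```

If the metric really is a point-in-time measurement, the reset-aware result (21.0) would be misleading, which is the practical reason the counter versus gauge distinction matters for queries built on this metric.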