Open frittentheke opened 3 weeks ago
@frittentheke, Thank you for reporting the issue. Am I right that the main request is the following: metrics that are available per subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.
Thanks @nvvfedorov for your fast response!
Yes. Please also see my PR (https://github.com/NVIDIA/dcgm-exporter/pull/355), in which I had to apply aggregations like max() to work around this for the dashboard. If you'd consider removing those duplicated metrics, I'll gladly simplify the PromQL queries for the dashboard / my PR again.
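For reference, the work-around in that PR essentially aggregates the duplicated per-instance series away; a minimal sketch (metric and label names are the ones mentioned in this thread — the actual dashboard queries may differ):

```promql
# Collapse identical per-MIG-instance series into one value per physical GPU.
# "gpu" and "Hostname" are label names as exported by dcgm-exporter.
max by (gpu, Hostname) (DCGM_FI_DEV_POWER_USAGE)
```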
If I may add another matter to my findings (which I also hit during the dashboard rework): the exported labels are somewhat mixed-case, with about all variants possible: Hostname vs. DCGM_FI_DRIVER_VERSION vs. gpu vs. modelName. Please also consider cleaning this up. Especially when trying to join time-series via a set of labels, it's really painful.
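To illustrate the pain: a hypothetical join between two dcgm-exporter metrics has to spell every mixed-case label exactly, e.g.:

```promql
# Hypothetical join: label names must match case-sensitively ("gpu", "Hostname"),
# and carrying "modelName" into the result requires naming it in group_left().
DCGM_FI_DEV_GPU_UTIL
  * on (gpu, Hostname) group_left (modelName)
DCGM_FI_DEV_POWER_USAGE
```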
What is the version?
3.3.5-3.4.1
What happened?
When activating MIG we saw duplicated and plain wrong metrics in the provided Grafana dashboard (https://github.com/NVIDIA/dcgm-exporter/tree/main/grafana).
The issue seems to be two-fold, with Grafana as well as the raw metrics themselves:
Firstly the dashboard: legends, ... and the PromQL queries used to fetch metrics do not take MIG into account, so metrics returned per MIG subdevice (GPU_I_ID) are not considered, and the GPU panels have not been updated in this regard.
Secondly the metrics: even when using max(), avg() or sum() to avoid duplication, there are some metrics reported back per GPU_I_ID that do not actually have this granularity. See my comment https://github.com/NVIDIA/dcgm-exporter/issues/257#issuecomment-2210537130. So if the power draw is not measured per GPU_I_ID, you cannot return it individually, as you would be returning false values; this is in contrast to the DCGM_FI_PROF_* metrics.
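One way to check whether the per-GPU_I_ID series are mere duplicates (a sketch; DCGM_FI_DEV_POWER_USAGE stands in for any suspect metric):

```promql
# count_values buckets series by their sample value; if the outer count is 1
# for a GPU, every MIG instance reported the exact same number.
count by (gpu) (
  count_values by (gpu) ("value", DCGM_FI_DEV_POWER_USAGE)
)
```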
What did you expect to happen?
Given that MIG and other ways of partitioning GPUs (vGPU, time-slicing, ...) are quite common, I'd expect the exporter and the provided dashboard to take those into account.
Metrics that are available per-subdevice should be returned; if they are just duplicates of each other, they should be dropped and only returned per "main" GPU.
What is the GPU model?
H100s, using different MIG profiles and whole GPUs
What is the environment?
Kubernetes
How did you deploy the dcgm-exporter and what is the configuration?
Kubernetes with GPU-Operator
How to reproduce the issue?
Enable MIG on a GPU and look at the dashboard.
Anything else we need to know?
There are multiple open issues with DCGM or the operator: