Closed Deezzir closed 1 month ago
Unfortunately, that field is incompatible with DCGM-Exporter because it returns an array of values, which cannot be exported to Prometheus.
How do I tell if a metric is compatible or not?
I believe the only definitive way is to find the metric in this file to determine if it is DCGM_FT_STRING or DCGM_FT_BINARY, which are incompatible: https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/src/dcgm_fields.cpp
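The lookup described above can be scripted. The following is a minimal sketch, not an official tool: it assumes that each field ID (DCGM_FI_*) and its type (DCGM_FT_*) appear in the same statement of dcgm_fields.cpp, with the field ID first. That layout is an assumption, and the sample lines (including the `Populate` name and both field IDs) are hypothetical, so check the regex against the real file before relying on it.

```python
import re

# Field types that this thread describes as incompatible with DCGM-Exporter.
INCOMPATIBLE_TYPES = {"DCGM_FT_STRING", "DCGM_FT_BINARY"}

# Assumption: a field ID and its type sit in the same statement (no ';' between
# them), field ID first. Adjust the pattern if the real file is laid out differently.
FIELD_RE = re.compile(r"\b(DCGM_FI_\w+)\b[^;]*?\b(DCGM_FT_\w+)\b")


def classify_fields(source_text):
    """Map each field ID found in source_text to True (exportable) or False."""
    result = {}
    for field_id, field_type in FIELD_RE.findall(source_text):
        result[field_id] = field_type not in INCOMPATIBLE_TYPES
    return result


# Hypothetical sample statements, written for illustration only.
sample = """
Populate(DCGM_FI_DEV_EXAMPLE_COUNTER, DCGM_FT_INT64, "example_counter");
Populate(DCGM_FI_DEV_EXAMPLE_LIST, DCGM_FT_BINARY, "example_list");
"""
compat = classify_fields(sample)
for name, ok in compat.items():
    print(name, "compatible" if ok else "incompatible")
```

To check a real checkout, read the file's text and pass it to `classify_fields` in place of `sample`.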
Thank you!
@glowkey would it be possible to have an exhaustive list on the DCGM-exporter page of the metrics that are supported in Prometheus?
That is a useful request, thanks for the suggestion! We will add it to our backlog.
@glowkey Sorry to comment on a closed issue. Just a question: is it possible to export this metric "array" as separate metrics using the vGPU UUID as a label? It would be great if we could get vGPU metrics (like utilisation, etc.) directly from the hypervisor using the UUID with a prefix, say vGPU.
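To make the request concrete, here is a rough sketch of what "one sample per vGPU, keyed by a UUID label" could look like in the Prometheus text format. It only illustrates the idea in the question and is not something DCGM-Exporter is known to do. The metric name, the `vgpu_uuid` label name, and the UUID strings are all hypothetical.

```python
def format_per_vgpu(metric_name, samples):
    """Render one exposition line per vGPU, keyed by a vgpu_uuid label.

    samples maps a vGPU UUID string to a numeric value.
    """
    lines = []
    for uuid, value in samples.items():
        # Label values must escape backslash, double quote, and newline.
        escaped = (
            uuid.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'{metric_name}{{vgpu_uuid="{escaped}"}} {value}')
    return lines


# Hypothetical metric name and UUIDs, for illustration only.
lines = format_per_vgpu(
    "EXAMPLE_VGPU_UTILIZATION",
    {"vGPU-0000": 12, "vGPU-0001": 40},
)
print("\n".join(lines))
```

Each array element becomes its own numeric sample, which is the shape Prometheus can ingest, whereas a single array-valued field is not.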
What is the version?
3.3.8
What happened?
When the DCGM_FI_DEV_VGPU_INSTANCE_IDS metric is enabled, querying the endpoint will give the following result for it:
What did you expect to happen?
The description for the metrics as per the docs:
It seems like it should be a counter, so int or float. Why is it being converted to string?
What is the GPU model?
It happens on both NVIDIA GeForce RTX 3070 and NVIDIA Tesla V100-DGXS-16GB platforms.
What is the environment?
Both systems had nvidia-driver-550 installed. I can provide other environmental information if you'd like it.
How did you deploy the dcgm-exporter and what is the configuration?
Default configuration, deployed as a snap; only a custom metric CSV is provided.
How to reproduce the issue?
Enable the metric and query the endpoint.
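The query step could be scripted roughly as follows. This is a sketch under assumptions: the address `http://localhost:9400/metrics` is the commonly used dcgm-exporter default and may differ for the snap deployment, and the helper names are made up for this example.

```python
from urllib.request import urlopen


def metric_lines(exposition_text, metric_name):
    """Return the lines of a metrics page that mention metric_name.

    This includes the "# HELP" and "# TYPE" comment lines for the metric.
    """
    return [
        line for line in exposition_text.splitlines() if metric_name in line
    ]


def fetch_metric(url, metric_name):
    """Download a metrics page and keep only the lines for one metric."""
    with urlopen(url, timeout=5) as response:
        text = response.read().decode("utf-8")
    return metric_lines(text, metric_name)


# Example call (assumed address, not run here):
# fetch_metric("http://localhost:9400/metrics", "DCGM_FI_DEV_VGPU_INSTANCE_IDS")
```

Filtering by name keeps the metric's comment lines next to its samples, which makes the reported type easy to compare with the exported value.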
Anything else we need to know?
No response