Bug with DCGM_FI_DEV_VGPU_INSTANCE_IDS metric

Deezzir commented 1 month ago

What is the version?

3.3.8

What happened?

When the DCGM_FI_DEV_VGPU_INSTANCE_IDS metric is enabled, querying the metrics endpoint returns the following for it:

DCGM_FI_DEV_VGPU_INSTANCE_IDS{gpu="0"} ERROR - FAILED TO CONVERT TO STRING

What did you expect to happen?

The docs describe the metric as follows:

Includes Count and currently Active vGPU Instances on a device

It seems like it should be a counter, so int or float. Why is it being converted to string?

What is the GPU model?

It happens on both the NVIDIA GeForce RTX 3070 and the NVIDIA Tesla V100-DGXS-16GB.

What is the environment?

Both systems had nvidia-driver-550 installed. I can provide other environmental information if you'd like it.

How did you deploy the dcgm-exporter and what is the configuration?

Default configuration deployed as a snap, only a custom metric CSV is provided.

How to reproduce the issue?

Enable the metric and query the endpoint.
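A minimal sketch of how the check could be scripted, assuming the exporter is reachable at http://localhost:9400/metrics (adjust the URL for your deployment). The function names are my own and only illustrative. It flags every series whose value does not parse as a number, which is how the line above shows up.

```python
import urllib.request


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5):
    """Fetch the raw exposition text. The URL is an assumed default."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def non_numeric_samples(exposition_text):
    """Return (series, raw_value) pairs whose value is not a valid float."""
    bad = []
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        # With labels, the value follows the closing brace;
        # without labels, it follows the first space.
        if "}" in line:
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        value = rest.strip()
        tokens = value.split()
        try:
            float(tokens[0])  # an optional timestamp may follow the value
        except (IndexError, ValueError):
            bad.append((series, value))
    return bad


# Usage (needs a running exporter):
#   for series, value in non_numeric_samples(fetch_metrics()):
#       print(series, "->", value)
```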

Anything else we need to know?

No response

glowkey commented 1 month ago

Unfortunately, that field is incompatible with DCGM-Exporter: it returns an array of values, which cannot be exported to Prometheus.

Deezzir commented 1 month ago

How do I tell if a metric is compatible or not?

glowkey commented 1 month ago

I believe the only definitive way is to find the metric in this file and check its field type. Fields of type DCGM_FT_STRING or DCGM_FT_BINARY are incompatible: https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/src/dcgm_fields.cpp
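A rough sketch of how that lookup might be automated over a local copy of the file. It rests on an assumption I have not checked against every version of dcgm_fields.cpp: that the DCGM_FT_* type constant appears shortly after the field identifier in its definition. The helper names and the `window` parameter are hypothetical, and the first occurrence of the identifier could be a comment rather than the definition, so treat the result as a hint and confirm it by eye.

```python
import re

# Field types reported in this thread as not exportable.
INCOMPATIBLE_TYPES = {"DCGM_FT_STRING", "DCGM_FT_BINARY"}


def field_type(source_text, field_name, window=400):
    """Return the first DCGM_FT_* token found soon after field_name, or None.

    Heuristic only: assumes the type constant follows the field identifier
    within `window` characters of the definition.
    """
    match = re.search(r"\b" + re.escape(field_name) + r"\b", source_text)
    if match is None:
        return None
    tail = source_text[match.end():match.end() + window]
    type_match = re.search(r"\bDCGM_FT_[A-Z0-9_]+\b", tail)
    return type_match.group(0) if type_match else None


def is_exportable(source_text, field_name):
    """True/False if a type was found, None if the field could not be resolved."""
    found = field_type(source_text, field_name)
    if found is None:
        return None
    return found not in INCOMPATIBLE_TYPES


# Usage, with a locally downloaded copy of the file:
#   text = open("dcgm_fields.cpp", encoding="utf-8").read()
#   print(is_exportable(text, "DCGM_FI_DEV_VGPU_INSTANCE_IDS"))
```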

Deezzir commented 1 month ago

Thank you!

gabrielcocenza commented 1 month ago

@glowkey would it be possible to have an exhaustive list on the DCGM-Exporter page of the metrics that are supported in Prometheus?

glowkey commented 1 month ago

That is a useful request, thanks for the suggestion! We will add it to our backlog.

mahendrapaipuri commented 1 month ago

@glowkey Sorry to comment on a closed issue. Just a question: would it be possible to export this metric "array" as separate metrics, using the vGPU UUID as a label? It would be great if we could get vGPU metrics (utilisation, etc.) directly from the hypervisor, keyed by UUID with a prefix such as vGPU.
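To make the idea concrete, here is a sketch of the output shape I have in mind. It is not existing DCGM-Exporter behaviour: the metric name `dcgm_vgpu_instance_info`, the `vgpu_uuid` label and the function names are all hypothetical. Each array element becomes its own series with a constant value of 1, following the common Prometheus "info metric" pattern.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline for a Prometheus label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def expand_vgpu_instances(gpu, vgpu_uuids, metric="dcgm_vgpu_instance_info"):
    """Turn one array of vGPU UUIDs into one exposition line per instance.

    The metric and label names are hypothetical placeholders.
    """
    lines = []
    for uuid in vgpu_uuids:
        labels = 'gpu="{}",vgpu_uuid="{}"'.format(
            escape_label_value(gpu), escape_label_value(uuid)
        )
        lines.append("{}{{{}}} 1".format(metric, labels))
    return lines


# Usage:
#   for line in expand_vgpu_instances("0", ["vGPU-aaaa", "vGPU-bbbb"]):
#       print(line)
```

Per-instance metrics such as utilisation could then carry the same `vgpu_uuid` label and be joined on it in queries.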

NVIDIA / dcgm-exporter