NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

NVIDIA DCGM Exporter Dashboard does not work in vGPU cluster #236

Open Levi080513 opened 8 months ago

Levi080513 commented 8 months ago

Currently we use DCGM_FI_DEV_GPU_TEMP to obtain the instance/GPU list, but this metric is not collected in vGPU clusters, which prevents the dashboard from displaying properly.

https://github.com/NVIDIA/dcgm-exporter/blob/30d4ddcae9c7153c31dd35301aa4a1f3b90a2096/grafana/dcgm-exporter-dashboard.json#L784

https://github.com/NVIDIA/dcgm-exporter/blob/30d4ddcae9c7153c31dd35301aa4a1f3b90a2096/grafana/dcgm-exporter-dashboard.json#L761

nvvfedorov commented 8 months ago

Can you try using other metrics that are available on your vGPU?
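One way to see which metrics a vGPU node actually exposes is to list the metric names in the exporter's Prometheus text output. The following is a minimal sketch, not part of dcgm-exporter: the helper name `dcgm_metric_names` and the sample text are hypothetical, and in practice the text would be whatever the exporter's `/metrics` endpoint returns on the node in question.

```python
import re


def dcgm_metric_names(exposition_text: str) -> list[str]:
    """Return the sorted, de-duplicated DCGM metric names found in
    Prometheus text-format output (comment lines are skipped)."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name is the leading token before "{" or whitespace.
        match = re.match(r"(DCGM_[A-Za-z0-9_]+)", line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Hypothetical sample, made up for illustration only.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 12
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 512
"""

print(dcgm_metric_names(SAMPLE))
```

Any name that shows up in this list on every node is a candidate for populating the dashboard's instance/GPU variables.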

Levi080513 commented 8 months ago

The DCGM_FI_DEV_GPU_UTIL metric works well.
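A rough sketch of what the swap could look like when applied to the dashboard JSON follows. It assumes the template variables are defined with `label_values(<metric>, <label>)` queries under `templating.list`, which is the usual Grafana layout but is not quoted in this thread; the helper name `swap_template_metric` and the trimmed sample dashboard are hypothetical.

```python
import json


def swap_template_metric(dashboard: dict, old: str, new: str) -> int:
    """Replace `old` with `new` in template-variable queries.

    Returns the number of fields that were changed. Panel queries are
    left untouched; only `templating.list` is rewritten.
    """
    changed = 0
    for var in dashboard.get("templating", {}).get("list", []):
        for key in ("query", "definition"):
            value = var.get(key)
            if isinstance(value, str) and old in value:
                var[key] = value.replace(old, new)
                changed += 1
            elif isinstance(value, dict):
                # Some Grafana versions store the query as an object
                # with its own "query" string.
                inner = value.get("query")
                if isinstance(inner, str) and old in inner:
                    value["query"] = inner.replace(old, new)
                    changed += 1
    return changed


# Hypothetical, trimmed dashboard fragment for illustration only.
SAMPLE = {
    "templating": {
        "list": [
            {"name": "instance",
             "query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)"},
            {"name": "gpu",
             "query": {"query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)",
                       "refId": "example"}},
        ]
    }
}

count = swap_template_metric(SAMPLE, "DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_GPU_UTIL")
print(count)
print(json.dumps(SAMPLE, indent=2))
```

Only the variable queries are changed, so the temperature panel itself would still be empty on vGPU nodes, but the instance and GPU drop-downs would populate.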

Levi080513 commented 8 months ago

Can I submit a PR to fix it?

nvvfedorov commented 8 months ago

@Levi080513, sure, you can submit PRs; we appreciate community contributions.