Open Levi080513 opened 8 months ago
Currently we use DCGM_FI_DEV_GPU_TEMP to obtain the instance/GPU list, but this metric is not collected in vGPU clusters, which prevents the dashboard from displaying properly.
https://github.com/NVIDIA/dcgm-exporter/blob/30d4ddcae9c7153c31dd35301aa4a1f3b90a2096/grafana/dcgm-exporter-dashboard.json#L784
https://github.com/NVIDIA/dcgm-exporter/blob/30d4ddcae9c7153c31dd35301aa4a1f3b90a2096/grafana/dcgm-exporter-dashboard.json#L761
Can you try to use other metrics available on your vGPU?
The DCGM_FI_DEV_GPU_UTIL metric works well.
Can I submit a PR to fix it?
@Levi080513, sure, you can submit PRs; we appreciate community contributions.
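For illustration, here is a minimal sketch of what the proposed change could look like, written as a small Python helper that swaps the metric used by the dashboard's template variables. It assumes the instance/GPU variables are defined with Prometheus `label_values(...)` queries under `templating.list`, and that each variable's `query` is either a string or an object with an inner `query` field; both are assumptions about the dashboard JSON, not something confirmed in this thread. The function name `swap_variable_metric` and the `sample` dashboard are hypothetical.

```python
import json

OLD_METRIC = "DCGM_FI_DEV_GPU_TEMP"
NEW_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def swap_variable_metric(dashboard, old=OLD_METRIC, new=NEW_METRIC):
    """Replace `old` with `new` in template-variable queries only.

    Panel queries are left untouched. Returns the number of fields changed.
    """
    changed = 0
    for var in dashboard.get("templating", {}).get("list", []):
        for key in ("query", "definition"):
            value = var.get(key)
            if isinstance(value, str) and old in value:
                # Variable query stored as a plain string.
                var[key] = value.replace(old, new)
                changed += 1
            elif isinstance(value, dict):
                # Variable query stored as an object with an inner "query".
                inner = value.get("query")
                if isinstance(inner, str) and old in inner:
                    value["query"] = inner.replace(old, new)
                    changed += 1
    return changed


# Hypothetical, trimmed-down dashboard used only to exercise the helper.
sample = {
    "templating": {
        "list": [
            {"name": "instance",
             "query": "label_values(DCGM_FI_DEV_GPU_TEMP, instance)"},
            {"name": "gpu",
             "query": {"query": "label_values(DCGM_FI_DEV_GPU_TEMP, gpu)"}},
        ]
    },
    "panels": [
        {"title": "GPU Temperature",
         "targets": [{"expr": "DCGM_FI_DEV_GPU_TEMP"}]}
    ],
}

count = swap_variable_metric(sample)
print(json.dumps(sample["templating"], indent=2))
```

Only the variable definitions are rewritten, so a temperature panel would still query DCGM_FI_DEV_GPU_TEMP and simply show no data on vGPU clusters, while the instance and GPU drop-downs would be populated from DCGM_FI_DEV_GPU_UTIL.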