NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Overhead of Enabling `DCGM_FI_PROF_SM_ACTIVE` and `DCGM_FI_PROF_SM_OCCUPANCY` Metrics #400

Closed: hongpeng-guo closed this issue 1 month ago

hongpeng-guo commented 1 month ago

Ask your question

I noticed that the metrics `DCGM_FI_PROF_SM_ACTIVE` and `DCGM_FI_PROF_SM_OCCUPANCY` are disabled by default in dcgm-exporter. One likely reason is the overhead introduced by monitoring these lower-level SM activities.

Does anyone have an estimate or experience with the actual performance impact (e.g., in terms of GPU overhead) when enabling these two metrics? While I understand the potential overhead, these metrics are quite valuable for monitoring SM utilization, so I'd like to weigh the trade-offs.

https://github.com/NVIDIA/dcgm-exporter/blob/402a10fd8bb4a36be7cc5b2c703cf8f1322d1ef0/etc/dcp-metrics-included.csv#L81-L82

nikkon-dev commented 1 month ago

@hongpeng-guo, the overhead of those metrics (and DCP metrics in general) is very low. They are disabled by default mainly because they may confuse users and require a better understanding of their meaning.
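
For anyone who wants to turn them on, here is a minimal sketch of the relevant entries in a collectors CSV. It assumes the usual dcgm-exporter layout of `DCGM field, Prometheus metric type, help message`, where a leading `#` marks a line as a comment. The help strings below are illustrative and may not match the wording in the linked file exactly.

```csv
# Format: DCGM field, Prometheus metric type, help message
# Removing the leading '#' from an entry enables that metric.
DCGM_FI_PROF_SM_ACTIVE,    gauge, The ratio of cycles an SM has at least 1 warp assigned.
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM.
```

The exporter would then need to be started with this file as its collectors file, and profiling metrics are only available on GPUs that support DCP fields.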

hongpeng-guo commented 1 month ago

> The overhead of those metrics (and DCP metrics in general) is very low. They are disabled by default mainly because they may confuse users and require a better understanding of their meaning.

Thanks a lot for the prompt reply! Problem solved.