Closed by hongpeng-guo 1 month ago
@hongpeng-guo, The overhead of those metrics (and DCP metrics in general) is very low. They are disabled by default mainly because they may confuse users and require a better understanding of their meaning.
Thanks a lot for the prompt reply! Problem solved.
Ask your question
I noticed that the metrics DCGM_FI_PROF_SM_ACTIVE and DCGM_FI_PROF_SM_OCCUPANCY are muted by default in dcgm-exporter. One likely reason could be the overhead introduced when monitoring these lower-level SM activities.

Does anyone have an estimate or experience with the actual performance impact (e.g., in terms of GPU overhead) when enabling these two metrics? While I understand the potential overhead, these metrics are quite valuable for monitoring SM utilization, so I'd like to weigh the trade-offs.
https://github.com/NVIDIA/dcgm-exporter/blob/402a10fd8bb4a36be7cc5b2c703cf8f1322d1ef0/etc/dcp-metrics-included.csv#L81-L82
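For anyone who wants to turn these two metrics on, here is a minimal sketch of uncommenting them in a copy of the metrics CSV. It assumes the rows are disabled with a leading "#" and follow a "name, type, help text" layout. The SAMPLE text is an illustrative stand-in and not the real file contents, and enable_metrics is a hypothetical helper, not part of dcgm-exporter.

```python
import re

# Metric names taken from the question above.
METRICS = {"DCGM_FI_PROF_SM_ACTIVE", "DCGM_FI_PROF_SM_OCCUPANCY"}

# Assumption: a disabled row looks like "# NAME, type, help text".
COMMENTED_ROW = re.compile(r"^\s*#\s*(?P<row>(?P<name>[A-Z0-9_]+)\s*,.*)$")


def enable_metrics(csv_text, metrics=METRICS):
    """Return csv_text with the rows for the given metrics uncommented."""
    out = []
    for line in csv_text.splitlines():
        match = COMMENTED_ROW.match(line)
        if match and match.group("name") in metrics:
            # Drop the leading "#" so the exporter would read this row.
            out.append(match.group("row"))
        else:
            # Leave ordinary comments and all other rows untouched.
            out.append(line)
    return "\n".join(out)


# Illustrative stand-in for the CSV; the help texts are placeholders.
SAMPLE = """\
# Illustrative excerpt, not the real file contents
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
# DCGM_FI_PROF_SM_ACTIVE, gauge, Placeholder help text for SM active.
# DCGM_FI_PROF_SM_OCCUPANCY, gauge, Placeholder help text for SM occupancy.
"""

if __name__ == "__main__":
    print(enable_metrics(SAMPLE))
```

The helper only touches rows whose first field is one of the requested metric names, so free-form comments and other disabled metrics stay as they are. The edited copy would then need to be supplied to dcgm-exporter as its metrics file.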