Open jaywonchung opened 2 months ago
I also observe a lag of ~40 seconds in the metrics reported by DCGM-Exporter. I do not observe such a long lag in the metrics reported by cAdvisor on the same platform, though.
I found the lag could be reduced by shortening the metric collection interval (the `DCGM_EXPORTER_INTERVAL` environment variable, or the `-c` flag). The default is 30000 ms.
https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html
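As a sketch of the settings mentioned above (1000 ms is an arbitrary example value, and the image tag is a placeholder — check the docs for the current one):

```shell
# Standalone binary: -c sets the collection interval in milliseconds
dcgm-exporter -c 1000

# Containerized: the same setting via the DCGM_EXPORTER_INTERVAL env var
docker run -d --gpus all \
  -e DCGM_EXPORTER_INTERVAL=1000 \
  -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
```

Note that a shorter interval trades lower metric latency for more frequent polling of the GPU.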
I'm drawing a timeline of DCGM metrics (gathered with `DcgmReader` and an update interval of 10 ms) together with Python application-level metrics, like the number of running requests at each moment. DCGM metrics have their own microsecond timestamps, and I gathered the timestamps of application-level metrics with `time.time_ns() // 1000`.

I see that at the beginning of the application, the SM activity metric (which I'm taking to mean "at least one kernel is running on the GPU") becomes non-zero something like 300 ms after dispatching the first batch of computations to the GPU. That delay is way too long for kernel launch overhead or Python interpreter overhead. I don't think it's cache misses either, since the DRAM activity metric goes up at the same moment SM activity goes up.
`DCGM_FI_DEV_*` group and the `DCGM_FI_PROF_*` group?