NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

[Question] Amount of lag expected for metrics #170

Open jaywonchung opened 2 months ago

jaywonchung commented 2 months ago

I'm drawing a timeline of DCGM metrics (gathered with DcgmReader at a 10 ms update interval) together with Python application-level metrics, such as the number of running requests at each moment. DCGM metrics carry their own microsecond timestamps, and I gathered the timestamps of the application-level metrics with time.time_ns() // 1000.
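For anyone aligning the two clocks this way, a minimal sketch of matching an application-side microsecond timestamp against DCGM-style sample timestamps might look like the following. The sample data and the `align` helper are illustrative, not part of the DCGM API:

```python
import time

def now_us():
    """Application-side timestamp in microseconds, comparable to
    DCGM's epoch-based microsecond timestamps."""
    return time.time_ns() // 1000

# Hypothetical (timestamp_us, value) pairs, shaped like samples one
# might collect via DcgmReader. Values are made up for illustration.
dcgm_samples = [
    (1_700_000_000_000_000, 0.00),
    (1_700_000_000_010_000, 0.85),
]

def align(app_ts_us, samples):
    """Return the latest DCGM sample taken at or before an
    application-side timestamp, or None if none exists yet."""
    eligible = [s for s in samples if s[0] <= app_ts_us]
    return max(eligible, key=lambda s: s[0]) if eligible else None

# Latest sample at or before the application event:
print(align(1_700_000_000_012_000, dcgm_samples))
```

Because both clocks derive from the system epoch, subtracting the aligned timestamps directly gives the apparent lag.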

I see that at the beginning of the application, the SM activity metric (which I'm taking to mean "at least one kernel is running on the GPU") becomes non-zero roughly 300 ms after dispatching the first batch of computations to the GPU. That delay is far too long to be explained by kernel launch overhead or Python interpreter overhead. I don't think it's cache misses either, since the DRAM activity metric rises at the same moment SM activity does.

  1. In general, how long of a lag should I expect for DCGM metrics?
  2. Can different metrics have different amounts of lag? In particular, can the lag differ between the DCGM_FI_DEV_* group and the DCGM_FI_PROF_* group?
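To make the ~300 ms observation above concrete: the lag can be measured as the gap between the application timestamp of the first dispatch and the DCGM timestamp of the first non-zero SM activity sample. A sketch with made-up numbers (the trace values and helper are hypothetical):

```python
# Hypothetical traces, all timestamps in microseconds since epoch.
dispatch_ts_us = 1_000_000  # time.time_ns() // 1000 at the first GPU dispatch

# Illustrative SM activity samples (timestamp_us, value), e.g. from
# a profiling field such as DCGM_FI_PROF_SM_ACTIVE.
sm_activity = [
    (1_000_000, 0.00),
    (1_150_000, 0.00),
    (1_300_000, 0.62),
]

def first_nonzero_lag_ms(dispatch_us, samples, eps=1e-6):
    """Milliseconds from dispatch until the first sample whose
    value exceeds eps; None if activity never appears."""
    for ts, val in samples:
        if val > eps:
            return (ts - dispatch_us) / 1000.0
    return None

print(first_nonzero_lag_ms(dispatch_ts_us, sm_activity))  # 300.0
```

Note this measures the combined effect of any real GPU idle time plus DCGM's sampling and reporting latency; it cannot by itself separate the two.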
george-kuanli-peng commented 1 month ago

I also observe a lag of ~40 seconds in the metrics reported by DCGM-Exporter. I do not observe such a long lag in the metrics reported by cAdvisor on the same platform, though.

george-kuanli-peng commented 2 weeks ago

I found the lag could be reduced by shortening the metric collection interval (the $DCGM_EXPORTER_INTERVAL environment variable, or the -c flag). The default is 30000 ms.

https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html
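For example, assuming dcgm-exporter is installed and on the PATH, the interval can be set either way (values here are illustrative):

```shell
# Collect metrics every 1000 ms instead of the 30000 ms default.
dcgm-exporter -c 1000

# Equivalent, via the environment variable:
DCGM_EXPORTER_INTERVAL=1000 dcgm-exporter
```

A shorter interval reduces the staleness of reported metrics at the cost of more frequent collection overhead.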