NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

[Question] Understanding multiplexing of profiling counters #169

Closed jaywonchung closed 2 months ago

jaywonchung commented 2 months ago

The documentation mentions that some DCGM_FI_PROF metrics cannot be sampled from the device at the same time. When the user requests them in the same field group, they are multiplexed, and zeros are sometimes returned.

Is there a way for me to distinguish whether the value of a metric is actually zero, or was just returned as zero due to multiplexing?

nikkon-dev commented 2 months ago

@jaywonchung,

Because of how the data is stored, there is no way to tell these two situations apart. However, you should not see such side effects unless you request sampling at a very high frequency (10 kHz and higher).

Such high frequencies usually indicate a wrong use case for DCGM, which is a monitoring tool. At those rates, you should use actual profiling tools such as nvprof or Nsight Compute.
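
To put the numbers in this thread side by side, here is a minimal Python sketch that converts a sampling rate in Hz to an update interval and compares it with the roughly 10 kHz figure mentioned above. The helper names and the `MULTIPLEXING_CONCERN_HZ` constant are hypothetical and only illustrate the arithmetic; they are not part of DCGM. The sketch also assumes the interval is expressed in microseconds, which is believed to be the unit DCGM's field-watch calls use, so check the API reference before relying on it.

```python
# Rough threshold quoted in this thread ("10KHz and higher").
# This is an illustrative constant, not a value defined by DCGM.
MULTIPLEXING_CONCERN_HZ = 10_000


def hz_to_interval_us(rate_hz: float) -> int:
    """Convert a sampling rate in Hz to an update interval in microseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return round(1_000_000 / rate_hz)


def is_high_frequency(rate_hz: float) -> bool:
    """Return True if the rate is at or above the threshold quoted in the thread."""
    return rate_hz >= MULTIPLEXING_CONCERN_HZ


if __name__ == "__main__":
    for rate in (10, 20, 10_000):
        print(
            f"{rate} Hz -> {hz_to_interval_us(rate)} us, "
            f"high frequency: {is_high_frequency(rate)}"
        )
```

Under these assumptions, 10 Hz and 20 Hz correspond to intervals of 100000 and 50000 microseconds, three orders of magnitude slower than the 100 microsecond interval implied by 10 kHz.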

jaywonchung commented 2 months ago

Ah, I see. I was sampling at 10-20 Hz, so I guess that should be fine. Thanks for your answer!