NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

the meaning of SM occupancy #283

Open CCrainys opened 5 months ago

CCrainys commented 5 months ago

Hi, DCGM team

I am using the DCGM tool to profile my GPU job.

The result is shown below:

[Image: DCGM profiling output]

The SM occupancy is defined as "The ratio of number of warps resident on an SM".

I am a bit confused about which SM this value refers to, since one GPU contains multiple SMs, correct?

For example, in my test the SM occupancy is around 35%. Does that mean the average SM occupancy across all SMs in the GPU?

nikkon-dev commented 5 months ago

Roughly, the meaning of the SM Occupancy could be described by this documentation: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm

CCrainys commented 5 months ago

Thanks for your reply. But I still don't fully understand the meaning of SM occupancy in the DCGM context...

To quote the linked page: "Occupancy is defined as the ratio of active warps on an SM to the maximum number of active warps supported by the SM"
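To make the quoted definition concrete, here is a minimal sketch of the per-SM ratio. The function name and the limit of 64 warps per SM are assumptions for illustration only; the real limit depends on the GPU architecture, and this is not how DCGM or Nsight compute the value internally.

```python
def sm_occupancy(active_warps: int, max_warps_per_sm: int = 64) -> float:
    """Occupancy of a single SM: active warps / maximum warps the SM supports.

    max_warps_per_sm=64 is an assumed, illustrative limit; check the
    specifications of your GPU architecture for the real value.
    """
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    if not 0 <= active_warps <= max_warps_per_sm:
        raise ValueError("active_warps must be between 0 and max_warps_per_sm")
    return active_warps / max_warps_per_sm


# Hypothetical SM with 16 active warps out of an assumed maximum of 64.
print(f"{sm_occupancy(16):.0%}")  # prints "25%"
```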

I had read this document before. As I understand it, it describes the SM occupancy reported by Nsight, which is measured for each CUDA kernel independently. For Nsight, I can understand the meaning of SM occupancy, because Nsight reports occupancy at the kernel level and, as far as I know, each kernel is scheduled to only one SM.

However, DCGM reports SM occupancy per GPU. What does GPU-wide SM occupancy mean? A GPU has multiple SMs, so a single occupancy value reported at the GPU level confuses me. Does it refer to the average SM occupancy over all SMs of one GPU?

Could you please explain this further? Thanks in advance.
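The interpretation I am asking about can be sketched as follows. This only illustrates the "average over all SMs" hypothesis; it is not confirmed as DCGM's actual method. The function name, the four-SM example, and the limit of 64 warps per SM are all hypothetical.

```python
from statistics import fmean


def gpu_wide_occupancy(resident_warps_per_sm, max_warps_per_sm=64):
    """Mean of the per-SM occupancy ratios (the hypothesis in question).

    resident_warps_per_sm: one resident-warp count per SM (hypothetical data).
    max_warps_per_sm: assumed architecture limit, 64 is only an example.
    """
    if not resident_warps_per_sm:
        raise ValueError("need at least one SM")
    ratios = [warps / max_warps_per_sm for warps in resident_warps_per_sm]
    return fmean(ratios)


# Hypothetical GPU with 4 SMs: per-SM ratios are 0.5, 0.25, 0.0 and 0.75.
print(gpu_wide_occupancy([32, 16, 0, 48]))  # prints 0.375
```

Under this reading, a reported value of about 35% would mean that, on average, each SM held about 35% of the warps it could hold.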