NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0
1.02k stars 301 forks

How to monitor occupancy per SM. #196

Open malixian opened 3 years ago

malixian commented 3 years ago

Like SM0:90% SM1:90% .... SM16:0%, SM17:0% ....

bstollenvidia commented 3 years ago

Hi @malixian , per-SM profiling metrics are not possible with DCGM.

chenwenyan commented 3 years ago

Hi, could you please explain the meaning of sm(%) in the output of 'nvidia-smi pmon'? Is it the average occupancy across all SMs?

bstollenvidia commented 3 years ago

Yes. It's the average across all SMs. There are 3 dimensions of utilization:

gr_activity (1001) - Is any kernel running on any SM. Using 1 block with 1 thread = 100%.
sm_activity (1002) - Is any kernel running on the SMs. Using numSMs blocks with 1 thread = 100%. Averaged across SMs.
sm_occupancy (1003) - How many warps ran vs. theoretical max. numSMs blocks with 64 threads = 100%. Averaged across SMs.
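To make the difference between gr_activity and sm_activity concrete, here is a minimal toy model in Python. It is only a sketch of the definitions above, not DCGM code: the SM count of 80, the helper names, and the assumption that each block sits on its own SM for the whole sampling interval are all hypothetical.

```python
from statistics import mean

NUM_SMS = 80  # hypothetical SM count, chosen only for illustration


def launch(num_sms, blocks):
    """Toy placement: each block keeps one distinct SM busy for the whole interval."""
    busy = min(blocks, num_sms)
    return [1.0] * busy + [0.0] * (num_sms - busy)


def gr_activity(per_sm_busy):
    """1.0 if a kernel ran on any SM during the interval, else 0.0."""
    return 1.0 if any(b > 0 for b in per_sm_busy) else 0.0


def sm_activity(per_sm_busy):
    """Average of the per-SM busy fractions."""
    return mean(per_sm_busy)


one_block = launch(NUM_SMS, 1)         # 1 block with 1 thread
full_grid = launch(NUM_SMS, NUM_SMS)   # numSMs blocks with 1 thread

print("1 block:      gr =", gr_activity(one_block), "sm =", sm_activity(one_block))
print("numSMs blocks: gr =", gr_activity(full_grid), "sm =", sm_activity(full_grid))

# Two different per-SM distributions can report the same average.
half_busy = [0.9] * (NUM_SMS // 2) + [0.0] * (NUM_SMS // 2)
all_moderate = [0.45] * NUM_SMS
print("averages:", sm_activity(half_busy), sm_activity(all_moderate))
```

In this model a single block already drives gr_activity to 100% while sm_activity stays at 1/numSMs. The last two lines show why an averaged value cannot recover a per-SM breakdown such as the one asked for at the top of this issue.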

chenwenyan commented 3 years ago

Many thanks for your reply. However, when I train DL jobs such as VGG16 with different batch sizes, sm(%) decreases as the batch size increases: with batch size = 2, sm(%) reaches 97%, but with batch size = 4096 it is about 60%. Do you have any ideas about this?

bstollenvidia commented 3 years ago

Sorry, I'm not familiar with tuning DL jobs. I can help with GPU monitoring questions, though.