NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Why do we use setmaxnreg? Does it change registers/occupancy? #1676

Open ziyuhuang123 opened 1 month ago

ziyuhuang123 commented 1 month ago

setmaxnreg is a new feature introduced with Hopper. I noticed it in CUTLASS: https://github.com/NVIDIA/cutlass/blob/eee0cab26c8eedea447eb3b58b3498eeba2294da/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp#L446 There, the consumer register count is 232 and the producer register count is 40. Can different warps really use different numbers of registers? And can the register count be modified dynamically while the kernel is running? Either of these would seem to affect occupancy. Using NCU, I see a static register count of 168, not 232 and not 40. Is something wrong?

MARD1NO commented 1 month ago

Maybe it uses 1 load warpgroup and 2 MMA warpgroups.

Then the register count is (1 * 40 + 2 * 232) / 3 = 168? I guess.
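The arithmetic above can be checked with a small host-side sketch. It assumes 128 threads per warpgroup and a 64K-entry 32-bit register file per SM on Hopper; the names `WarpGroupBudget`, `total_registers`, and the `k...` constants are made up for illustration and are not CUTLASS identifiers. Only the 40 and 232 values come from the linked kernel.

```cpp
#include <cstdint>

// Assumptions (not taken from the thread): 128 threads per warpgroup and
// 64K 32-bit registers per SM. The 40 / 232 counts are the ones quoted above.
constexpr std::uint32_t kThreadsPerWarpGroup = 128;
constexpr std::uint32_t kRegistersPerSM = 64 * 1024;

// Hypothetical helper type: one role (producer or consumer) in the kernel.
struct WarpGroupBudget {
    std::uint32_t warp_groups;      // number of warpgroups playing this role
    std::uint32_t regs_per_thread;  // per-thread register count for this role
};

// Registers consumed by every thread of every warpgroup in one role.
constexpr std::uint32_t total_registers(WarpGroupBudget g) {
    return g.warp_groups * kThreadsPerWarpGroup * g.regs_per_thread;
}

constexpr WarpGroupBudget kProducer{1, 40};   // 1 load warpgroup
constexpr WarpGroupBudget kConsumer{2, 232};  // 2 MMA warpgroups

constexpr std::uint32_t kTotalThreads =
    (kProducer.warp_groups + kConsumer.warp_groups) * kThreadsPerWarpGroup;
constexpr std::uint32_t kTotalRegs =
    total_registers(kProducer) + total_registers(kConsumer);
constexpr std::uint32_t kAveragePerThread = kTotalRegs / kTotalThreads;

// The thread-weighted average matches the static count reported by NCU.
static_assert(kAveragePerThread == 168, "average should equal the NCU value");
// The redistributed budget still fits in the assumed per-SM register file.
static_assert(kTotalRegs <= kRegistersPerSM, "budget exceeds register file");
```

If these assumptions hold, the 40/232 split uses the same total number of registers as 384 threads at 168 registers each, which is consistent with the guess that registers are moved between warpgroups rather than added.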

ziyuhuang123 commented 1 month ago

I agree. So the number of registers is determined by:

  1. code shared by all warps (A registers)
  2. code specific to one group of K warps (B registers each) and code specific to another group of T warps (C registers each)

In part 2, the register count is num_2 = (K * B + T * C) / (K + T), i.e. the average weighted by the number of warps in each group.

So is the overall register count max(A, num_2)?
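The averaging guess can be written as a small helper. This is only a sketch of the formula discussed in this thread, not a model of how the compiler or CUTLASS actually picks the launch-time register count. `WarpRole`, `average_registers_per_thread`, and `is_valid_setmaxnreg_count` are hypothetical names. The validity check encodes my reading of the PTX ISA description of setmaxnreg (count between 24 and 256, multiple of 8) and should be treated as an assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one group of warps that runs the same code path.
struct WarpRole {
    std::uint32_t warps;            // number of warps in the group (K or T)
    std::uint32_t regs_per_thread;  // registers per thread in the group (B or C)
};

// Warp-weighted average of per-thread register counts across all groups.
// Every warp has the same number of threads, so weighting by warps and
// weighting by threads give the same result. Returns 0 for an empty list.
inline std::uint32_t average_registers_per_thread(
    const std::vector<WarpRole>& roles) {
    std::uint64_t total_regs = 0;
    std::uint64_t total_warps = 0;
    for (const WarpRole& r : roles) {
        total_regs += static_cast<std::uint64_t>(r.warps) * r.regs_per_thread;
        total_warps += r.warps;
    }
    if (total_warps == 0) {
        return 0;
    }
    return static_cast<std::uint32_t>(total_regs / total_warps);
}

// Assumed constraint on the setmaxnreg operand: 24..256 and a multiple of 8.
inline bool is_valid_setmaxnreg_count(std::uint32_t n) {
    return n >= 24 && n <= 256 && n % 8 == 0;
}
```

For the case in this issue, one load warpgroup is 4 warps at 40 registers and two MMA warpgroups are 8 warps at 232 registers, so `average_registers_per_thread({{4, 40}, {8, 232}})` evaluates to 168, matching the value seen in NCU.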
