NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Epilogue Swizzle #1383

Open jeromeku opened 8 months ago

jeromeku commented 8 months ago

What is your question? I'm interested in extending Example 50: Hopper Gemm with Epilogue Swizzle to Ampere architectures and am trying to understand how a swizzled SmemLayout both avoids bank conflicts and enables vectorized copies from shared memory to registers.

Specifically, as configured here and implemented here, the epilogue copies the accumulators from the mainloop into shared memory, laid out using an SmemLayout, and then copies them (ideally in vectorized fashion) into registers partitioned for performing the epilogue op and writing back out to global memory.

How does swizzling affect both bank conflicts when writing from registers -> smem and the vectorized smem -> registers copy, specifically during the epilogue? I get how swizzled layouts are necessary for bank-conflict-free ldmatrix during the mainloop, but am less clear on their effect on copies in the epilogue, where the results from mma.sync are first written back to shared memory and then copied out to registers and eventually to global memory.
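For concreteness, here is a minimal sketch of the kind of epilogue staging configuration being asked about: a swizzled smem layout for the accumulator tile plus a tiled smem->register copy that reads it back in 128-bit vectors. The concrete types and numbers are illustrative (loosely patterned on the CuTe SM80 tutorial style), not quoted from Example 50:

```cpp
#include <cute/tensor.hpp>

using namespace cute;
using Element = half_t;

// 8x64 swizzled atom (a common choice for half_t), tiled up to a 64x64 epilogue tile
using SmemLayoutAtom = decltype(composition(
    Swizzle<3,3,3>{},
    Layout<Shape<_8,_64>, Stride<_64,_1>>{}));
using SmemLayout = decltype(tile_to_shape(SmemLayoutAtom{}, Shape<_64,_64>{}));

// 128 threads, each copying 8 contiguous half_t (128 bits) per copy instruction
using TiledCopyS2R = decltype(make_tiled_copy(
    Copy_Atom<UniversalCopy<uint128_t>, Element>{},
    Layout<Shape<_16,_8>, Stride<_8,_1>>{},   // thread layout over the tile
    Layout<Shape<_1,_8>>{}));                 // value layout: 8 elements per thread

int main() {
  print(SmemLayout{});          // inspect the swizzled staging layout
  print_latex(TiledCopyS2R{});  // emit a LaTeX picture of the S2R partitioning
}
```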

thakkarV commented 8 months ago

The purpose of swizzling is the same no matter where it is used, mainloop or the epi. In the mainloop we swizzle the smem such that the layout of threads copying the data from gmem->smem and the layout of threads copying the data from smem->rmem can both partition it "nicely" (allowing us to copy with vectorized instructions in a bank-conflict-free manner).

In the case of the epi, this is the same -- we want the store to gmem to be vectorized and bank-conflict-free, and the staged ld/st from smem to also be vectorized and bank-conflict-free. Fundamentally, swizzling is the solution for optimizing accesses when two distinct partitioners access the same data tensor.
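As a rough illustration of the "two partitioners, one tensor" point, here is a device-side sketch in CuTe style (the tiled-copy names are placeholders, not CUTLASS API; the R2S copy would typically be built with make_tiled_copy_C from the TiledMMA):

```cpp
#include <cute/tensor.hpp>

using namespace cute;

template <class SmemLayout, class TiledCopyR2S, class TiledCopyS2R, class Element>
__device__ void epilogue_stage(Element* smem,
                               TiledCopyR2S tiled_copy_r2s,
                               TiledCopyS2R tiled_copy_s2r)
{
  // One shared-memory tensor holding the staged accumulator tile
  Tensor sC = make_tensor(make_smem_ptr(smem), SmemLayout{});

  // Partitioner 1: threads arranged like the MMA accumulators, writing rmem -> smem
  auto thr_r2s  = tiled_copy_r2s.get_thread_slice(threadIdx.x);
  Tensor tRS_sC = thr_r2s.partition_D(sC);   // destination of the R2S copy

  // Partitioner 2: threads arranged for vectorized gmem stores, reading smem -> rmem
  auto thr_s2r  = tiled_copy_s2r.get_thread_slice(threadIdx.x);
  Tensor tSR_sC = thr_s2r.partition_S(sC);   // source of the S2R copy

  // The swizzle baked into SmemLayout is what lets *both* of these partitionings
  // access shared memory with full vectors and without bank conflicts.
}
```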

jeromeku commented 8 months ago

Thanks @thakkarV.

I realize that the purpose of swizzling is the same regardless of whether it's used in the mainloop or the epilogue.

I'm having trouble visualizing how the partitioning after the mainloop can be done in a bank-conflict-free / vectorized way.

More specifically, after the mainloop, for a single 16x8x16 tensor-core atom, the data is distributed across threads as in the attached screenshot [thread/value layout of the m16n8k16 accumulator], with each thread owning -- assuming fp16 -- 4 elements stored across 2 uint32_t registers.
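(If the screenshot doesn't render: that thread/value picture can be regenerated with standard CuTe utilities. A minimal sketch, assuming the SM80 fp16 atom:)

```cpp
#include <cute/tensor.hpp>

using namespace cute;

int main() {
  // Thread/value ownership of A, B, and C for a single SM80 16x8x16 fp16 MMA atom;
  // the C table is the accumulator distribution referenced above.
  auto tiled_mma = make_tiled_mma(SM80_16x8x16_F16F16F16F16_TN{});
  print_latex(tiled_mma);   // emits LaTeX to stdout; compile it to get the picture
}
```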

When we partition these registers to be copied to shared memory in the epilogue, each thread in a warp can write 32 bits at a time to a location in shared memory such that the requisite kTileM x kTileN shape is obtained. Where are the bank conflicts occurring during this phase -- assuming no swizzling and a simple kTileM x kTileN row major shared memory layout -- and subsequently, when copying from shared memory back to registers for the epilogue op / copy to gmem?
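To make that scenario concrete, here is a small host-side sketch that just computes which bank each lane's first store lands in, assuming no swizzle, kTileN = 64 half_t per smem row, the adjacent c0/c1 pair written as one 32-bit store, and 32 banks of 4 bytes (all of these are assumptions for illustration, not values from the example):

```cpp
#include <cstdio>

int main() {
  const int kTileN    = 64;                // half_t elements per smem row (illustrative)
  const int kRowBytes = kTileN * 2;        // 128 bytes per row
  for (int lane = 0; lane < 32; ++lane) {
    int row  = lane / 4;                   // m16n8k16 accumulator: groupID
    int col  = (lane % 4) * 2;             // first of the two adjacent fp16 values
    int addr = row * kRowBytes + col * 2;  // byte offset of the 32-bit store
    int bank = (addr / 4) % 32;            // 4-byte banks, 32 banks per phase
    printf("lane %2d -> bank %2d\n", lane, bank);
  }
  // Lanes 0,4,8,...,28 all land in bank 0 (and similarly for banks 1-3):
  // an 8-way conflict on the rmem->smem store whenever the row stride is a
  // multiple of 128 bytes and no swizzle is applied.
}
```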

I understand how bank conflicts can arise during ldmatrix, when each thread accesses 128 bits from shared memory and each "row" that a thread loads falls within the same bank(s), but I'm less clear on how they arise given the data distribution in the epilogue.

Thanks for any clarification.

HanGuo97 commented 8 months ago

Hi, I have a loosely related question about the vectorized epilogue. What are the general rules of thumb/guidelines for configuring the SmemLayout, as well as the tiled copy between smem and registers (TiledCopyS2R)?

I have been going through the example in these lines, but the numbers/choices seem arbitrary to a newbie like me.

Thanks in advance for your time!

github-actions[bot] commented 7 months ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 4 months ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.