jeromeku opened this issue 8 months ago
The purpose of swizzling is the same no matter where it is used, mainloop or the epi. In the mainloop we swizzle the smem such that the layout of threads copying the data from gmem->smem and the layout of threads copying the data from smem->rmem can both partition it "nicely" (allowing us to copy with vectorized instructions in a bank conflict free manner)
In the case of the epi, this is the same -- we want the store to gmem be vectorized and bank conflict free, and staged ld/st from smem to also be vectorized and bank conflict free. Fundamentally, swizzling is the solution to optimized accesses from two distinct partitioners accessing the same data tensor.
Thanks @thakkarV.
I understand that the purpose of swizzling is the same regardless of mainloop or epilogue.
I'm having trouble visualizing how the partitioning after the mainloop can be done in a bank-conflict / vectorized way.
More specifically, after the mainloop, for a single 16x8x16 tensor core atom, the data is distributed across threads per the standard accumulator layout, with each thread owning -- assuming fp16 -- 4 elements stored across 2 uint32_t registers.
When we partition these registers to be copied to shared memory in the epilogue, each thread in a warp can write 32 bits at a time to a location in shared memory such that the requisite kTileM x kTileN shape is obtained. Where are the bank conflicts occurring during this phase -- assuming no swizzling and a simple kTileM x kTileN row-major shared memory layout -- and subsequently, when copying from shared memory back to registers for the epilogue op / copy to gmem?
I understand how, during ldmatrix, when each thread is accessing 128 bits from shared memory, bank conflicts can arise since the "rows" that different threads are loading can fall within the same bank(s), but I am less clear given the data distribution in the epilogue.
Thanks for any clarification.
Hi, I have a loosely related question about the vectorized epilogue. What are the general rules of thumb/guidelines when configuring the SmemLayout, as well as the tiled copy between smem and registers (TiledCopyS2R)?
I have been going through the example in these lines, but the numbers/choices are seemingly arbitrary to a newbie like me.
Thanks in advance for your time!
What is your question?

I'm interested in extending Example 50: Hopper Gemm with Epilogue Swizzle to Ampere architectures and am trying to understand how the swizzled SmemLayout both avoids bank conflicts and enables vectorized copies from shared memory to registers.

Specifically, as configured here, and implemented here, in the epilogue the accumulators from the mainloop are copied to shared memory, which is laid out using an SmemLayout, and then copied in (ideally) vectorized fashion to registers partitioned for performing the epilogue op and writing back out to global memory.

How does swizzling affect both bank conflicts when writing from registers -> smem and the vectorized smem -> registers copy, specifically during the epilogue? I get how swizzled layouts are necessary for bank-conflict-free ldmatrix during the mainloop but am less clear on the effect on copies in the epilogue, where the results from mma.sync are first written back to shared memory and then copied back out to registers and eventually to global memory.