Closed zasdfgbnm closed 2 days ago
!test
ClonePipelinedTmaCircularBufferLoopAndInsertSync almost feels like a separate pass, since it handles mbarrier synchronization together with loop cloning.
So, warp specialization is derived from circular buffering: e.g., clone a loop into load and compute phases, and skip cloning the expressions that a given phase does not need.
Yes, correct
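To make the "clone a loop into load and compute phases" idea concrete, here is a minimal sketch on a toy IR. Everything in it (`ToyExpr`, `ExprKind`, `Phase`, `cloneBodyForPhase`) is hypothetical and is not nvFuser's actual kernel IR or the real cloner. It also assumes the simplest possible split: the load phase keeps only the TMA loads, and the compute phase keeps everything else.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy IR for illustration only; not nvFuser's kir classes.
enum class ExprKind { TmaLoad, MBarrierWait, Compute };

struct ToyExpr {
  ExprKind kind;
  std::string name;
};

enum class Phase { Load, Compute };

// Clone the loop body for a single phase, skipping the expressions that
// the phase does not need.
// Assumption: the load phase keeps only TMA loads; the compute phase keeps
// the mbarrier waits and the compute expressions.
std::vector<ToyExpr> cloneBodyForPhase(
    const std::vector<ToyExpr>& body,
    Phase phase) {
  std::vector<ToyExpr> cloned;
  for (const ToyExpr& expr : body) {
    const bool is_load = (expr.kind == ExprKind::TmaLoad);
    if ((phase == Phase::Load) == is_load) {
      cloned.push_back(expr);
    }
  }
  // Each expression goes to exactly one phase, so a clone can never be
  // larger than the original body.
  assert(cloned.size() <= body.size());
  return cloned;
}
```

Calling `cloneBodyForPhase` once per phase on the same body yields two disjoint loop bodies, which is the sense in which warp specialization falls out of circular buffering here.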
Today, we are generating kernel IRs like:
Although there is nothing wrong with the generated code, the kernel IR does not make sense, because the wait should occur outside the MMA loop.
Indeed, because of how we determine the circular buffer axis, the TMA block and all syncs are guaranteed to occur in the circular buffer loop. So I just removed onlyOneSerialForLoopOnStack and instead check whether the circular buffer loop is the only loop on the stack.
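Roughly, the new condition amounts to something like the sketch below. `ToyForLoop` and `isOnlyLoopOnStack` are hypothetical names for illustration, not the actual nvFuser types or the code in this PR; the sketch assumes the pass tracks the enclosing loops as a stack of pointers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a kernel IR for-loop node.
struct ToyForLoop {
  int id;
};

// Returns true only when the stack of enclosing loops holds exactly one
// loop and that loop is the circular buffer loop.
bool isOnlyLoopOnStack(
    const std::vector<const ToyForLoop*>& for_loop_stack,
    const ToyForLoop* circular_buffer_loop) {
  assert(circular_buffer_loop != nullptr);
  return for_loop_stack.size() == 1 &&
      for_loop_stack.front() == circular_buffer_loop;
}
```

With this shape, the check fails as soon as any other loop (for example an inner MMA loop) is on the stack, and it also fails when the single loop on the stack is not the circular buffer loop.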