NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

kill `onlyOneSerialForLoopOnStack` #3442

Closed zasdfgbnm closed 2 days ago

zasdfgbnm commented 3 days ago

Today, we are generating kernel IRs like:

FOR i1 in MMA:
  FOR i2 in MMA:
    FOR i3 in MMA:
      wait parity
      mma

Although there is nothing wrong with the generated code, the kernel IR does not make sense, because the wait should occur outside the MMA loops.

Indeed, because of how we determine the circular buffer axis, the TMA block and all syncs are guaranteed to occur in the circular buffer loop. So I just remove `onlyOneSerialForLoopOnStack` and instead check whether the circular buffer loop is the only loop on the stack.
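
For illustration, the new condition could look roughly like the sketch below. This is a simplified model, not the actual nvFuser code: `ForLoop`, `is_circular_buffer_loop`, and `onlyCircularBufferLoopOnStack` are hypothetical names made up for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a kernel IR for-loop; not nvFuser's real class.
struct ForLoop {
  std::string name;
  bool is_circular_buffer_loop;
};

// Returns true when exactly one loop encloses the current expression
// and that loop is the circular buffer loop.
bool onlyCircularBufferLoopOnStack(
    const std::vector<const ForLoop*>& for_loop_stack) {
  return for_loop_stack.size() == 1 &&
      for_loop_stack.front()->is_circular_buffer_loop;
}
```

With this kind of check, a sync would be inserted while only the circular buffer loop is on the stack, and not once the traversal has descended into the nested MMA loops.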

zasdfgbnm commented 3 days ago

!test

rdspring1 commented 2 days ago

ClonePipelinedTmaCircularBufferLoopAndInsertSync almost feels like a separate pass, since it handles mbarrier synchronization with loop cloning.

So, warp specialization is derived from circular buffering, e.g., clone a loop into load and compute phases, and skip cloning the expressions that are unnecessary in each phase.
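
A rough sketch of that idea, under heavy simplification: the `Phase`, `Expr`, `Loop`, and `clonePhase` names below are hypothetical and only model "clone the loop once per phase, keeping the expressions that phase needs". They are not nvFuser APIs, and the phase tag on each expression is an assumption of this example.

```cpp
#include <string>
#include <vector>

// Hypothetical tags saying which phase(s) need an expression.
enum class Phase { Load, Compute, Both };

struct Expr {
  std::string name;
  Phase phase;
};

struct Loop {
  std::string index;
  std::vector<Expr> body;
};

// Clone the loop for one phase, skipping expressions that phase does not need.
Loop clonePhase(const Loop& loop, Phase wanted) {
  Loop cloned{loop.index, {}};
  for (const Expr& e : loop.body) {
    if (e.phase == wanted || e.phase == Phase::Both) {
      cloned.body.push_back(e);
    }
  }
  return cloned;
}
```

Calling `clonePhase(loop, Phase::Load)` and `clonePhase(loop, Phase::Compute)` on the same loop would yield the two specialized loops.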

zasdfgbnm commented 2 days ago

> ClonePipelinedTmaCircularBufferLoopAndInsertSync almost feels like a separate pass, since it handles mbarrier synchronization with loop cloning.
>
> So, warp specialization is derived from circular buffering, e.g., clone a loop into load and compute phases, and skip cloning the expressions that are unnecessary in each phase.

Yes, correct