NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Use TMA with reduction for Hopper Split-K #3383

Closed jacobhinkle closed 1 week ago

jacobhinkle commented 1 week ago

On Ampere, we schedule split-K by first unswizzling to smem (usually); then each CTA in turn performs vectorized stores and loads to a temporary buffer, in a deterministic order, to compute the sum.

Hopper introduces a mechanism to simplify this whole process with a modified TMA instruction: cp.reduce.async.bulk. This instruction copies data from smem to gmem and also performs the specified reduction operation. For split-K, we can use the regular TMA copy for the first CTA, then use this version afterwards to update the result. The last block will need to load the data back to registers in order to compute the epilogue. That can also be done with a TMA copy back to smem followed by ldmatrix, but we could also use vectorized loads if necessary.

Note that cp.reduce.async.bulk.add.f16 does the cast to float for us, so we could easily support half-precision reduction on Hopper without having to adjust vectorization factors manually (see #1719).
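A sketch of what the reduce-copy could look like in device code, assuming the PTX form of the bulk reduce-copy with `bulk_group` completion; `reduceAddSmemToGmem` is a hypothetical helper, and the operand constraints should be checked against the PTX ISA:

```cuda
// Hypothetical helper: accumulate an smem staging tile into the gmem
// split-K workspace with the Hopper bulk reduce-copy. Both addresses
// must be 16-byte aligned and `bytes` a multiple of 16.
__device__ void reduceAddSmemToGmem(
    float* gmem_dst, const float* smem_src, unsigned bytes) {
  asm volatile(
      "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.f32 "
      "[%0], [%1], %2;\n"
      :
      : "l"(gmem_dst),
        "r"(static_cast<unsigned>(__cvta_generic_to_shared(smem_src))),
        "r"(bytes)
      : "memory");
}
```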

Syncing

Ampere split-K does synchronous copies using volatile vectorized accesses, which makes it easy to do the synchronization required to serialize blocks. We can reuse the same grid serialization code, but we must avoid releasing the semaphore while asynchronous writes are still in flight. For this we need cp.async.bulk.commit_group and cp.async.bulk.wait_group 0, just as we do when circular-buffering loads with TMA. We can insert kir::AsyncCommit and kir::AsyncAwait nodes just before the kir::BlockSerializeRelease node.
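In lowered code, the drain-then-release sequence might look like this (a sketch; the semaphore handling here is a simplified stand-in for nvFuser's block serialization, not its actual codegen):

```cuda
// Sketch: ensure async bulk writes have completed before handing the
// workspace to the next CTA in the serial order.
__device__ void drainAndRelease(int* semaphore) {
  // kir::AsyncCommit: close the current bulk async group.
  asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
  // kir::AsyncAwait: wait until all committed groups have completed.
  asm volatile("cp.async.bulk.wait_group 0;\n" ::: "memory");
  __threadfence(); // make the gmem writes visible to other CTAs
  if (threadIdx.x == 0) {
    // kir::BlockSerializeRelease (simplified): signal the next CTA.
    atomicAdd(semaphore, 1);
  }
}
```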

Plan

I think the following things would be needed:

  1. Extend TensorView::requestSerialGridReduction() to also specify whether our existing method is used or the new TMA-based method.
  2. When lowering, allocate the work buffer the same way we do currently, but update the sync insertion stage to add the cp.async nodes as mentioned above.
  3. In the IR, create predicates for the first and last blocks in the reduction segment and generate an IfElse structure that avoids the add on the first and last blocks.
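Step 1 might look roughly like the following (all names besides `TensorView::requestSerialGridReduction()` are hypothetical; this stub only models the API shape, not the lowering):

```cpp
#include <cassert>

// Hypothetical method selector for serial grid reduction; the current
// API takes no such argument, so this enum is an assumed extension.
enum class SerialGridReductionMethod {
  VolatileVectorized, // existing Ampere-style volatile vectorized path
  TmaReduce,          // Hopper cp.reduce.async.bulk path
};

// Stub standing in for nvFuser's TensorView, for illustration only.
class TensorView {
 public:
  void requestSerialGridReduction(
      SerialGridReductionMethod method =
          SerialGridReductionMethod::VolatileVectorized) {
    method_ = method;
  }
  SerialGridReductionMethod serialGridReductionMethod() const {
    return method_;
  }

 private:
  SerialGridReductionMethod method_ =
      SerialGridReductionMethod::VolatileVectorized;
};
```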
naoyam commented 1 week ago

Maybe something to consider for reductions and normalizations too? @liqiangxl

jacobhinkle commented 1 week ago

On hold for now because we likely will not get to this by year end, but we should shoot for Q1.