On Ampere, we schedule split-K by first unswizzling to smem (usually); then each CTA in turn performs vectorized stores and loads to a temporary buffer in a deterministic order to perform the sum.
Hopper introduces a mechanism that simplifies this whole process with a modified TMA instruction: `cp.reduce.async.bulk`. This instruction copies data from smem to gmem and also performs a specified reduction operation. For split-K, we can use the regular TMA copy for the first CTA, then use this reducing version for every subsequent CTA to update the result. The last block will need to load the data back to registers in order to compute the epilogue. That can be done with a TMA copy back to smem followed by `ldmatrix`, but we could also use vectorized loads if necessary.
Note that `cp.reduce.async.bulk.add.f16` does the casting to float for us, so we could easily support half-precision reduction on Hopper without having to mess with vectorization factors manually (see #1719).
## Syncing
Ampere split-K does synchronous copies using volatile vectorized accesses, which makes it easy to do the synchronization required to serialize blocks. We can reuse the same grid-serialization code, but we must avoid releasing the semaphore while asynchronous writes are still in flight. For this we need to use `cp.async.bulk.commit_group` and `cp.async.bulk.wait_group 0`, just as we do when circular-buffering loads with TMA. We can simply insert `kir::AsyncCommit` and `kir::AsyncAwait` nodes just before the `kir::BlockSerializeRelease` node.
## Plan
I think the following things would be needed:
1. Extend `TensorView::requestSerialGridReduction()` to also specify whether our existing method or the new TMA-based method is used.
2. When lowering, allocate the work buffer in the same way we do currently, but update the sync-insertion stage to add the `cp.async` nodes as mentioned above.
3. In the IR, create predicates for the first and last blocks in the reduction segment and generate an IfElse structure that avoids the reduce-add on the first and last blocks.