NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Load mma operands to shared memory with TMA #3320

Closed: rdspring1 closed this 1 week ago

rdspring1 commented 3 weeks ago

This PR modifies schedulePrologues to use TMA loads to move mma operands to shared memory. Stacked on https://github.com/NVIDIA/Fuser/pull/3324 and https://github.com/NVIDIA/Fuser/pull/3310.

Details

  1. Load input operands into shared memory via the CpAsyncBulkTensorTile LoadStoreOp.
  2. Replace the LdMatrix operation with a basic Set.
  3. Modify scheduleOperandSmemStores to apply swizzling to avoid bank conflicts.
  4. Refactor swizzleSharedMemory by moving the analysis component into a separate function named analyzeSwizzleSharedMemory.
  5. Create a tmaSwizzleSharedMemory function that uses analyzeSwizzleSharedMemory and then selects the appropriate TMA swizzle format (see the sketch after this list).
  6. Disable loop rotation; there is an issue with TMA loads and circular buffering, and it is unclear whether loop rotation is needed for Hopper matmul.
  7. The Hopper matmul tests are expected to give incorrect results for now.
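As a rough illustration of item 5, here is a minimal sketch (not nvFuser's tmaSwizzleSharedMemory) of mapping an analyzed shared-memory swizzle period to the TMA swizzle formats exposed by the CUDA driver API; the helper name and the bytes-based interface are assumptions:

```cpp
#include <cuda.h>

// Hypothetical helper (illustrative, not nvFuser's tmaSwizzleSharedMemory):
// map the swizzle period chosen for the shared-memory operand tile to the
// corresponding CUtensorMapSwizzle value from the CUDA 12 driver API.
inline CUtensorMapSwizzle toTmaSwizzle(int swizzle_period_bytes) {
  switch (swizzle_period_bytes) {
    case 128:
      return CU_TENSOR_MAP_SWIZZLE_128B;
    case 64:
      return CU_TENSOR_MAP_SWIZZLE_64B;
    case 32:
      return CU_TENSOR_MAP_SWIZZLE_32B;
    default:
      return CU_TENSOR_MAP_SWIZZLE_NONE; // no bank-conflict-avoiding swizzle
  }
}
```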
rdspring1 commented 2 weeks ago

!test

rdspring1 commented 2 weeks ago

> how will we handle partial vectorization?

Do you mean when the tensor is not 16B aligned? You can overcopy with TMA, cp.async, or regular LDG + STS.

jacobhinkle commented 2 weeks ago

> how will we handle partial vectorization?

> Do you mean when the tensor is not 16B aligned? You can overcopy with TMA, cp.async, or regular LDG + STS.

Yeah, exactly. If we had K=60 as the inner dimension of each operand, the Ampere scheduler has to handle the operands differently when generating the kernel, since the cp.async call can then only do 4-element reads instead of 8-element reads. But I don't see where that kind of alignment analysis comes in when using TMA; will TMA handle misaligned boxes dynamically using the same compiled kernel as for fully aligned inputs?

EDIT: is this computed on the host side in the TMA descriptor?

rdspring1 commented 2 weeks ago

TMA should automatically handle the K=60 case by filling the out-of-bounds accesses. If the tensor is not 16B aligned, TMA will fail and you need to fall back to regular LDG + STS accesses.
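To make the host-side part concrete (re the EDIT above), here is a hedged sketch of encoding a 2D tiled TMA descriptor for a row-major [M, K] fp16 operand with the CUDA driver API. The box size, swizzle choice, and fill setting are illustrative assumptions, not what this PR emits; the comments note the 16B requirements mentioned above.

```cpp
#include <cuda.h>

// Illustrative only: encode a 2D tiled TMA descriptor for a row-major
// [M, K] fp16 operand. A box that hangs past the tensor edge is filled
// (zeros with the OOB_FILL_NONE setting), which is how a K that is not a
// multiple of the box size can still be loaded with one compiled kernel.
CUresult makeOperandTensorMap(
    CUtensorMap* map, void* gmem_base, cuuint64_t M, cuuint64_t K) {
  const cuuint64_t global_dim[2] = {K, M};       // innermost dimension first
  const cuuint64_t global_strides[1] = {K * 2};  // row stride in bytes (fp16)
  const cuuint32_t box_dim[2] = {64, 64};        // smem tile moved per TMA copy
  const cuuint32_t elem_strides[2] = {1, 1};
  // Hard requirements: gmem_base must be 16B aligned and every entry of
  // global_strides must be a multiple of 16B; otherwise encoding fails and a
  // fallback path (cp.async or LDG + STS) is needed, as discussed above.
  return cuTensorMapEncodeTiled(
      map,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/2,
      gmem_base,
      global_dim,
      global_strides,
      box_dim,
      elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B,
      CU_TENSOR_MAP_L2_PROMOTION_NONE,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
}
```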

rdspring1 commented 2 weeks ago

!test

jacobhinkle commented 2 weeks ago

Looks like you just need to guard AmpereMatmulBroadcastBatch. I noticed I needed this in #3278 but I was too lazy to merge that upstream to this PR for you. https://github.com/NVIDIA/Fuser/pull/3278/files#diff-64fc4e7bfbc5b9f95ac3dc5823bd99b683b048926805c13310ce6a8ef8032289R147-R148

rdspring1 commented 2 weeks ago

!test