Closed: alexarmbr closed this 5 months ago
A well-designed swizzle layout, tiled copy, and mma should not have any extra indexing calculations. Can you please post your code? Or you can consult the SM80/SM70 mainloops we have in the 3.x collectives.
Here is the code. Right now my kernel is compute-bound (no longer memory-bound) and I think all these integer instructions are crowding out the issuing of tensor core instructions. I am using mma.sync.m16n8k8. I was originally going to write everything from scratch, but the index calculations/swizzling for the several layers of tiling got really tedious, so I decided to use CuTe layouts. Basically I am storing the data in shared memory in a tensor with a swizzled layout, and accessing each 16x8 / 8x8 tile like this inside the inner loop:
mma_tiles(make_coord(_,_), make_coord(make_coord(mma_k, mma_n), make_coord(warp_k, warp_n)));
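In context, the surrounding setup is roughly like the following (a simplified sketch with made-up tile sizes and swizzle parameters, not the actual kernel code):

#include <cute/tensor.hpp>
using namespace cute;

__device__ void slice_mma_tiles_sketch(half_t* smem_B)
{
  // Made-up 64x64 row-major shared-memory tile with a Swizzle<3,3,3> applied
  // to avoid bank conflicts (the swizzle parameters here are illustrative).
  auto smem_layout = composition(Swizzle<3,3,3>{},
                                 make_layout(make_shape(Int<64>{}, Int<64>{}),
                                             make_stride(Int<64>{}, Int<1>{})));
  Tensor B = make_tensor(make_smem_ptr(smem_B), smem_layout);

  // Carve into 16x8 mma tiles: mode 0 is the coordinate inside a tile,
  // mode 1 indexes which tile.
  Tensor B_mma_tiles = zipped_divide(B, make_shape(Int<16>{}, Int<8>{}));

  for (int mma_k = 0; mma_k < size<1,0>(B_mma_tiles); ++mma_k) {
    for (int mma_n = 0; mma_n < size<1,1>(B_mma_tiles); ++mma_n) {
      // One 16x8 tile per inner-loop step (the warp-level coordinates from the
      // snippet above are dropped here to keep the sketch small).
      Tensor B_tile = B_mma_tiles(make_coord(_,_), make_coord(mma_k, mma_n));
      // ... ldmatrix / mma.sync on B_tile ...
    }
  }
}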
I think I may write the index calculations for the inner loop from scratch to make sure I understand exactly what is going on.
Code link is dead
Apologies, try now!
The loop induction variables you're using as coordinates into the tensors are dynamic. This will lead to dynamic address generation code.
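For example (an illustrative sketch, not code from the collectives): with a runtime int coordinate the swizzled offset has to be recomputed with integer instructions on every iteration, whereas a compile-time Int<N>{} coordinate lets the offset fold to a constant. A fully unrolled loop with constant bounds can sometimes be folded by the compiler as well, but static coordinates make the folding explicit.

#include <cute/tensor.hpp>
using namespace cute;

template <class TiledTensor>
__device__ void inner_loop_sketch(TiledTensor B_mma_tiles)
{
  // Dynamic coordinate: mma_k is a runtime int, so the address through the
  // swizzled layout is generated at run time each iteration.
  for (int mma_k = 0; mma_k < size<1,0>(B_mma_tiles); ++mma_k) {
    Tensor B_tile = B_mma_tiles(make_coord(_,_), make_coord(mma_k, 0));
    // ... mma on B_tile ...
  }

  // Static coordinates: Int<N>{} is a compile-time integer, so each slice's
  // offset is a compile-time constant (written out by hand here; in practice
  // the loop is expressed so the indices stay static).
  Tensor B_tile0 = B_mma_tiles(make_coord(_,_), make_coord(_0{}, _0{}));
  Tensor B_tile1 = B_mma_tiles(make_coord(_,_), make_coord(_1{}, _0{}));
  // ... mma on B_tile0, B_tile1, ...
}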
For educational purposes I am working on writing an fp16 GEMM kernel that is as performant as cuBLAS HGEMM. I am using CuTe tensors/layouts to handle the index calculations and shared memory swizzling. My kernel is fairly performant (~35 GFLOP/s vs ~52 GFLOP/s for cuBLAS; this is on Turing tensor cores). Each thread accesses a different row of a 16x8 swizzled tile of data on each iteration of the inner loop, roughly like this, and then each thread performs one more slicing operation:
Tensor B_mma_tile = B_mma_tiles(make_coord(_,_), make_coord(mma_k, mma_n, warp_k, warp_n));
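(The extra per-thread slice is roughly the following; the lane-to-row mapping here is simplified, not exactly what the kernel does:)

int lane = threadIdx.x % 32;
Tensor B_row = B_mma_tile(lane % 16, _);   // each lane takes a different row of the 16x8 tile
// ... pack B_row into registers for mma.sync.m16n8k8 ...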
However, Nsight Compute suggests the biggest performance issue with my kernel is overhead from index calculations: lots of integer instructions are crowding out the issuing of memory loads and tensor core instructions.
In order to reach cuBLAS performance, would you recommend I remove the CuTe tensor from the inner loop of my kernel? My thought was to manually implement a swizzling pattern such that I can advance each thread to the next mma tile with a single xor (see here).
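Roughly the kind of thing I have in mind (a sketch with made-up constants and a placeholder for the per-lane swizzled offset, not working kernel code):

#include <cuda_fp16.h>

// Placeholder for this lane's swizzled byte offset within k-tile 0; the real
// pattern depends on the chosen swizzle and on ldmatrix's addressing rules.
__device__ unsigned lane_offset_in_tile(int lane)
{
  return (lane % 16) * 8 * sizeof(__half);   // un-swizzled stand-in
}

// Assumes the k-tiles sit in one contiguous, power-of-two-sized region, so the
// tile index occupies its own bit field of the shared-memory offset.
__device__ void xor_advance_sketch(const __half* smem_base)
{
  constexpr unsigned kTileBytes = 16 * 8 * sizeof(__half);   // one 16x8 fp16 tile
  int lane = threadIdx.x % 32;
  unsigned offset = lane_offset_in_tile(lane);

  #pragma unroll
  for (int k = 0; k < 4; ++k) {
    const __half* tile_ptr = reinterpret_cast<const __half*>(
        reinterpret_cast<const char*>(smem_base) + offset);
    // ... ldmatrix / mma.sync.m16n8k8 using tile_ptr ...

    // Tile k starts at k * kTileBytes, so stepping k -> k+1 only flips the bits
    // of (k ^ (k + 1)) * kTileBytes; with the loop unrolled this is one XOR
    // with an immediate and no other index arithmetic.
    offset ^= (k ^ (k + 1)) * kTileBytes;
  }
}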