Open KuangjuX opened 1 month ago
Back2Back GEMM is similar to GEMM, but it chains two matrix multiplications at the register level: it first multiplies the two input matrices A and B, and then multiplies that intermediate result by a third matrix C. It's worth noting that A and B are mapped differently from C. For A and B, the k dimension is split over time (iterated sequentially), while for C, the p dimension is mapped to thread blocks (spatially), so the two multiplications have different nested loop structures.
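A minimal sketch of that loop nest in plain Python may make the two mappings easier to see. The shapes (A is [M, K], B is [K, N], C is [N, P]), the names `b2b_gemm`, `tile_k`, `tile_p`, and the choice to keep the whole intermediate in one `acc` buffer are assumptions for illustration, not the kernel proposed in this issue. The outer loop over p stands in for thread blocks, and `acc` stands in for the register-resident result of A @ B.

```python
def matmul(x, y):
    """Plain reference matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def b2b_gemm(A, B, C, tile_k=2, tile_p=2):
    """Compute D = (A @ B) @ C with a back-to-back loop nest.

    Assumed shapes: A is [M, K], B is [K, N], C is [N, P], D is [M, P].
    """
    M, K, N, P = len(A), len(B), len(C), len(C[0])
    D = [[0] * P for _ in range(M)]
    # p is the spatial dimension: each p-tile stands in for one thread block.
    for p0 in range(0, P, tile_p):
        # First GEMM: k is the temporal dimension, walked tile by tile.
        # acc plays the role of the register-resident intermediate A @ B.
        acc = [[0] * N for _ in range(M)]
        for k0 in range(0, K, tile_k):
            for m in range(M):
                for n in range(N):
                    for k in range(k0, min(k0 + tile_k, K)):
                        acc[m][n] += A[m][k] * B[k][n]
        # Second GEMM: consume acc directly, without writing it back to memory.
        for m in range(M):
            for p in range(p0, min(p0 + tile_p, P)):
                D[m][p] = sum(acc[m][n] * C[n][p] for n in range(N))
    return D
```

In this sketch every p-tile recomputes A @ B, which mirrors independent thread blocks that each keep their own copy of the intermediate; a real kernel would also tile m and n.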
Back2Back GEMM is an important kernel because it is the core of flash attention, so we need to analyze its dataflow and generate the kernel from that dataflow.