KnowingNothing / MatmulTutorial

An Easy-to-Understand TensorOp Matmul Tutorial
Apache License 2.0

TensorOp Matmul Tutorial

This is an example repo of CUDA MatMul implementations. The aim of this repo is to provide some insights into high-performance kernel design for CUDA beginners. Currently, I only provide some implementation examples in examples/matmul/this. Contributions of more kernels and other MatMul implementations are highly welcome.

About

There is a detailed explanation of the different versions of the MatMul kernels in examples/matmul/this.


Performance Results

Performance on H800 GPU

The current version achieves, on average, only 70% of the performance of cuBLAS. I am still working on improving it.

Performance on A100 GPU

A100-GEMM-perf

The overall performance comparison among Relay, cuBLAS, CUTLASS, TensorIR, Triton, and our implementations. The y-axis is the speedup relative to Relay+CUTLASS.

Overall, the geometric mean speedups are 1.73x over Relay+CUTLASS, 1.22x over TensorIR (1000 tuning trials per case with MetaSchedule), 1.00x over cuBLAS, 0.999x over CUTLASS, and 1.07x over Triton. The 61 shapes are:

No. M N K
1 5376 5376 2048
2 5376-128 5376 2048
3 5376-2*128 5376 2048
... ... ... ...
11 5376-10*128 5376 2048
12 5376+128 5376 2048
13 5376+2*128 5376 2048
... ... ... ...
21 5376+10*128 5376 2048
22 5376 5376-128 2048
23 5376 5376-2*128 2048
... ... ... ...
31 5376 5376-10*128 2048
32 5376 5376+128 2048
33 5376 5376+2*128 2048
... ... ... ...
41 5376 5376+10*128 2048
42 5376 5376 2048-128
43 5376 5376 2048-2*128
... ... ... ...
51 5376 5376 2048-10*128
52 5376 5376 2048+128
53 5376 5376 2048+2*128
... ... ... ...
61 5376 5376 2048+10*128
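As a concrete reading of the shape grid above, here is a short Python sketch (the variable names and example ratios are ours, not from the repo) that enumerates the 61 (M, N, K) shapes and shows how a geometric-mean speedup over such a set would be computed:

```python
import math

# The 61 benchmark shapes: the base (5376, 5376, 2048), plus -k*128 and
# +k*128 sweeps (k = 1..10) applied to each of M, N, and K in turn.
BASE = (5376, 5376, 2048)
STEP = 128

shapes = [BASE]                 # shape No. 1: the base case
for dim in range(3):            # vary M, then N, then K
    for sign in (-1, +1):       # the -k*128 block, then the +k*128 block
        for k in range(1, 11):
            s = list(BASE)
            s[dim] += sign * k * STEP
            shapes.append(tuple(s))

assert len(shapes) == 61

# The reported numbers are geometric means of per-shape speedups: with
# per-shape ratios r_i, the geometric mean is exp(mean(log r_i)).
def geomean(ratios):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

The geometric mean is the usual choice for averaging speedup ratios, since it treats a 2x speedup and a 0.5x slowdown as canceling out (their geometric mean is exactly 1.0).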

MLIR Generated CUDA kernels

I also use MLIR to generate MatMul kernels; the generated ones are in examples/matmul/mlir-gen. Their performance relative to the handwritten kernels (examples/matmul/this) is shown below. As the MLIR-generated kernels implement only part of the optimizations used by the handwritten ones, we call the MLIR-generated versions partial and the handwritten ones full.

Overall, the MLIR-generated versions achieve 86% of the performance of the handwritten kernels.

Plan

More kernels

I plan to implement kernels for other operators, such as softmax, in the future.

Use CUTLASS in implementation

There is a plan to use the CuTe interface of CUTLASS to implement high-performance kernels.