TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

Add flash attention based on b2b GEMM #16

Closed KuangjuX closed 2 months ago

KuangjuX commented 7 months ago

I would suggest preparing the implementation of the flash attention algorithm (I prefer calling it a parallel algorithm).

I think flash attention has many implications for thinking about how to schedule an efficient DNN computational process, since it combines several elements: reusing the TCU's output at the register level, warp-level reduction, element-wise operations, and the arrangement of warps, among others.

Preparing the implementation first will let us see how to organize the structure of the computational process.
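
To make that structure concrete, here is a minimal scalar C++ reference sketch of the back-to-back GEMM view of flash attention: the first GEMM (Q·Kᵀ over one K/V tile) feeds an online softmax whose rescaling factor corrects the running accumulator of the second GEMM (P·V). The function name, tile size, and flat row-major layout are illustrative assumptions, not TiledCUDA's API; the per-tile rescaling is the part a warp-level kernel would keep in registers.

```cpp
// Scalar reference sketch of flash attention as two fused GEMMs with an
// online softmax. Names and layout are assumptions for illustration only.
#include <algorithm>
#include <cmath>
#include <vector>

// Q: [M, d], K: [N, d], V: [N, d], O: [M, d], all row-major.
// kTileN is the K/V tile size (an arbitrary illustrative choice).
void flash_attention_ref(const std::vector<float>& Q, const std::vector<float>& K,
                         const std::vector<float>& V, std::vector<float>& O,
                         int M, int N, int d, int kTileN = 64) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    for (int i = 0; i < M; ++i) {                 // one query row at a time
        float m = -INFINITY;                      // running row maximum
        float l = 0.0f;                           // running softmax denominator
        std::vector<float> acc(d, 0.0f);          // running unnormalized output row
        for (int n0 = 0; n0 < N; n0 += kTileN) {  // iterate over K/V tiles
            int n1 = std::min(n0 + kTileN, N);
            // GEMM 1: s = Q[i, :] * K[n0:n1, :]^T, kept per tile (in a real
            // kernel these scores stay in the TCU's output registers).
            std::vector<float> s(n1 - n0);
            float tile_max = -INFINITY;
            for (int j = n0; j < n1; ++j) {
                float dot = 0.0f;
                for (int k = 0; k < d; ++k) dot += Q[i * d + k] * K[j * d + k];
                s[j - n0] = dot * scale;
                tile_max = std::max(tile_max, s[j - n0]);
            }
            // Online softmax: if the running max grows, rescale the previous
            // denominator and accumulator before adding this tile.
            float m_new = std::max(m, tile_max);
            float correction = std::exp(m - m_new);
            l *= correction;
            for (int k = 0; k < d; ++k) acc[k] *= correction;
            // GEMM 2: acc += exp(s - m_new) * V[n0:n1, :].
            for (int j = n0; j < n1; ++j) {
                float p = std::exp(s[j - n0] - m_new);
                l += p;
                for (int k = 0; k < d; ++k) acc[k] += p * V[j * d + k];
            }
            m = m_new;
        }
        // Final normalization by the accumulated softmax denominator.
        for (int k = 0; k < d; ++k) O[i * d + k] = acc[k] / l;
    }
}
```

In a tiled CUDA kernel the outer loop over query rows maps to blocks/warps and the rescale-then-accumulate step is where warp reduction, element-wise exponentiation, and register-level reuse of the first GEMM's output all meet, which is exactly the scheduling question this issue is about.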