TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

Add a simple FlashAttention based on Back2Back GEMM. #122

Closed: KuangjuX closed this issue 2 months ago

KuangjuX commented 3 months ago

The current FlashAttention implementation (PR #123) relies on relatively fine-grained atomic instructions. To better match the intuition behind data flow analysis, we need to weigh the granularity and performance of each atomic instruction, and whether each one is justified.
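
For discussion, here is a minimal CPU sketch of FlashAttention written as back-to-back GEMMs with an online softmax in between. It assumes that "atomic instructions" refers to tile-level primitives such as a GEMM, a row-wise max, a row-wise sum, and a rescale; that reading is my assumption and is not confirmed by PR #123. The code is plain Python and does not use the TiledCUDA API; every function name is hypothetical, and the usual `1/sqrt(d)` score scaling is omitted for brevity.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain GEMM on lists of rows: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def naive_attention(q, k, v):
    """Reference: softmax(Q K^T) V with the full score matrix materialized."""
    p = []
    for row in matmul(q, transpose(k)):
        row_max = max(row)
        e = [math.exp(x - row_max) for x in row]
        z = sum(e)
        p.append([x / z for x in e])
    return matmul(p, v)

def flash_attention(q, k, v, tile_n):
    """Back-to-back GEMM over K/V tiles, keeping only per-row running state."""
    rows, d_v = len(q), len(v[0])
    m = [-math.inf] * rows                      # running row maximum
    l = [0.0] * rows                            # running row sum of exponentials
    acc = [[0.0] * d_v for _ in range(rows)]    # unnormalized output accumulator
    for start in range(0, len(k), tile_n):
        k_tile = k[start:start + tile_n]
        v_tile = v[start:start + tile_n]
        s = matmul(q, transpose(k_tile))        # first GEMM: S = Q @ K_tile^T
        p = []
        for i, row in enumerate(s):
            m_new = max(m[i], max(row))         # row-max reduction
            scale = math.exp(m[i] - m_new)      # correction for the old state
            p_row = [math.exp(x - m_new) for x in row]
            l[i] = l[i] * scale + sum(p_row)    # row-sum reduction
            acc[i] = [a * scale for a in acc[i]]  # rescale previous output
            m[i] = m_new
            p.append(p_row)
        pv = matmul(p, v_tile)                  # second GEMM: P @ V_tile
        for i in range(rows):
            acc[i] = [a + b for a, b in zip(acc[i], pv[i])]
    return [[a / l[i] for a in acc[i]] for i in range(rows)]
```

Each commented step inside the tile loop is one candidate primitive. Fusing, say, the row-max, exponentiation, and row-sum into a single coarser operation would change the granularity without changing the result, which is the kind of trade-off this issue asks about.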