facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

CUTLASS Fused multi head attention #1112

Open yoon5862 opened 1 month ago

yoon5862 commented 1 month ago

❓ Questions and Help

Hello, I am looking at the fused multi-head attention code in 3rdparty/cutlass. In cutlass/examples, the fused multi-head attention code was upstreamed from xformers, and CUTLASS says this fused multi-head attention example is the same as Flash-Attention 2. Is it true that the CUTLASS fused multi-head attention kernel and the Flash-Attention 2 kernel are the same thing? Thank you.

danthe3rd commented 1 month ago

> CUTLASS says this fused multi-head attention example is the same as Flash-Attention 2.

I believe those are not the same thing. Where did you see that? Flash-Attention 2 is built using the CUTLASS library, but what we call the "cutlass" implementation in xFormers, and what is in cutlass/examples, is something else.
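
For context, xFormers exposes these as separate operators that you can select explicitly. A minimal sketch, assuming a CUDA build of xFormers where both backends are available (the tensor shapes and tolerances here are just illustrative):

```python
import torch
import xformers.ops as xops

# Batched attention inputs in xFormers' [batch, seq_len, heads, head_dim] layout.
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force the CUTLASS-based kernel (the one shared with cutlass/examples).
out_cutlass = xops.memory_efficient_attention(
    q, k, v, op=(xops.fmha.cutlass.FwOp, xops.fmha.cutlass.BwOp)
)

# Force Flash-Attention (a different implementation, itself built on CUTLASS).
out_flash = xops.memory_efficient_attention(
    q, k, v, op=(xops.fmha.flash.FwOp, xops.fmha.flash.BwOp)
)

# Both backends compute the same attention math, so the outputs should agree
# within fp16 tolerance even though the kernels differ.
assert torch.allclose(out_cutlass, out_flash, atol=1e-3, rtol=1e-3)
```

If only one of the backends is usable in your build, the corresponding call will raise an error explaining why it cannot run.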

yoon5862 commented 1 month ago

Thank you for the reply. In the CUTLASS examples it is said that the code was upstreamed from xformers:

> Acknowledgement: Fixed-sequence-length FMHA code was upstreamed by Meta xFormers (https://github.com/facebookresearch/xformers).

Therefore, I think xFormers uses its own custom CUTLASS kernel and tunes that kernel's settings toward the optimal configuration.
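
If it helps to verify, the two backends are registered under distinct operator names inside xFormers, which suggests they are maintained as separate kernels. A minimal check (the exact NAME strings vary across versions, so the values in the comments are only illustrative):

```python
from xformers.ops import fmha

# Each backend is a separately named operator class inside xformers.ops.fmha.
print(fmha.cutlass.FwOp.NAME)  # e.g. "cutlassF" (version-dependent)
print(fmha.flash.FwOp.NAME)    # e.g. "flshattF@2.x" when flash-attn is built in
```

Running `python -m xformers.info` also lists every memory_efficient_attention backend and whether it is usable on the current machine.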