Open xenshinu opened 3 months ago
❓ Questions and Help

Hi, I did some simple profiling of the xformers CUTLASS implementation of attention against the flash-attention op. I assumed they were the same algorithm with different implementations on Ampere+, but the runtimes are very different: the cutlass op is almost 2x slower than the flash op. Both were run on an A40. The code is as below.
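Roughly the following minimal sketch (the exact script isn't reproduced here; the shapes, dtype, and iteration counts are illustrative, and `bench_ms` is just a helper name):

```python
import torch
import xformers.ops as xops

# Illustrative shapes: batch, sequence length, heads, head dim.
# The flash backend needs fp16/bf16 inputs, so both ops are timed in fp16.
B, S, H, D = 8, 2048, 16, 64
q = torch.randn(B, S, H, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def bench_ms(op, iters=100):
    # Warm up so one-time setup doesn't skew the timing.
    for _ in range(10):
        xops.memory_efficient_attention(q, k, v, op=op)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        xops.memory_efficient_attention(q, k, v, op=op)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

print(f"cutlass: {bench_ms(xops.MemoryEfficientAttentionCutlassOp):.3f} ms")
print(f"flash:   {bench_ms(xops.MemoryEfficientAttentionFlashAttentionOp):.3f} ms")
```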
I also tested different batch sizes, with similar results. BTW, I think flash-attn is also written with CUTLASS, so what is the difference (other than the xformers op also running on Pascal+)?

I'm not an expert on this, but why do you assume that the two backends should perform exactly the same? I believe the reason we have multiple backends is precisely so the most performant one can be picked in each situation. FlashAttention has been continuously optimized, so I'm not too surprised that it performs better.

Well, I saw the comment https://github.com/facebookresearch/xformers/issues/950#issuecomment-1864793941 saying that the cutlass op is only "a bit slower", so I was thinking they were the same algorithm with different implementations. Thanks for your reply.

They are indeed the same algorithm in terms of mathematical operations, but the way the work is parallelized and scheduled is a bit different. Plus, implementation details matter a lot when optimizing CUDA kernels :)
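As an aside, for anyone comparing backends like this: running `python -m xformers.info` should list which memory_efficient_attention implementations are available in a given build, and leaving `op=None` lets the dispatcher pick one automatically.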