Open thusinh1969 opened 2 months ago
it's supported
Algorithm 1: FlashAttention-3 forward pass without intra-consumer overlapping
Algorithm 2: FlashAttention-3 consumer warpgroup forward pass
Is the current implementation Algorithm 2? Is an implementation of Algorithm 1 available? I would like to make a comparison. Could you please provide the complete code for Algorithm 1? Thanks!
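For comparison purposes, the math that both algorithms compute (tiled attention with an online softmax) can be sketched in plain numpy. This is only a reference sketch of the non-overlapped forward pass; the block size, names, and the absence of causal masking are simplifications, not the repo's actual CUDA code:

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=2):
    # Reference sketch of the tiled online-softmax forward pass.
    # Ignores all Hopper-specific warpgroup/pipelining details; it only
    # mirrors the math of the non-overlapped algorithm.
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)
    for i in range(0, seq_len, block_size):
        Qi = Q[i:i + block_size]
        Oi = np.zeros_like(Qi)
        m = np.full(Qi.shape[0], -np.inf)  # running row maximum
        l = np.zeros(Qi.shape[0])          # running softmax denominator
        for j in range(0, seq_len, block_size):
            S = Qi @ K[j:j + block_size].T * scale
            m_new = np.maximum(m, S.max(axis=1))
            correction = np.exp(m - m_new)  # rescale previous partial sums
            P = np.exp(S - m_new[:, None])
            l = l * correction + P.sum(axis=1)
            Oi = Oi * correction[:, None] + P @ V[j:j + block_size]
            m = m_new
        O[i:i + block_size] = Oi / l[:, None]
    return O
```

The output matches a direct softmax-attention computation, which is what makes block-by-block accumulation with running max/denominator corrections equivalent to the untiled version.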
Hi @tridao,
I recently installed the latest version of FlashAttention using the following command:
pip install -U https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
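For what it's worth, one way to rule out a stale or shadowed install is to check which distribution version Python actually sees, using the standard-library `importlib.metadata` (a minimal sketch; the helper name is my own):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(pkg: str):
    # Returns the installed distribution's version string, or None
    # if no distribution with that name is installed.
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None

print(installed_version("flash-attn"))  # version string, or None if missing
```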
I am using AutoModelForCausalLM from the Hugging Face Transformers library, which I've also upgraded to the latest version. However, I'm still noticing some unexpected results during inference.
Before I dive into debugging other parts of my code, I want to confirm whether the version of FlashAttention I've installed is fully compatible with the latest version of Hugging Face Transformers. Could this version mismatch be causing the issues I'm seeing?
Thanks for your help!
Idk anything about HF transformers
Thank you for your response. May I then assume that the version I installed supports the soft capping operation?
@tridao Is the current implementation Algorithm 2? Is an implementation of Algorithm 1 available? I would like to make a comparison. Could you please provide the complete code for Algorithm 1? Thanks!
Great product, TriDao (you're so talented, my friend).
Will flash-attn support Gemma-2 soft capping anytime soon? We are impressed by Gemma-2's quality and would stick with it if the context length can be expanded. Unsloth has some options, but they support only a single GPU.
Thanks much, my friend. Cheers, Nguyên