eileen2003-w opened this issue 1 month ago
I wish for the same. I don't get why there's a Flash Attention 3 for Hopper GPUs that no ordinary consumer can get (yes, I know it's technically still in beta), but still no Turing support in Flash Attention 2.
It's because there are people willing to put in the work to make it work for Hopper. There have yet to be people contributing to make it work for Turing.
Just build a fallback to Flash Attention 1 into Flash Attention 2. It's so, so frustrating. When I pip install torch, my training program always falls back to CPU because "torch was not compiled with flash attention". When I uninstall that and get the new build with flash attention compiled, it says it's not supported on my platform (Turing and Windows).
What the? Why? Just why? At least let me use Flash Attention 1.
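For anyone stuck on Turing in the meantime, one rough workaround is to guard the attention call and fall back to PyTorch's built-in scaled_dot_product_attention when flash-attn isn't usable. This is just a sketch, assuming PyTorch 2.x; the `flash_attn_func` import is the flash-attn 2.x name, and the shapes follow its (batch, seqlen, nheads, headdim) convention:

```python
import torch
import torch.nn.functional as F

try:
    # flash-attn 2.x entry point; on Turing (or some Windows builds) either this
    # import or the kernel call itself is what ends up unavailable
    from flash_attn import flash_attn_func
    _HAS_FLASH_ATTN = True
except ImportError:
    _HAS_FLASH_ATTN = False


def attention(q, k, v, causal=False):
    # q, k, v: (batch, seqlen, nheads, headdim), fp16/bf16 on the GPU
    if _HAS_FLASH_ATTN and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        try:
            return flash_attn_func(q, k, v, causal=causal)
        except RuntimeError:
            # e.g. "FlashAttention only supports Ampere GPUs or newer"
            pass
    # fallback: SDPA wants (batch, nheads, seqlen, headdim), so transpose in and out
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=causal
    )
    return out.transpose(1, 2)
```

It's slower than real flash attention, but it at least keeps the computation on the GPU instead of dropping to CPU.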
I have already downloaded flash-attn 1.x (specifically flash-attn 1.0.8) because I currently only have a GPU with the Turing architecture (TITAN RTX). But what I need to run (a demo of a multimodal LLM) requires flash-attn 2.x; here is the corresponding code:

```python
from flash_attn import flash_attn_func as _flash_attn_func, flash_attn_varlen_func as _flash_attn_varlen_func
from flash_attn.bert_padding import pad_input as _pad_input, index_first_axis as _index_first_axis, unpad_input as _unpad_input

flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
```
Every time it runs, an error is thrown; it seems this code cannot be used with the 1.x version. I am currently very troubled by this issue, and it would be great if flash attention were available for GPUs with the Turing architecture.
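Until Turing support lands, one possible stopgap for code like that is a tiny shim module that exposes a flash-attn-2-style `flash_attn_func` backed by PyTorch SDPA, so the demo's import can be pointed at it instead. This is only a sketch under that assumption; `flash_attn_compat.py` and the wrapper below are hypothetical, not part of flash-attn, and it assumes PyTorch 2.1+ for the `scale=` argument:

```python
# flash_attn_compat.py -- hypothetical shim: a flash-attn-2-style flash_attn_func
# computed with PyTorch's built-in SDPA, for GPUs where flash-attn 2.x won't run.
import torch.nn.functional as F


def flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False):
    # same (batch, seqlen, nheads, headdim) layout as flash_attn 2.x
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2),
        k.transpose(1, 2),
        v.transpose(1, 2),
        dropout_p=dropout_p,
        scale=softmax_scale,  # None -> default 1/sqrt(headdim)
        is_causal=causal,
    )
    return out.transpose(1, 2)
```

Then the demo would do `from flash_attn_compat import flash_attn_func` instead of importing from `flash_attn`. `flash_attn_varlen_func` and the `bert_padding` helpers would still need equivalents, so this only covers the plain `flash_attn_func` path.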