Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Turing architecture support #542

Open SimJeg opened 1 year ago

SimJeg commented 1 year ago

Hello, just reopening this issue as I would love to use FA2 on T4 GPUs ^^

tridao commented 1 year ago

I haven't had much bandwidth to work on Turing.

SimJeg commented 1 year ago

@tridao for more context, I recently published a post on the current Kaggle LLM Science Exam competition (here) showing that it's possible to run a 70B model on a single T4 GPU. However, I am still limited by VRAM OOM errors when using long contexts. There is already code for Llama 2 + HF (here), but it requires FA2 and thus does not work on T4 GPUs.

david-macleod commented 1 year ago

@tridao I could take a look at implementing this, as I'm also keen on T4 support. Any obvious caveats that spring to mind?

jfpuget commented 1 year ago

I concur with SimJeg; enabling FA2 for Turing would give it massive exposure in the Kaggle community.

tridao commented 1 year ago

I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?

jfpuget commented 1 year ago

Awesome. Yes, just inference in that competition.

sumanthnallamotu commented 9 months ago

Hi, has there been any update on this?

tridao commented 9 months ago

> Hi, has there been any update on this?

No, I haven't had much time.

suicao commented 7 months ago

> I concur with SimJeg; enabling FA2 for Turing would give it massive exposure in the Kaggle community.

I'll have to bring this up again: there is a new $1M Kaggle competition that really needs the performance boost from flash attention: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize

chuanzhubin commented 7 months ago

Many of our academic partners and individual researchers are using 2080 Ti GPUs (Turing architecture). Please support this architecture in flash_attn.

Dampfinchen commented 6 months ago

Still no news?

tridao commented 6 months ago

Nope, I've had no bandwidth.

rationalism commented 6 months ago

OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):

https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py
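For anyone wanting to try that route, here is a minimal sketch (not from the tutorial itself), assuming the tutorial file has been copied locally as `fused_attention.py` and that it exposes `attention(q, k, v, causal, sm_scale)` on fp16 CUDA tensors shaped `(batch, heads, seq_len, head_dim)`; check the copied file for the exact interface, since it changes between Triton versions:

```python
# Sketch of calling the Triton fused-attention tutorial kernel on a Turing GPU.
# Assumes 06-fused-attention.py was copied locally as fused_attention.py and
# exposes `attention = _attention.apply` with signature (q, k, v, causal, sm_scale).
import torch
from fused_attention import attention  # hypothetical local copy of the tutorial

batch, heads, seq_len, head_dim = 1, 16, 2048, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

sm_scale = head_dim ** -0.5          # standard 1/sqrt(d) softmax scaling
out = attention(q, k, v, True, sm_scale)  # causal=True for decoder-style inference
print(out.shape)                     # (batch, heads, seq_len, head_dim)
```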

Seedmanc commented 2 months ago

So does this mean the 2070S supports at least FlashAttention 1? Is that the same as SDPA? I was under the impression that my GPU had no luck with any kind of flash attention, and the kohya_ss trainer keeps saying "Torch was not compiled with flash attention" even though I enabled SDPA and it is indeed faster. I wonder if I should even bother looking into this.
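For context, SDPA here refers to PyTorch's `torch.nn.functional.scaled_dot_product_attention`, which dispatches to one of several backends (flash, memory-efficient, or plain math); on GPUs where the built-in flash kernel isn't available, it typically falls back to the memory-efficient kernel, which is why SDPA can still be faster even when the flash path is missing. Below is a minimal sketch for checking which backend actually runs on a given GPU, using the older `torch.backends.cuda.sdp_kernel` context manager (newer PyTorch releases expose `torch.nn.attention.sdpa_kernel` instead):

```python
# Sketch: probe whether the built-in flash SDPA kernel can run on this GPU.
# On Turing (sm_75) the flash backend is typically unavailable and SDPA falls
# back to the memory-efficient backend, which is still faster than naive attention.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# These report whether each backend is *enabled* as a setting (not hardware support).
print("flash enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math enabled:         ", torch.backends.cuda.math_sdp_enabled())

# Restrict SDPA to the flash backend only; if the GPU can't run it, the call errors out.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    try:
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
        print("flash kernel ran on this GPU")
    except RuntimeError as e:
        print("flash kernel unavailable:", e)
```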