Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Turing architecture support #542

Open SimJeg opened 9 months ago

SimJeg commented 9 months ago

Hello, just reopening this issue as I would love to use FA2 on T4 GPUs ^^

tridao commented 9 months ago

I haven't had much bandwidth to work on Turing.

SimJeg commented 9 months ago

@tridao for more context, I recently published a post on the current Kaggle LLM Science Exam competition (here) showing that it's possible to run a 70B model on a single T4 GPU. However, I am still limited by VRAM OOMs when using long contexts. There is already code for Llama 2 + HF (here), but it requires FA2 and thus does not work on T4 GPUs.
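For anyone hitting the same limitation, a minimal sketch of gating the attention backend on GPU compute capability (assumptions: a recent transformers release that accepts the `attn_implementation` argument, and an illustrative model id). FA2 kernels target Ampere (sm_80) and newer, while T4s report sm_75 (Turing):

```python
import torch
from transformers import AutoModelForCausalLM

def pick_attn_implementation() -> str:
    # FlashAttention-2 kernels target sm_80+ (Ampere and newer);
    # T4s are Turing (sm_75), so fall back to PyTorch SDPA there.
    major, minor = torch.cuda.get_device_capability()
    return "flash_attention_2" if (major, minor) >= (8, 0) else "sdpa"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # illustrative model id
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation=pick_attn_implementation(),
)
```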

david-macleod commented 9 months ago

@tridao I could take a look at implementing this, as I'm also keen for T4 support. Any obvious caveats that spring to mind?

jfpuget commented 9 months ago

I concur with SimJeg: enabling FA2 for Turing would give it massive exposure to the Kaggle community.

tridao commented 9 months ago

I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?

jfpuget commented 9 months ago

Awesome. Yes, just inference in that competition.

sumanthnallamotu commented 4 months ago

Hi, has there been any update on this?

tridao commented 4 months ago

> Hi, has there been any update on this?

No, I haven't had much time.

suicao commented 3 months ago

> I concur with SimJeg: enabling FA2 for Turing would give it massive exposure to the Kaggle community.

I'll have to bring this up again. There is a new $1M Kaggle competition that really needs the performance boost from flash attention: https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize

chuanzhubin commented 2 months ago

Many of our academic partners and individual researchers are using the 2080 Ti (Turing architecture). Please add support for this architecture to flash_attn.

Dampfinchen commented 2 months ago

Still no news?

tridao commented 2 months ago

Nope, I've had no bandwidth.

rationalism commented 2 months ago

OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):

https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py
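A rough usage sketch for that route, under the assumption that the tutorial file has been copied locally as `fused_attention.py` and that its `attention = _attention.apply` entry point takes `(q, k, v, causal, sm_scale)`; the signature has changed between Triton releases, so check the version you copy:

```python
import math
import torch

# Assumes the Triton tutorial has been copied locally as fused_attention.py;
# its entry point and argument order may differ across Triton versions.
from fused_attention import attention

batch, heads, seqlen, head_dim = 1, 32, 4096, 64
q, k, v = (
    torch.randn(batch, heads, seqlen, head_dim, device="cuda", dtype=torch.float16)
    for _ in range(3)
)

causal = True
sm_scale = 1.0 / math.sqrt(head_dim)

out = attention(q, k, v, causal, sm_scale)  # fused attention forward pass
print(out.shape)  # torch.Size([1, 32, 4096, 64])
```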