Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Hope one day flash-attention can support T4 GPU #887

Open hit56 opened 5 months ago

dijkstrabc commented 4 months ago

Really need Turing support. Is there a way we can do this on our own? /sigh

aleksanderhan commented 4 months ago

I'm signing on to this one. I need to run inference on T4 GPUs.

AvivSham commented 4 months ago

+1

rationalism commented 3 months ago

OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):

https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py
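In case it helps anyone, here is a rough sketch of calling that tutorial kernel. It assumes the tutorial file has been copied locally as fused_attention.py (my name, not anything the repo ships), and that `attention` takes `(q, k, v, causal, sm_scale)`; the exact arguments have changed across Triton releases, so adjust to your version.

```python
# Sketch only: assumes 06-fused-attention.py from the Triton repo was copied
# locally as fused_attention.py, and that `attention` accepts (q, k, v, causal,
# sm_scale) -- the signature has varied between Triton releases.
import math
import torch
from fused_attention import attention  # the tutorial defines attention = _attention.apply

batch, heads, seq_len, head_dim = 2, 8, 1024, 64   # head_dim should be a power of two
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attention(q, k, v, True, 1.0 / math.sqrt(head_dim))  # causal attention
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```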

AvivSham commented 3 months ago

@rationalism FlashAttention 1.x works on Turing GPUs; the problem is with version 2 or newer. Please check the following line: it looks like Turing is not supported (or it is a typo).
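For anyone stuck on T4s in the meantime, here is a minimal sketch of dispatching at runtime on compute capability: FlashAttention-2 needs SM 8.0 (Ampere) or newer, while the T4 is SM 7.5 (Turing), so the fallback path uses PyTorch's built-in scaled_dot_product_attention. The helper name and the assumed (batch, seqlen, nheads, headdim) input layout are my own choices, not from this repo.

```python
# Sketch of a runtime fallback: use FlashAttention-2 only on SM 8.0+ (Ampere or
# newer); on Turing (e.g. T4, SM 7.5) fall back to PyTorch's fused SDPA kernels.
import torch
import torch.nn.functional as F

def attention_any_gpu(q, k, v, causal=True):
    """q, k, v: (batch, seqlen, nheads, headdim) fp16/bf16 CUDA tensors (assumed layout)."""
    major, _ = torch.cuda.get_device_capability(q.device)
    if major >= 8:
        # Ampere or newer: FlashAttention-2 kernels are available.
        from flash_attn import flash_attn_func
        return flash_attn_func(q, k, v, causal=causal)
    # Turing and older: PyTorch SDPA expects (batch, nheads, seqlen, headdim).
    q_, k_, v_ = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q_, k_, v_, is_causal=causal)
    return out.transpose(1, 2)
```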

rationalism commented 3 months ago

@AvivSham I saw that, but it does in fact seem to work on my machine. [Screenshot from 2024-05-01 09-04-33]

eugeneswalker commented 2 months ago

Yes, please support Tesla T4!