SimJeg opened 9 months ago
I haven't had much bandwidth to work on Turing.
@tridao for more context, I recently published a post on the current Kaggle LLM science exam competition (here) showing that it's possible to run a 70B model on a single T4 GPU. However, I am still limited by VRAM OOM errors when using long contexts. There is already code for Llama2 + HF (here), but it requires FA2 and thus does not work on T4 GPUs.
@tridao I could take a look at implementing this as also keen for T4 support, any obvious caveats that spring to mind?
I concur with SimJeg; enabling FA2 for Turing would give massive exposure to the Kaggle community.
I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?
Awesome. Yes, just inference in that competition.
Hi, has there been any update on this?
No, I haven't had much time.
I'll have to bring this up again: there is a new $1M Kaggle competition that really needs the performance boost from flash attention. https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize
Many of our school partners and individual researchers use 2080 Ti GPUs (Turing architecture). Please support this architecture in flash_attn.
Still no news?
Nope, I've had no bandwidth.
OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):
https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py
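For anyone who just needs inference on Turing in the meantime, another stopgap (not mentioned above, so treat it as a suggestion) is PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which dispatches to whatever fused or memory-efficient kernel is available on the device and falls back to a plain math implementation otherwise. A minimal sketch, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch 2, 8 heads, sequence length 128, head dim 64.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# SDPA picks an available backend for the current device; on GPUs without
# FA2 support it can use a memory-efficient or math kernel instead.
with torch.inference_mode():  # forward pass only, as in the competition
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 8, 128, 64])
```

On CUDA devices this runs whichever backend PyTorch selects, so it won't match FA2's speed on Turing, but it avoids materializing the full attention matrix for long contexts when the memory-efficient backend is chosen.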
Hello, just reopening this issue as I would love to use FA2 on T4 GPUs ^^