SimJeg opened this issue 1 year ago
I haven't had much bandwidth to work on Turing.
@tridao for more context, I recently published a post on the current Kaggle LLM science exam competition (here) showing that it's possible to run a 70B model on a single T4 GPU. However, I am still limited by VRAM OOMs when using long contexts. There is already code for Llama 2 + HF (here), but it requires FA2 and thus does not work on T4 GPUs.
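For context, here is a minimal sketch of the kind of T4-compatible loading I mean (assuming a recent transformers version that exposes `attn_implementation`; the checkpoint and 4-bit settings below are placeholders, not the exact setup from the post):

```python
# Illustrative sketch only: load a large model on a T4 (Turing, sm75) without FA2.
# "flash_attention_2" would fail here because FA2 needs compute capability >= 8.0,
# so we fall back to PyTorch SDPA. Checkpoint and quantization are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quant_config,
    attn_implementation="sdpa",  # runs on T4; "flash_attention_2" does not
)
```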
@tridao I could take a look at implementing this, as I'm also keen on T4 support. Any obvious caveats that spring to mind?
I concur with SimJeg: enabling FA2 for Turing would give it massive exposure in the Kaggle community.
I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?
Awesome. Yes, just inference in that competition.
Hi, has there been any update on this?
No, I haven't had much time.
I'll have to bring this up again: there is a new $1M Kaggle competition (https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize) that really needs the performance boost from flash attention.
Many of our school partners and individual researchers are using 2080 Ti (Turing architecture) GPUs. Please add flash_attn support for this architecture.
Still no news?
Nope, I've had no bandwidth.
OpenAI's Triton implementation of flash attention works on Turing GPUs (just tested this myself):
https://github.com/openai/triton/blob/main/python/tutorials/06-fused-attention.py
So does this mean the 2070S supports at least FlashAttention 1? Is that the same as SDPA? I was under the impression that my GPU had no luck with any kind of flash attention, and the kohya_ss trainer keeps saying "Torch was not compiled with flash attention" even though I enabled SDPA and it is indeed faster. I wonder if I should even bother looking into this.
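For what it's worth, here is a minimal sketch (assuming PyTorch 2.x) of how to check which SDPA backends your GPU can actually use. That warning most likely just means the flash backend is unavailable; on Turing the memory-efficient backend can still kick in, which would explain the speedup you see:

```python
# Sketch (assumes PyTorch 2.x): check which SDPA backends are usable on this GPU.
import torch
import torch.nn.functional as F

print(torch.cuda.get_device_name(0))
print("flash backend enabled:   ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient enabled:   ", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math (fallback) enabled: ", torch.backends.cuda.math_sdp_enabled())

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the memory-efficient backend, which runs on Turing (sm75), to confirm
# SDPA is still fused even though the FA2-style flash backend is not available.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=False, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```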
Hello, just reopening this issue as I would love to use FA2 on T4 GPUs ^^