catie-aq / flashT5

A fast implementation of T5/UL2 in PyTorch using Flash Attention
Apache License 2.0

is Flash Attention a requirement? #4

Open SoshyHayami opened 3 months ago

SoshyHayami commented 3 months ago

Hi. First of all, thank you so much for this awesome work. I really want to try it out, but the problem is that I use a few V100 cards and unfortunately they don't support Flash Attention 2, so I was wondering whether I should nonetheless try using this repo?

Tbh I don't really care that much about Flash Attention; I just needed a good starting point to train the large or XL variant of the model in PyTorch from scratch. Most scripts I see use JAX or TPUs, or are very hard-coded and thus difficult to work with.

So, I want to know whether FA2 is a requirement. Thanks

b-albar commented 2 months ago

Hi, no, you can use the reference implementation of attention. This is selected in the config file by setting attention_type: "ref". Possible values are "ref", "triton", and "fa2"; the last two use a Flash Attention implementation written in Triton and a patched version of the original Flash Attention kernel, respectively.
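
For what it's worth, here is a rough sketch of how one might pick the value at runtime. The compute-capability check below is the generic Flash Attention 2 requirement (Ampere or newer, sm80+), which a V100 (sm70) does not meet; it is not something specific to this repo, and only the attention_type key and its allowed values come from the comment above.

```python
# Rough sketch: choose an attention_type value based on the available GPU.
# "ref", "triton" and "fa2" are the values accepted by the config file;
# the capability check is the general Flash Attention 2 requirement
# (Ampere or newer), not logic taken from this repo.
import torch

if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    attention_type = "fa2"   # Ampere/Hopper-class GPU: Flash Attention 2 kernels are usable
else:
    attention_type = "ref"   # e.g. V100 (sm70): fall back to the reference attention

print(f'attention_type: "{attention_type}"')
```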