PiotrNawrot / nanoT5

Fast & Simple repository for pre-training and fine-tuning T5-style models

Flash attention #28

Closed · Taytay closed this issue 9 months ago

Taytay commented 10 months ago

Firstly, thank you so much for this repo! I'm a huge fan of T5, and these results are extremely impressive.

I saw that you experimented with different positional embeddings like ALiBi in order to facilitate Flash Attention (FA) down the line. Was that because FA doesn't support an additive attention bias? If so, there is a PR in progress to add it:

https://github.com/Dao-AILab/flash-attention/pull/617
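For context on why ALiBi sidesteps the problem: its bias has no learned parameters, so a fused kernel can recompute it on the fly instead of materializing and backpropagating through a full (heads, seq, seq) tensor. A minimal sketch, assuming a power-of-two head count and the symmetric distance variant sometimes used for bidirectional attention (the function name is just illustrative):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Sketch of an ALiBi-style additive bias; assumes num_heads is a power of two."""
    # Fixed geometric per-head slopes: 2^(-8/n), 2^(-16/n), ... -- no learned weights.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Penalize each query-key pair proportionally to its relative distance.
    pos = torch.arange(seq_len)
    distances = -(pos.view(-1, 1) - pos.view(1, -1)).abs()  # (seq, seq)
    return slopes.view(num_heads, 1, 1) * distances         # (heads, seq, seq)
```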

It would be fun to see this repo get even faster.

PiotrNawrot commented 9 months ago

@Taytay Thanks for the nice comments, I'm glad you like the repo! Please accept my apologies for the late reply. I've been very busy lately with the ICML submission.

Yes, exactly. FA didn't support backpropagation through the extra additive bias (applied after the dot products, before the softmax). I've just noticed this PR and it looks great. I'm sure that backprop through this bias would help not only in the T5 case! Can't wait for it to be merged into FA. I'll definitely test it soon after that : ).
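For anyone following along, the bias in question is T5's learned relative position bias. In an unfused attention implementation it is trivial to apply, which is exactly the step a FlashAttention kernel with bias support would need to fuse (and differentiate through). A rough PyTorch sketch of where it sits, not nanoT5's actual code:

```python
import torch
import torch.nn.functional as F

def t5_style_attention(q, k, v, position_bias):
    # q, k, v: (batch, heads, seq, head_dim); position_bias: (1, heads, seq, seq).
    # T5 omits the 1/sqrt(d) scaling; it is folded into the initialization.
    scores = torch.matmul(q, k.transpose(-1, -2))  # dot products
    scores = scores + position_bias                # extra additive bias, pre-softmax
    weights = F.softmax(scores.float(), dim=-1).type_as(scores)
    # Training T5 needs gradients flowing into position_bias, which stock
    # FlashAttention kernels did not provide at the time.
    return torch.matmul(weights, v)
```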

Closing for now

harish-kamath commented 9 months ago

Someone has started a repo based on this one, with FA2 support: @catie-aq

https://github.com/catie-aq/flashT5