Open hrsmanian opened 1 week ago
@hrsmanian, thank you for your interest in the project. At first glance it should work, but I haven't run comprehensive tests. If you try it, let me know whether it works or not.
@Ingvarstep - does it work?
It has a different base class in Transformers. So while it is able to load, I am not sure whether the output is correct.
I have tested https://huggingface.co/bigscience/mt0-base and it works, producing the same outputs as transformers' MT5ForConditionalGeneration. Even if the classes are different, what matters for PyTorch is that the keys of the weights match.
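To illustrate the point about weight keys: PyTorch's `load_state_dict` doesn't care which class produced the weights, only that the state-dict keys (derived from attribute names) line up. A minimal sketch with two toy modules (the classes `EncoderA`/`EncoderB` are hypothetical, not from the project):

```python
import torch
import torch.nn as nn

# Two different classes whose submodule attribute names match,
# so their state_dict keys ("proj.weight", "proj.bias") are identical.
class EncoderA(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

class EncoderB(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)  # same attribute name -> same keys

a, b = EncoderA(), EncoderB()

# B can load A's weights even though the classes differ.
missing, unexpected = b.load_state_dict(a.state_dict())
assert missing == [] and unexpected == []
assert torch.equal(a.proj.weight, b.proj.weight)
```

This is why an mT5 checkpoint can load into a T5-style class here: as long as the parameter names agree, the tensors transfer, and only genuinely mismatched keys would show up in `missing`/`unexpected`.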
Thanks for checking. Did you observe any speed-up? I did not observe any either. Also, since the classes and tokenizer are different, would it not be better to have a separate implementation for mT5?
For generation, the attention mechanism is not a bottleneck, especially for short sequences. You only get a speed-up at sequence lengths of 4k+. Regarding a separate class, I don't see any reason for one right now.
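A back-of-envelope FLOP count shows why: the attention-score matmuls scale as n² in sequence length while the projections and FFN scale as n, so attention only dominates a layer's cost at long sequences. A rough sketch (the shapes `d=768`, `d_ff=2048` are illustrative assumptions, not exact mT5 numbers):

```python
# Rough per-layer FLOP estimate for a transformer layer (illustrative, not exact).
def layer_flops(n, d, d_ff):
    proj   = 8 * n * d * d      # Q, K, V and output projections: 4 matmuls of (n,d)x(d,d)
    scores = 4 * n * n * d      # QK^T and attention @ V: 2 matmuls with an (n,n) operand
    ffn    = 4 * n * d * d_ff   # two FFN matmuls of (n,d)x(d,d_ff)
    return proj, scores, ffn

d, d_ff = 768, 2048
for n in (128, 512, 4096):
    p, s, f = layer_flops(n, d, d_ff)
    share = s / (p + s + f)
    print(f"n={n}: attention-score share of layer FLOPs = {share:.0%}")
```

Under these assumptions the quadratic score term is a few percent of the layer at n=128 but over half at n=4096, which matches the observation that Flash Attention pays off mainly at 4k+ sequence lengths (and during training, where sequences are processed in full).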
But Flash Attention is clearly beneficial for training.
Hi, first of all, great work. I am a big proponent of Flan-T5 and use it in my projects. For multilingual tasks, mT5 and the bigscience/mt0 models provide a good baseline and are truly multilingual. Does Flash Attention work on the mT5 architecture? It seems only T5 is supported right now?
https://huggingface.co/bigscience/mt0-large, which is based on mT5, is the model I am looking at.
Thanks for the great work