Open hrsmanian opened 1 week ago
@hrsmanian, thank you for your interest in the project. At first glance it should work, but I haven't run comprehensive tests. If you try it, let me know whether it works or not.
@Ingvarstep - does it work?
It has a different base class in Transformers. So while it is able to load, I am not sure whether the output is correct.
I have tested https://huggingface.co/bigscience/mt0-base and it works, producing the same outputs as transformers' MT5ForConditionalGeneration. Even if the classes are different, what matters for PyTorch is that the keys of the weights match.
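To illustrate the point about weight keys: PyTorch's `load_state_dict` doesn't care which class produced the weights, only that the state-dict keys (derived from attribute names) line up. A minimal sketch with two toy modules (the classes `EncoderA`/`EncoderB` are hypothetical, not from the project):

```python
import torch
import torch.nn as nn

# Two different classes whose submodule attribute names match,
# so their state_dict keys ("proj.weight", "proj.bias") are identical.
class EncoderA(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

class EncoderB(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)  # same attribute name -> same keys

a, b = EncoderA(), EncoderB()

# B can load A's weights even though the classes differ.
missing, unexpected = b.load_state_dict(a.state_dict())
assert missing == [] and unexpected == []
assert torch.equal(a.proj.weight, b.proj.weight)
```

This is why an mT5 checkpoint can load into a T5-style class here: as long as the parameter names agree, the tensors transfer, and only genuinely mismatched keys would show up in `missing`/`unexpected`.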
Thanks for checking. Did you observe any speed-up? I did not observe any either. Also, since the classes and tokenizer are different, would it not be better to have a separate implementation for mT5?
For generation, the attention mechanism is not a bottleneck, especially for short sequences. You only get a speed-up at sequence lengths of 4k+. Regarding a separate class, I don't see any reason for one right now.
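A back-of-envelope FLOP count shows why: the attention-score matmuls scale as n² in sequence length while the projections and FFN scale as n, so attention only dominates a layer's cost at long sequences. A rough sketch (the shapes `d=768`, `d_ff=2048` are illustrative assumptions, not exact mT5 numbers):

```python
# Rough per-layer FLOP estimate for a transformer layer (illustrative, not exact).
def layer_flops(n, d, d_ff):
    proj   = 8 * n * d * d      # Q, K, V and output projections: 4 matmuls of (n,d)x(d,d)
    scores = 4 * n * n * d      # QK^T and attention @ V: 2 matmuls with an (n,n) operand
    ffn    = 4 * n * d * d_ff   # two FFN matmuls of (n,d)x(d,d_ff)
    return proj, scores, ffn

d, d_ff = 768, 2048
for n in (128, 512, 4096):
    p, s, f = layer_flops(n, d, d_ff)
    share = s / (p + s + f)
    print(f"n={n}: attention-score share of layer FLOPs = {share:.0%}")
```

Under these assumptions the quadratic score term is a few percent of the layer at n=128 but over half at n=4096, which matches the observation that Flash Attention pays off mainly at 4k+ sequence lengths (and during training, where sequences are processed in full).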
But Flash Attention is clearly beneficial for training.
Hi, first of all, great work. I am a big proponent of Flan-T5 and use it in my projects. For multilingual tasks, mT5 and the bigscience/mt0 models provide a good baseline and are truly multilingual. Does Flash Attention work on the mT5 architecture? It seems only T5 is supported right now?
https://huggingface.co/bigscience/mt0-large, which is based on mT5, is the model I am looking at.
Thanks for the great work