OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Support for Facebook's new SeamlessM4T (Multilingual + Multimodal) #1429

Open Infinitay opened 1 year ago

Infinitay commented 1 year ago

Facebook just released a new multimodal model for multiple languages. I assume it's the successor to NLLB: one model to rule them all. It would be amazing to have CT2 support for it to further reduce the size of the large model. If I remember correctly, when I used Whisper large and NLLB-200 medium, I was using about 9-10 GB of VRAM for what should be under 3B parameters. Switching to CT2's Whisper large-v2 and NLLB-200 medium (both float16) took me down to 5-6 GB of VRAM. I'm hoping that with CT2 support for SeamlessM4T we can see similar improvements with negligible loss of accuracy, all while maintaining solid multimodal metrics.

That said, if SM4T support does land in the future, would you be so kind as to include metrics comparing vanilla SM4T and CT2's SM4T on as many tasks (e.g. S2TT, T2TT, etc.) as possible? If not, maybe a script so we can run the comparison ourselves?

Thanks, and hopefully this isn't too much to ask; I'm sure other people would take advantage of it too.
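The VRAM drop from float16 is easy to sanity-check with back-of-envelope math: halving the bytes per parameter roughly halves the weight footprint. A minimal sketch (the parameter counts below are approximations, not exact figures for these checkpoints):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory footprint of the model weights alone, in GB.
    Activations, KV cache, and framework overhead come on top of this."""
    return n_params * bytes_per_param / 1e9

# Approximate parameter counts, for illustration only:
whisper_large = 1.55e9  # Whisper large-v2, ~1.55B parameters
nllb_medium = 1.3e9     # NLLB-200 distilled, ~1.3B parameters

total = whisper_large + nllb_medium
fp32 = weight_memory_gb(total, 4)  # float32: 4 bytes per parameter
fp16 = weight_memory_gb(total, 2)  # float16: 2 bytes per parameter
print(f"fp32 weights: {fp32:.1f} GB, fp16 weights: {fp16:.1f} GB")
```

This lands close to the observed numbers: roughly 11 GB of weights at float32 (plus activations, matching the 9-10 GB figure once frameworks share or page memory) versus under 6 GB at float16.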


Website: https://ai.meta.com/resources/models-and-libraries/seamless-communication/
Code: https://github.com/facebookresearch/seamless_communication
Paper: https://ai.meta.com/research/publications/seamless-m4t/
Blog Post: https://ai.meta.com/blog/seamless-m4t/

Some metrics from the paper:

![image](https://github.com/Sharrnah/whispering/assets/6964154/99ba2dc2-af3b-4375-85cb-a39baa660753)
![image](https://github.com/Sharrnah/whispering/assets/6964154/18a8df3a-d848-42e8-b696-63bf42cfa9b4)
![image](https://github.com/Sharrnah/whispering/assets/6964154/26630236-ff4f-4c0f-b1b8-76a3582b2602)
hobodrifterdavid commented 1 year ago

The speech-to-speech translation of this model is pretty good; there's an online demo here: https://seamless.metademolab.com/

vince62s commented 1 year ago

The demo looks nice overall, but here is a first simple test:

[screenshot of the translation test]

ASR is 100% correct. The translation is wrong, while DeepL and Google Translate are 100% accurate. TTS sounds good, but it speaks the wrong translation.

The issue with Meta's model (and it was already the case with NLLB) is that the research goal is really useful, but when the result is not SOTA and produces glitches like this, in the end you are reluctant to use it. If they open-sourced 100% of the work, the community could contribute to improving it.

Don't get me wrong, the work remains impressive.

hobodrifterdavid commented 1 year ago

In the paper they compare it to a cascaded approach (ASR, then translation, then TTS). I didn't look at it in detail, but the nice thing here is that it's all one model, easy to deploy, for around 35 languages.

For TTS, outside of a handful of languages, it's hard to find decent-sounding models (comparable to, say, the Microsoft APIs). Seamless seems to do pretty well in terms of 'technical' quality, but the specific voices it's fine-tuned on could have been better. It sounds like they used LJSpeech for English; they could have used our Jenny dataset instead (https://youtu.be/JZWeYbtCisk?si=xfP-Km3ZFGRI7ZTZ&t=239). :D