Open okias opened 3 years ago
Sounds interesting.
We also talked internally about adding an EasyNMT backend, but the major drawback was the model sizes. We will also need to implement a download manager to deal with models.
We need to investigate a bit more, but it would definitely be awesome to have offline translation capabilities.
The model sizes for bergamot are relatively small if you stick to the tiny11 models[^1], and are able to download them on demand.
There are many models available for marian, and by extension bergamot. But I'm not sure whether they'll work out of the box with bergamot's fork of marian at the moment. I think they will; they'll just be bigger, slower and more resource-intensive.
Bergamot is based on marian, which makes it really fast. But there's no Python interface. The bergamot project is developing a layer (bergamot-translator) to make it easier to integrate marian into software: it takes care of model loading, sentence splitting, all that stuff. translateLocally, which I'm working on, uses it (that's how I got interested in Dialect). If you do not really need all the performance optimisations, you could probably also just stream sentences through the marian-decoder binary via stdin/stdout (when you use --max-batch-size=1). I believe this is what OPUS-CAT does. You would still need to implement sentence splitting before it.
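To make the stdin/stdout idea concrete, here is a minimal Python sketch, assuming the decoder emits exactly one output line per input line. The names `split_sentences`, `translate_stream` and `FAKE_DECODER` are made up for illustration, the regex splitter is deliberately naive, and the real marian-decoder command line (model config, batch-size flag) is not shown here and would have to be supplied as `cmd`.

```python
import re
import subprocess
import sys


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_stream(cmd, sentences):
    """Write one sentence per line to `cmd`, read one line back per sentence."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        bufsize=1,  # line-buffered on our side of the pipes
    )
    results = []
    try:
        for sentence in sentences:
            proc.stdin.write(sentence + "\n")
            proc.stdin.flush()
            # Assumption: the child answers each input line with one output line.
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()
        proc.stdout.close()
        proc.wait()
    return results


# Stand-in "decoder" so the sketch runs without marian installed:
# it upper-cases every line it receives and flushes immediately.
FAKE_DECODER = [
    sys.executable,
    "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.rstrip('\\n').upper(), flush=True)",
]

if __name__ == "__main__":
    text = "Hello there. How are you?"
    print(translate_stream(FAKE_DECODER, split_sentences(text)))
```

The one-line-in, one-line-out loop only works if the child process does not buffer several sentences before answering, which is presumably why the batch size of 1 matters in the setup described above.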
Another issue might be that currently there's no easy way to do translation using a pivot language, like Google Translate does. This means that you'll end up with N*N models for N languages… or for Dialect that the language select UI doesn't really work. It's something being tracked in bergamot, but it's not there yet.
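As a rough illustration of why pivoting matters for the language selector, the sketch below plans a route through a pivot language and counts the models needed. The function names, the `AVAILABLE` set and the choice of English as pivot are assumptions for the example only, not anything bergamot provides. Strictly speaking, covering every direction between N languages directly takes N*(N-1) models, versus 2*(N-1) when everything goes through one pivot.

```python
def plan_route(src, tgt, available, pivot="en"):
    """Return the model pairs needed to translate src -> tgt, or None if impossible."""
    if src == tgt:
        return []
    if (src, tgt) in available:
        return [(src, tgt)]  # a direct model exists
    if (src, pivot) in available and (pivot, tgt) in available:
        return [(src, pivot), (pivot, tgt)]  # two hops via the pivot
    return None


def models_needed(n_languages, use_pivot):
    """Number of models required to cover every direction among n languages."""
    if use_pivot:
        return 2 * (n_languages - 1)  # to and from the pivot for each other language
    return n_languages * (n_languages - 1)  # one model per ordered pair


# Hypothetical set of installed models, as (source, target) pairs.
AVAILABLE = {("cs", "en"), ("en", "cs"), ("de", "en"), ("en", "de")}

if __name__ == "__main__":
    print(plan_route("cs", "de", AVAILABLE))
    print(models_needed(10, use_pivot=False), models_needed(10, use_pivot=True))
```

A UI could call something like `plan_route` for each source/target combination to decide which entries in the language selector are actually usable.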
I know of another python-based offline translation app, Argos Translate. That uses CTranslate2 which has a python module. I'm not sure about model availability, but they have a model list that has grown quite a bit since I last checked. No idea about quality and speed of those, but last time I checked it wasn't close to that of bergamot.
[^1]: Table with tiny11 model sizes: https://gist.github.com/jelmervdl/1a48816e4c3643ff5d9e1fd6821bc499
$ bergamot ls -r opus
Available models:
1. eng-fin-tiny English-Finnish
2. swe-fin-tiny Swedish-Finnish
3. ukr-swe-tiny Ukrainian-Swedish
4. ukr-fin-tiny Ukrainian-Finnish
5. ukr-dan-tiny Ukrainian-Danish
6. ukr-nob-tiny Ukrainian-Norwegian Bokmål
7. ukr-tur-tiny Ukrainian-Turkish
8. ukr-bul-tiny Ukrainian-Bulgarian
9. ukr-hun-tiny Ukrainian-Hungarian
10. ukr-ron-tiny Ukrainian-Romanian
11. fin-ukr-tiny Finnish-Ukrainian
$ bergamot ls -r browsermt
Available models:
1. cs-en-base Czech-English base
2. cs-en-tiny Czech-English tiny
3. en-cs-base English-Czech base
4. en-cs-tiny English-Czech tiny
5. de-en-base German-English base
6. de-en-tiny German-English tiny
7. en-de-base English-German base
8. en-de-tiny English-German tiny
9. es-en-tiny Spanish-English tiny
10. en-es-tiny English-Spanish tiny
11. et-en-tiny Estonian-English tiny
12. en-et-tiny English-Estonian tiny
13. is-en-tiny Icelandic-English tiny
14. nb-en-tiny Norwegian (Bokmål)-English tiny
15. nn-en-tiny Norwegian (Nynorsk)-English tiny
There are more models coming by 2023. With pivoting now implemented, many more language directions are supported.
> We also will need to implement a download manager to deal with models.

I have currently implemented a crude one using Python; perhaps we can improve it. Development currently happens on the https://github.com/browsermt/bergamot-translator repository.
Right now the harder part is to adapt Dialect for offline translation. In its current form Dialect is very tied to libsoup and its async APIs, so we need to make it more flexible.
> I have currently implemented a crude one using python, perhaps we can improve it.

Sounds good as a base to use. We have the hard requirement of using libsoup as the HTTP client. And we will probably want something reusable for other offline translation providers.
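One way a reusable download manager could stay independent of the HTTP client is to have the caller inject the fetch function, so a libsoup-based fetcher can be plugged in without this code knowing about it. The sketch below is only an assumption about the shape such a thing might take: `ModelStore`, its methods and the one-file-per-model layout are hypothetical (real models consist of several files), and it is not the crude manager mentioned above.

```python
import os
import tempfile
from pathlib import Path


class ModelStore:
    """On-demand model cache; `fetch` is any callable mapping a URL to bytes."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.fetch = fetch  # injected so the HTTP client (e.g. libsoup) stays swappable

    def path_for(self, model_id):
        # Simplification: one file per model.
        return self.cache_dir / f"{model_id}.bin"

    def is_downloaded(self, model_id):
        return self.path_for(model_id).is_file()

    def ensure(self, model_id, url):
        """Return the local path of the model, downloading it first if missing."""
        target = self.path_for(model_id)
        if target.is_file():
            return target
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        data = self.fetch(url)
        # Write to a temp file in the same directory, then rename atomically,
        # so an interrupted download never leaves a half-written model behind.
        fd, tmp_name = tempfile.mkstemp(dir=self.cache_dir)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
            os.replace(tmp_name, target)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
        return target
```

Calling `store.ensure("cs-en-tiny", url)` a second time returns the cached path without fetching again, which is the behaviour an "available offline after first use" UI would rely on.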
Related; might possibly be useful, although it's JavaScript instead of Python:
Web frontend to Mozilla's Bergamot (works offline once it downloads the translation files) @ https://mozilla.github.io/translate/ Source @ https://github.com/mozilla/translate
Bergamot[^1] has since gained a Python API, with packages available on PyPI. You can use it with the translation models Mozilla is training, but also with the ones from OPUS or any marian-nmt model. And support for more languages is coming.
> Right now the harder part is to adapt Dialect for offline translation. In its current form Dialect is very tied to libsoup and its async APIs, so we need to make it more flexible.

That to me looks like the most difficult thing about integrating bergamot (or any other non-HTTP translation service) into Dialect.
I quickly looked at libsoup and whether it would allow you to integrate a protocol handler or something, and hook into it that way, but couldn't find anything.
Wrapping bergamot-translator into an HTTP server would be easy (there are several projects out there that do that already…) but then you'd be running a local web server just to integrate it, which sounds overly complicated.
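For comparison, an in-process alternative to a local web server could look roughly like the following: a small provider interface with an async-style callback, where a local backend runs the blocking translation on a worker thread. This is purely a sketch under assumptions. `TranslationProvider`, `LocalProvider` and `translate_fn` are hypothetical names, not Dialect's or bergamot's API, and in a GTK application the callback would additionally have to be handed back to the main loop (for example with `GLib.idle_add`).

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class TranslationProvider(ABC):
    """Transport-independent interface; HTTP and local backends would both implement it."""

    @abstractmethod
    def translate_async(self, text, src, dest, on_done):
        """Start a translation and later call on_done(result, error)."""


class LocalProvider(TranslationProvider):
    """Runs a blocking translate function on a worker thread."""

    def __init__(self, translate_fn):
        self._translate_fn = translate_fn  # hypothetical: (text, src, dest) -> str
        self._pool = ThreadPoolExecutor(max_workers=1)

    def translate_async(self, text, src, dest, on_done):
        def work():
            try:
                result = self._translate_fn(text, src, dest)
            except Exception as exc:
                on_done(None, exc)  # report failures through the same callback
            else:
                on_done(result, None)

        self._pool.submit(work)

    def close(self):
        # Wait for any pending translation before shutting the worker down.
        self._pool.shutdown(wait=True)
```

With an interface like this, the UI code would not need to know whether a provider talks HTTP through libsoup or calls a local engine.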
[^1]: Not really Mozilla's; they're just one of the partners working on that project.
Bergamot is a consortium coordinated by the University of Edinburgh with partners Charles University in Prague, the University of Sheffield, University of Tartu, and Mozilla. [1]
I just tested some basic offline translations as described in the HOWTO [2] and it worked very nicely (at least for the few examples I tried), so maybe it would be a good idea to integrate it into Dialect? It's developed under free and open-source licenses, so every component should be usable by Dialect, and there is no need to worry about API keys or rate limiting since it's all offline. The translation models could either be downloaded on demand (as the Mozilla extension does) or downloaded in advance.
[1] https://browser.mt/
[2] https://github.com/mozilla-extensions/bergamot-browser-extension/blob/v0.4.0/docs/INSTALL.md