dialect-app / dialect

A translation app for GNOME.
https://dialectapp.org/
GNU General Public License v3.0

[Feature req.] Integrate Bergamot for offline translations! #183

Open okias opened 3 years ago

okias commented 3 years ago

Bergamot is a consortium coordinated by the University of Edinburgh with partners Charles University in Prague, the University of Sheffield, University of Tartu, and Mozilla. [1]

I just tested some basic offline translations as described in the HOWTO [2], and it worked very nicely (at least for the few examples I tried), so maybe it would be a good idea to integrate it into Dialect? It's developed under free and open-source licenses, so every component should be usable by Dialect, and there is no need to worry about API keys or rate limiting since it's all offline. Translation models could either be downloaded on demand (as the Mozilla extension does) or downloaded in advance.

[1] https://browser.mt/
[2] https://github.com/mozilla-extensions/bergamot-browser-extension/blob/v0.4.0/docs/INSTALL.md

rafaelmardojai commented 3 years ago

Sounds interesting.

We also talked internally about adding an EasyNMT backend, but the major drawback was the model sizes. We also will need to implement a download manager to deal with models.

We need to investigate a bit more, but definitely would be awesome to have offline translation capabilities.

jelmervdl commented 3 years ago

The model sizes for bergamot are relatively small if you stick to the tiny11 models[^1], and are able to download them on demand.

There are many models available for marian, and by extension bergamot. But I'm not sure whether they'll work out of the box with bergamot's fork of marian at the moment. I think they will, just bigger, slower and more resource intensive.

Bergamot is based on marian, which makes it really fast. But there's no python interface. The bergamot project is developing a layer (bergamot-translator) to make it easier to integrate marian into software: it takes care of model loading, sentence splitting, all that stuff. translateLocally is using it (which I'm working on, which is how I got interested in Dialect). If you do not really need all the performance optimisations, you could probably also just stream sentences through the marian-decoder binary via stdin/stdout (when you use --max-batch-size=1). I believe this is what OPUS-CAT does. You would still need to implement sentence splitting before it.
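A minimal sketch of that stdin/stdout approach, assuming a decoder that reads one sentence per line and writes one translated line back. The sentence splitter is deliberately naive, the function names are made up for illustration, and the real marian-decoder command line is not shown here: a stand-in process that upper-cases its input takes its place so the sketch runs anywhere.

```python
import re
import subprocess
import sys


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    # Collapse newlines and repeated spaces so every sentence fits on one line.
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return re.split(r"(?<=[.!?])\s+", normalized)


def translate_lines(decoder_cmd, sentences):
    """Send one sentence per line to decoder_cmd, read one line back per sentence."""
    payload = "".join(sentence + "\n" for sentence in sentences)
    result = subprocess.run(
        decoder_cmd, input=payload, capture_output=True, text=True, check=True
    )
    translated = result.stdout.splitlines()
    if len(translated) != len(sentences):
        raise RuntimeError("decoder returned a different number of lines than it was given")
    return translated


# Stand-in "decoder" (NOT marian): a child Python process that upper-cases stdin.
STAND_IN_DECODER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

With a real decoder, `decoder_cmd` would be its command line instead. This version sends everything in one go and waits for the process to exit, so it pays the model loading cost on every call; a long-running process with line-by-line reads would avoid that but needs more care around buffering.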

Another issue might be that currently there's no easy way to do translation using a pivot language, like Google Translate does. This means that you'll end up with N*N models for N languages… or for Dialect that the language select UI doesn't really work. It's something being tracked in bergamot, but it's not there yet.
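To put rough numbers on this, here is a small sketch under the assumption of one model per translation direction. Without pivoting, every ordered pair of distinct languages needs its own model, which is N*(N-1) (the N*N above, minus the same-language pairs). With a single pivot language, each other language only needs a model to and from the pivot. The function names are purely illustrative.

```python
def direct_model_count(n_languages):
    """Models needed when every ordered pair of distinct languages has its own model."""
    return n_languages * (n_languages - 1)


def pivot_model_count(n_languages):
    """Models needed when every language only translates to and from one pivot language."""
    return 2 * (n_languages - 1)


for n in (5, 10, 20):
    print(n, direct_model_count(n), pivot_model_count(n))
```

For 10 languages that is 90 direct models versus 18 with a pivot, at the cost of translating twice for any pair that does not include the pivot.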

I know of another python-based offline translation app, Argos Translate. That uses CTranslate2 which has a python module. I'm not sure about model availability, but they have a model list that has grown quite a bit since I last checked. No idea about quality and speed of those, but last time I checked it wasn't close to that of bergamot.

[^1]: Table with tiny11 model sizes: https://gist.github.com/jelmervdl/1a48816e4c3643ff5d9e1fd6821bc499

jerinphilip commented 2 years ago

#256 looks like a duplicate of this issue. To further update on @jelmervdl's comments above:

  1. manylinux wheels are now available for installation from PyPI.
  2. Pivoting is implemented.
$ bergamot ls -r opus
Available models: 
    1. eng-fin-tiny English-Finnish
    2. swe-fin-tiny Swedish-Finnish
    3. ukr-swe-tiny Ukrainian-Swedish
    4. ukr-fin-tiny Ukrainian-Finnish
    5. ukr-dan-tiny Ukrainian-Danish
    6. ukr-nob-tiny Ukrainian-Norwegian Bokmål
    7. ukr-tur-tiny Ukrainian-Turkish
    8. ukr-bul-tiny Ukrainian-Bulgarian
    9. ukr-hun-tiny Ukrainian-Hungarian
   10. ukr-ron-tiny Ukrainian-Romanian
   11. fin-ukr-tiny Finnish-Ukrainian
$ bergamot ls -r browsermt
Available models: 
    1. cs-en-base Czech-English base
    2. cs-en-tiny Czech-English tiny
    3. en-cs-base English-Czech base
    4. en-cs-tiny English-Czech tiny
    5. de-en-base German-English base
    6. de-en-tiny German-English tiny
    7. en-de-base English-German base
    8. en-de-tiny English-German tiny
    9. es-en-tiny Spanish-English tiny
   10. en-es-tiny English-Spanish tiny
   11. et-en-tiny Estonian-English tiny
   12. en-et-tiny English-Estonian tiny
   13. is-en-tiny Icelandic-English tiny
   14. nb-en-tiny Norwegian (Bokmål)-English tiny
   15. nn-en-tiny Norwegian (Nynorsk)-English tiny

There are more models coming by 2023. With pivoting implemented as of now, a lot more language-directions are supported.
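As an illustration of why pivoting widens coverage, the sketch below looks for a chain of models between two languages using a few of the browsermt model names from the listing above. It assumes names of the form `src-tgt-size` and uses a plain breadth-first search; it is not how bergamot-translator implements pivoting, and the helper names are hypothetical.

```python
from collections import deque

# A subset of the browsermt models listed above.
MODELS = [
    "cs-en-tiny", "en-cs-tiny",
    "de-en-tiny", "en-de-tiny",
    "es-en-tiny", "en-es-tiny",
    "et-en-tiny", "en-et-tiny",
    "is-en-tiny",
]


def build_graph(model_names):
    """Map each source language to {target language: model name}."""
    graph = {}
    for name in model_names:
        src, tgt, _size = name.split("-")
        graph.setdefault(src, {})[tgt] = name
    return graph


def find_route(graph, src, tgt):
    """Breadth-first search for the shortest chain of models from src to tgt."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        lang, route = queue.popleft()
        if lang == tgt:
            return route
        for next_lang, model in graph.get(lang, {}).items():
            if next_lang not in seen:
                seen.add(next_lang)
                queue.append((next_lang, route + [model]))
    return None  # no chain of models connects the two languages


graph = build_graph(MODELS)
print(find_route(graph, "cs", "de"))  # chain through English
print(find_route(graph, "en", "is"))  # no model translates into Icelandic here
```

In this subset Czech to German resolves to `cs-en-tiny` followed by `en-de-tiny`, while English to Icelandic has no route because only the `is-en` direction is listed.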

> We also will need to implement a download manager to deal with models.

I have currently implemented a crude one using python, perhaps we can improve it. Development currently happens on the https://github.com/browsermt/bergamot-translator repository.
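For discussion, a rough sketch of the bookkeeping such a download manager needs regardless of which HTTP client does the fetching: where a model lives on disk, and whether the local copy matches an expected checksum. Everything here is hypothetical (the cache layout, the function names, and the assumption that a SHA-256 checksum is published per model); it is not the existing implementation in bergamot-translator.

```python
import hashlib
from pathlib import Path


def model_path(cache_root, repository, model_id):
    """Hypothetical cache layout: <cache_root>/<repository>/<model_id>."""
    return Path(cache_root) / repository / model_id


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_sha256):
    """True if the file is missing or its checksum does not match the expected one."""
    path = Path(path)
    return not path.is_file() or sha256_of(path) != expected_sha256
```

The actual transfer is left out on purpose, so the same checks could sit behind whichever HTTP client an application is required to use.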

rafaelmardojai commented 2 years ago

Right now the harder part is to adapt Dialect for offline translation. In its current form Dialect is very tied to libsoup and its async APIs, so we need to make it more flexible.

> I have currently implemented a crude one using python, perhaps we can improve it.

Sounds good as a base to use. We have the hard requirement of using libsoup as HTTP client. And probably we will want a reusable thing for other offline translation providers.
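One possible shape for that flexibility, sketched with made-up names rather than Dialect's actual classes: a provider interface that only promises "call me back with a result or an error", plus an offline provider that runs a blocking translate function on a worker thread. The `translate_fn` argument stands in for whatever local engine is used.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class TranslationProvider(ABC):
    """Hypothetical interface: providers report back through a callback."""

    @abstractmethod
    def translate(self, text, src, dest, on_done):
        """Start a translation; later call on_done(result, error)."""


class OfflineProvider(TranslationProvider):
    """Runs a blocking translate function on a single worker thread."""

    def __init__(self, translate_fn):
        self._translate_fn = translate_fn  # assumed signature: (text, src, dest) -> str
        self._pool = ThreadPoolExecutor(max_workers=1)

    def translate(self, text, src, dest, on_done):
        future = self._pool.submit(self._translate_fn, text, src, dest)

        def _finished(done_future):
            error = done_future.exception()
            result = None if error else done_future.result()
            # Note: this runs on the worker thread. A GTK application would
            # need to hand the result back to the main loop before touching the UI.
            on_done(result, error)

        future.add_done_callback(_finished)
        return future

    def close(self):
        self._pool.shutdown(wait=True)
```

An HTTP-backed provider could implement the same `translate` signature on top of its existing async requests, so the UI code would not need to know which kind of provider it is talking to.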

garrett commented 2 years ago

Related; might possibly be useful, although it's JavaScript instead of Python:

Web frontend to Mozilla's Bergamot (works offline once it downloads the translation files) @ https://mozilla.github.io/translate/
Source @ https://github.com/mozilla/translate

jelmervdl commented 2 years ago

Bergamot[^1] has since gained a Python API, and packages are available on PyPI. You can use the translation models Mozilla is training with it, but also the ones from OPUS or any marian-nmt model. And support for more languages is coming.

> Right now the harder part is to adapt Dialect for offline translation. In its current form Dialect is very tied to libsoup and its async APIs, so we need to make it more flexible.

That to me looks like the most difficult thing about integrating bergamot (or any other non-HTTP translation service) into Dialect.

I quickly looked at libsoup and whether it would allow you to integrate a protocol handler or something, and hook into it that way, but couldn't find anything.

Wrapping bergamot-translator into an HTTP server would be easy (there have been several projects out there that do that already…) but then you'd be running a local web server just for integrating it, which sounds a bit overly complicated.

[^1]: Not really Mozilla's; they're just one of the partners working on that project.