Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Syntax for targeting language variants like fr_BE or fr_CA #39

Open rococode opened 3 years ago

rococode commented 3 years ago

The Romance languages model seems to have a variety of variants like Belgian French, Canadian French, etc. I was wondering, is there a correct syntax to translate into these languages?

For example, for just French, I can prepend >>fr<<. But >>fr_BE<<, >>frbe<<, >>fr_be<< etc. don't seem to work (I get Italian instead).
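For reference, the prefixing described above can be sketched in a few lines of Python. This only builds the input strings. The helper name `add_target_label` is hypothetical and not part of OPUS-MT, and whether a given label is honoured depends on the model.

```python
def add_target_label(sentences, label):
    """Prefix each source sentence with a >>label<< target-language token.

    Hypothetical helper for illustration; it only formats the input text
    and does not check whether the model knows the label.
    """
    token = f">>{label}<<"
    return [f"{token} {sentence.strip()}" for sentence in sentences]


# Build inputs for plain French and for one of the variant spellings tried above.
french_inputs = add_target_label(["How are you?"], "fr")
variant_inputs = add_target_label(["How are you?"], "fr_BE")
```

The resulting strings are what would be passed to the translation model in place of the raw source sentences.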

jorgtied commented 3 years ago

Those language variants probably have too little data for the model to learn the language label. I have not tested this myself, but we should test it more carefully to make sense of all the language labels. So far, this model does not use any over- or under-sampling. Maybe some additional fine-tuning for individual language pairs could do the trick?

jorgtied commented 3 years ago

By the way, one way to check whether a language label is available at all is to grep for the token in the vocabulary file:

grep '>>' *.vocab.yml
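If a list of distinct labels is easier to read than raw grep output, the same check could be sketched in Python. This assumes only that labels appear in the vocabulary file as literal `>>label<<` strings. The function names are made up for this example, and the exact layout of the `.vocab.yml` file is not relied on.

```python
import re
from pathlib import Path

# Matches tokens such as >>fr<< or >>fr_BE<< and captures the label inside.
LABEL_RE = re.compile(r">>([^<>\s]+)<<")


def labels_in_text(text):
    """Return the sorted, de-duplicated labels found in a block of text."""
    return sorted(set(LABEL_RE.findall(text)))


def labels_in_vocab_files(directory="."):
    """Collect labels from every *.vocab.yml file in the given directory."""
    found = set()
    for path in Path(directory).glob("*.vocab.yml"):
        found.update(LABEL_RE.findall(path.read_text(encoding="utf-8")))
    return sorted(found)


def has_label(label, directory="."):
    """Check whether a label such as 'fr_BE' occurs in the vocabulary at all."""
    return label in labels_in_vocab_files(directory)
```

Running `has_label("fr_BE")` in the model directory would then show whether that spelling is in the vocabulary. Note that a label being present does not by itself mean the model has seen enough data to follow it.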