Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

New models #18

avostryakov opened this issue 3 years ago

avostryakov commented 3 years ago

Thanks for all of these models! Sometimes they work comparably to Google Translate!

I noticed that you improved the model for French and several other languages. Do you have plans to do the same for the es-en, pt-en, da-en, and it-en pairs?

And what was the trick that improved results?

jorgtied commented 3 years ago

At the moment we are focusing on training models for the Tatoeba MT Challenge that we recently released (https://github.com/Helsinki-NLP/Tatoeba-Challenge). There will be some updated models there; check it out. Otherwise, we will continue updating existing language pairs, but progress may be slow as training requires a lot of resources and time. I cannot promise new models frequently.

jorgtied commented 3 years ago

And, yes, the trick to improving the models is to train more. SentencePiece-based segmentation also helps, along with some other small improvements in data pre-processing.
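
For illustration, a minimal SentencePiece sketch of the kind of subword segmentation used in MT pre-processing (the vocabulary size and file names here are placeholders, not the actual OPUS-MT settings):

```python
import sentencepiece as spm

# Train a subword model on the source-side training text
# (input file and vocabulary size are illustrative values).
spm.SentencePieceTrainer.train(
    input="train.src.txt",
    model_prefix="source_spm",
    vocab_size=32000,
    character_coverage=1.0,
    model_type="unigram",
)

# Segment a sentence into subword pieces before feeding it to the NMT system.
sp = spm.SentencePieceProcessor(model_file="source_spm.model")
pieces = sp.encode("This is an example sentence.", out_type=str)
print(pieces)  # e.g. ['▁This', '▁is', '▁an', '▁example', '▁sentence', '.']
```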

avostryakov commented 3 years ago

Oh, great! Many thanks again for the Tatoeba-Challenge project! You recently published the Spanish-to-English and other models that we need! By the way, about the pre-processing step for OPUS datasets: maybe you have read Facebook's article, https://arxiv.org/pdf/1907.06616.pdf (Facebook FAIR’s WMT19 News Translation Task Submission). There are two important steps there:

- language-identification filtering of the parallel data, and
- length-ratio filtering of sentence pairs.

And, of course, back-translation. I noticed that you already do something with back-translation. There is another Facebook article with the details: https://arxiv.org/abs/1808.09381. This step alone allowed them to improve BLEU by 4 points.
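
For illustration only, a minimal back-translation sketch: for an es-en system, a reverse (en-es) model translates monolingual English text into synthetic Spanish, and the synthetic pairs are mixed into the real parallel data. The checkpoint name and data handling below are assumptions for the example, not the actual OPUS-MT training setup:

```python
from transformers import MarianMTModel, MarianTokenizer

# Reverse-direction model (target -> source for the final es-en system);
# the exact checkpoint is illustrative.
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def back_translate(english_sentences):
    """Turn monolingual target-side (English) text into synthetic source-side (Spanish) text."""
    batch = tokenizer(english_sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# The synthetic (Spanish, English) pairs are then appended to the genuine parallel data.
monolingual_english = ["The new models were released last week."]
synthetic_spanish = back_translate(monolingual_english)
print(list(zip(synthetic_spanish, monolingual_english)))
```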

jorgtied commented 3 years ago

Yes, I do apply language identification in the new Tatoeba-MT models, along with some other basic filtering. Length-ratio filtering has always been part of the pipeline; it has been well known since the old SMT days and the Moses tools. However, I am not as strict as the paper suggests, and there are a lot of hyper-parameters that could be optimized for each language pair. Back-translation is part of all models that include "+bt" in their name.

I need to stress that the OPUS-MT models are not tuned towards news translation from the WMT test sets, so it is not surprising if there are performance differences: simple domain adaptation boosts performance a lot. I will try to also include some fine-tuned models later; a fine-tuning framework is already integrated in OPUS-MT.
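
As a rough sketch of that kind of filtering (the thresholds, file names, and the fastText LID model are assumptions for illustration, not the settings actually used for OPUS-MT):

```python
import fasttext  # lid.176.bin is the public fastText language-identification model

lid_model = fasttext.load_model("lid.176.bin")

def predicted_lang(sentence):
    """Return the most likely language code for a sentence."""
    labels, _ = lid_model.predict(sentence.replace("\n", " "))
    return labels[0].replace("__label__", "")

def keep_pair(src, tgt, src_lang="es", tgt_lang="en", max_ratio=3.0):
    """Keep a sentence pair only if both sides are in the expected language
    and the token-length ratio is not extreme (illustrative threshold)."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
        return False
    return predicted_lang(src) == src_lang and predicted_lang(tgt) == tgt_lang

# Filter a parallel corpus stored as two aligned plain-text files.
filtered = [
    (s.strip(), t.strip())
    for s, t in zip(open("corpus.es"), open("corpus.en"))
    if keep_pair(s.strip(), t.strip())
]
```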

By the way, it's a bit funny that most people point to Facebook/Google papers when they refer to techniques developed and proposed by researchers in academia. I guess that universities have to improve their PR units ...