Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Lack of transparency about the training data used. Does fine-tuning make sense? #92

Open Thybo-D opened 1 year ago

Thybo-D commented 1 year ago

I'm aware the translation models are trained on data from the OPUS corpus, but it's unclear to me exactly how much data these models were trained on, and whether all available OPUS data for a given language direction was used.

Does it make sense to download OPUS data and further fine-tune these models?

Does it make sense to find other data sources and fine-tune the models? If so, roughly how many sentence pairs would I need to see an improvement?

I'm particularly interested in fine-tuning "Helsinki-NLP/opus-mt-nl-en" and "Helsinki-NLP/opus-mt-en-nl".
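
For context: these checkpoints are Marian models published on the Hugging Face Hub, so fine-tuning is typically done through the Transformers `Seq2SeqTrainer` rather than the Marian pipeline in this repo. Below is a minimal sketch; the dataset path, column names, and hyperparameters are placeholders, not recommendations.

```python
# Hedged sketch of fine-tuning opus-mt-nl-en with Hugging Face Transformers.
# "pairs.jsonl" with "nl"/"en" fields and all hyperparameters are assumptions.
from transformers import (
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import load_dataset

model_name = "Helsinki-NLP/opus-mt-nl-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Expects one JSON object per line with "nl" and "en" text fields.
raw = load_dataset("json", data_files={"train": "pairs.jsonl"})

def preprocess(batch):
    model_inputs = tokenizer(batch["nl"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["en"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw["train"].map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-nl-en-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,   # small LR to limit catastrophic forgetting
    num_train_epochs=1,
    warmup_steps=500,     # warm-up to get the optimizer back on track
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```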

jorgtied commented 1 year ago

Yes, more or less all data that was in OPUS at the time of training. I'm not sure about fine-tuning; the model may also forget previously learned information. You could continue training with a larger data set, but then you may also need a longer warm-up period to get the optimizer back on track.
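
One way to check for the forgetting that jorgtied mentions is to score the base and fine-tuned models on a held-out, general-domain test set and compare BLEU. A rough sketch follows; the test file names and the fine-tuned model path are assumptions carried over from the sketch above.

```python
# Hedged sketch: compare BLEU of base vs. fine-tuned model on a held-out set
# to detect catastrophic forgetting. "test.nl"/"test.en" are placeholders.
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

def bleu(model_name, src_lines, ref_lines, batch_size=16):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    hyps = []
    for i in range(0, len(src_lines), batch_size):
        batch = tok(src_lines[i:i + batch_size], return_tensors="pt",
                    padding=True, truncation=True)
        out = model.generate(**batch)
        hyps.extend(tok.batch_decode(out, skip_special_tokens=True))
    return sacrebleu.corpus_bleu(hyps, [ref_lines]).score

src = open("test.nl").read().splitlines()
ref = open("test.en").read().splitlines()
print("base:      ", bleu("Helsinki-NLP/opus-mt-nl-en", src, ref))
print("fine-tuned:", bleu("opus-mt-nl-en-finetuned", src, ref))
```

If the fine-tuned score drops noticeably on general-domain text while improving in-domain, mixing some original OPUS data back into the continued-training set is a common mitigation.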