Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Lack of transparency about the training data used. Does fine-tuning make sense? #92

Open Thybo-D opened 1 year ago

Thybo-D commented 1 year ago

I'm aware the translation models are trained on data from the OPUS corpus, but it's unclear to me exactly how much data these models were trained on, and whether all available OPUS data for a given language direction was used.

Does it make sense to download OPUS data and further fine-tune these models?

Does it make sense to find other data sources and fine-tune the models? If so, roughly how many sentence pairs would I need to see an improvement?

I'm particularly interested in fine-tuning "Helsinki-NLP/opus-mt-nl-en" and "Helsinki-NLP/opus-mt-en-nl".
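
For context: these checkpoints are Marian models published on the Hugging Face Hub, so fine-tuning is typically done through the Transformers `Seq2SeqTrainer` rather than the Marian pipeline in this repo. Below is a minimal sketch; the dataset path, column names, and hyperparameters are placeholders, not recommendations.

```python
# Hedged sketch of fine-tuning opus-mt-nl-en with Hugging Face Transformers.
# "pairs.jsonl" with "nl"/"en" fields and all hyperparameters are assumptions.
from transformers import (
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import load_dataset

model_name = "Helsinki-NLP/opus-mt-nl-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Expects one JSON object per line with "nl" and "en" text fields.
raw = load_dataset("json", data_files={"train": "pairs.jsonl"})

def preprocess(batch):
    model_inputs = tokenizer(batch["nl"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["en"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw["train"].map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-nl-en-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,   # small LR to limit catastrophic forgetting
    num_train_epochs=1,
    warmup_steps=500,     # warm-up to get the optimizer back on track
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```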

jorgtied commented 1 year ago

Yes, more or less all data that was in OPUS at the time of training. I'm not sure about fine-tuning; the model may also forget previously learned information. You could continue training with a larger data set, but then you may also need a longer warm-up period to get the optimizer back on track.
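
One way to check for the forgetting that jorgtied mentions is to score the base and fine-tuned models on a held-out, general-domain test set and compare BLEU. A rough sketch follows; the test file names and the fine-tuned model path are assumptions carried over from the sketch above.

```python
# Hedged sketch: compare BLEU of base vs. fine-tuned model on a held-out set
# to detect catastrophic forgetting. "test.nl"/"test.en" are placeholders.
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

def bleu(model_name, src_lines, ref_lines, batch_size=16):
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    hyps = []
    for i in range(0, len(src_lines), batch_size):
        batch = tok(src_lines[i:i + batch_size], return_tensors="pt",
                    padding=True, truncation=True)
        out = model.generate(**batch)
        hyps.extend(tok.batch_decode(out, skip_special_tokens=True))
    return sacrebleu.corpus_bleu(hyps, [ref_lines]).score

src = open("test.nl").read().splitlines()
ref = open("test.en").read().splitlines()
print("base:      ", bleu("Helsinki-NLP/opus-mt-nl-en", src, ref))
print("fine-tuned:", bleu("opus-mt-nl-en-finetuned", src, ref))
```

If the fine-tuned score drops noticeably on general-domain text while improving in-domain, mixing some original OPUS data back into the continued-training set is a common mitigation.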