christos-c / bible-corpus

A multilingual parallel corpus created from translations of the Bible.
Creative Commons Zero v1.0 Universal
172 stars 47 forks source link

Training Multilingual Direct Translation Model on Corpus? #11

Closed normanhh3 closed 3 years ago

normanhh3 commented 3 years ago

I came across this open source release of a new set of tools and trained model for building a model that will perform language to language translation for 100 languages without using english as an intermediary.

https://github.com/pytorch/fairseq/tree/master/examples/m2m_100

The article I read, https://venturebeat.com/2020/10/19/facebooks-open-source-m2m-100-model-can-translate-between-100-different-languages/ indicated that there were some languages where they had to generate synthetic material for translation because there wasn't enough corpus material to work with.

That got me thinking that potentially a much better translation model could be built based off of sacred text translation that has been done over the years.

The number of translations in each language and the number of languages that the Bible has been translated into certainly should be a sufficient corpus to improve the quality of translation by some degree.

So, after a little digging I came across this project, which seems like just the start needed to improve the model.

Now, I don't know who might have the funding to train the M2M 100 model on 100 GPU's, but at least there is a reference here on one way to accomplish the objective.

normanhh3 commented 3 years ago

Looks like this corpus, is already well into the knowledge space of the M2M team, which is cool.