facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

M2M Corpus Inclusion? #2766

Closed normanhh3 closed 4 years ago

normanhh3 commented 4 years ago

This is really just feedback for the authors of the M2M 100 model.

Based on the VentureBeat article I read, it sounds like there is a need to expand the text corpus to include more content for some languages so that less artificial material has to be created.

If so, have you considered that the Christian sacred text, the Bible, could be a tremendously valuable source for the development of this translation model? I propose this because Bible translators rely on a remarkably thorough human translation process that should, in theory, yield a substantial improvement.

If you are interested in following this path further, this repository could serve as a starting point.

I have also cross-posted an issue over there so that team can reference the M2M 100 work released here.

https://github.com/christos-c/bible-corpus/issues/11

Thank you for your contribution to the advancement of communication across the world through the development of this tool.

myleott commented 4 years ago

Thanks for the suggestion! While I didn’t work on this specific project, the Bible and other similar resources are often used in this kind of work. Bible translations are also included in the OPUS project, which our team sometimes uses for low resource MT projects: http://opus.nlpl.eu/

One of the risks to be mindful of is that the style of writing in the Bible is quite a bit different than, say, news articles or social media posts. Typically one wants to train these systems on the same style (or “domain”) of writing that the system will eventually be applied to.
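The domain-mismatch point can be illustrated with a quick vocabulary-overlap check. This is only a toy sketch with made-up sample sentences (none of it comes from the thread or from fairseq); real domain analysis would use large corpora and stronger measures such as language-model perplexity.

```python
# Toy illustration of domain mismatch: how much vocabulary two text
# "domains" share. The sample sentences below are invented for
# illustration only.

def vocab(sentences):
    """Lowercased token set for a list of sentences."""
    return {tok for s in sentences for tok in s.lower().split()}

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

# Hypothetical snippets standing in for two writing styles.
biblical = [
    "And it came to pass that he went forth",
    "Blessed are they that mourn",
]
news = [
    "The central bank raised interest rates on Tuesday",
    "Officials said the talks would resume next week",
]

overlap = jaccard(vocab(biblical), vocab(news))
print(f"vocabulary overlap: {overlap:.2f}")
```

Even on these tiny samples the shared vocabulary is essentially nil, which is the intuition behind training on text from the same domain the system will serve.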

huihuifan commented 4 years ago

Thanks @normanhh3! Translations of the Bible have been very useful for various NLP efforts --- check out OPUS as Myle mentions, or https://link.springer.com/article/10.1007/s10579-014-9287-y. The Masakhane-MT project, which focuses on African languages, trains on the JW300 corpus (https://www.aclweb.org/anthology/P19-1310/), which I believe is based on Jehovah's Witness magazines.

In this work, we mainly focused on many-to-many corpora created with data mining, but indeed, in the future, incorporating human translations from various sources (including WMT, WAT, IWSLT, etc.) would be very useful. Check out Section 6.4 of the paper, where I also list some existing high-quality resources for low-resource languages, such as various African languages, that have been created by the community.