KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Parallel crawl multilingual websites to generate corpora for machine translation #91

Open Natkeeran opened 5 years ago

Natkeeran commented 5 years ago

We need parallel corpora in multiple languages to train machine translation models. One way to generate this data is to crawl multilingual websites.

The tooling to do this has been developed here: https://github.com/bitextor/bitextor
European language datasets: https://paracrawl.eu/releases.html
Mozilla machine translation project: https://browser.mt (only for European languages)

The Sri Lankan government and, to a lesser extent, the Tamil Nadu government have many multilingual websites, for example parliament.lk and tn.gov.in. We need to crawl and clean up/process these websites to create a Tamil - English - Sinhala corpus.
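As a rough illustration of the "process" step, here is a minimal sketch of how crawled pages might be paired across languages. It assumes, purely for illustration, that a site encodes the language as a path segment such as /en/, /ta/ or /si/. That layout, the example.org URLs, and the function name `pair_urls_by_language` are all hypothetical; real sites may use subdomains, query parameters, or unrelated URLs per language, in which case a tool like bitextor that aligns on content would be the better fit.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: the language is a path segment (en = English, ta = Tamil, si = Sinhala).
LANG_CODES = {"en", "ta", "si"}


def pair_urls_by_language(urls, lang_codes=LANG_CODES):
    """Group URLs that differ only in their language path segment."""
    groups = defaultdict(dict)
    for url in urls:
        parts = urlsplit(url)
        segments = parts.path.split("/")
        for i, seg in enumerate(segments):
            if seg in lang_codes:
                # Swap the language segment for a placeholder to build a shared key.
                key_path = "/".join(segments[:i] + ["{lang}"] + segments[i + 1:])
                groups[(parts.netloc, key_path)][seg] = url
                break
    # Keep only pages that exist in at least two languages.
    return [group for group in groups.values() if len(group) >= 2]


if __name__ == "__main__":
    # Hypothetical URLs, used only to show the call.
    crawled = [
        "https://example.org/en/news/1.html",
        "https://example.org/ta/news/1.html",
        "https://example.org/si/news/1.html",
        "https://example.org/en/news/2.html",
    ]
    for group in pair_urls_by_language(crawled):
        print(group)
```

Pairing by URL only yields candidate document pairs; sentence-level alignment and cleanup would still be needed before the text is usable as training data.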

ravi-annaswamy commented 5 years ago

Natkeeran, the website wsws.org has the same articles in Tamil, English, Sinhala, and other languages too. It seems oriented toward certain political beliefs, but the translation quality is amazing. I wrote a crawler a while ago to parse the articles and pair them up; I will try to locate and share it.

ravi-annaswamy commented 5 years ago

Here is a gist. Sorry, this is code I wrote for personal use, so it is not cleaned up yet, but it can give you an idea of the approach and some tricks.

https://gist.github.com/ravi-annaswamy/c373d845c95b5ee2a97bd51578aebfb4

Natkeeran commented 4 years ago

@ravi-annaswamy Thank you for sharing.