Open Natkeeran opened 5 years ago
Natkeeran, the website wsws.org publishes the same articles in Tamil, English, Sinhala, and other languages. It is oriented toward a particular political viewpoint, but the translation quality is amazing. I wrote a crawler a while ago to parse the articles and pair them up; I will try to locate and share it.
Here is a gist. Sorry, this is code for my personal use and not yet cleaned up, but it can give you an idea of the approach and some tricks.
https://gist.github.com/ravi-annaswamy/c373d845c95b5ee2a97bd51578aebfb4
@ravi-annaswamy Thank you for sharing.
We need parallel corpora in multiple languages to train machine translation algorithms. One method to generate this data is by crawling multilingual websites.
The tooling to do this has been developed here: https://github.com/bitextor/bitextor
European language datasets: https://paracrawl.eu/releases.html
Mozilla machine translation project: https://browser.mt (only for European languages)
The Sri Lankan government and, to a lesser extent, the Tamil Nadu government run many multilingual websites, for example parliament.lk and tn.gov.in. We need to crawl and clean up/process these websites to create a Tamil-English-Sinhala corpus.
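
To make the pairing step concrete, here is a rough sketch of one common trick: grouping crawled URLs that differ only in a language path segment. This is not how bitextor or the gist above works. The URL layout (`/en/`, `/ta/`, `/si/` as a path segment) and the function name `group_by_language` are assumptions for illustration; real sites such as parliament.lk may use a different scheme, and then the key function would need to change.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed language codes (ISO 639-1): English, Tamil, Sinhala.
LANGS = ("en", "ta", "si")


def group_by_language(urls, langs=LANGS):
    """Group URLs that differ only in a language path segment.

    Hypothetical helper: assumes each page URL contains its language
    code as a whole path segment, e.g. https://example.org/ta/news/1.html
    """
    groups = defaultdict(dict)
    for url in urls:
        parts = urlsplit(url)
        segments = parts.path.split("/")
        for i, seg in enumerate(segments):
            if seg in langs:
                # Key = host + path with the language segment masked out.
                masked = "/".join(segments[:i] + ["*"] + segments[i + 1:])
                groups[(parts.netloc, masked)][seg] = url
                break
    # Keep only pages that exist in every requested language.
    return [g for g in groups.values() if all(lang in g for lang in langs)]


if __name__ == "__main__":
    crawled = [
        "https://example.org/en/news/1.html",
        "https://example.org/ta/news/1.html",
        "https://example.org/si/news/1.html",
        "https://example.org/en/news/2.html",  # no Tamil/Sinhala version
    ]
    for group in group_by_language(crawled):
        print(group)
```

URL matching only gives candidate document pairs. Sentence-level alignment and cleanup would still be needed after this before the text is usable as a training corpus.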