christos-c / bible-corpus

A multilingual parallel corpus created from translations of the Bible.
Creative Commons Zero v1.0 Universal

Problem with Amharic file #3

Closed by rasoolims 7 years ago

rasoolims commented 7 years ago

Hi,

I wrote a simple script that generates all pairs of aligned files (for GIZA++ and other aligners). The Amharic file seems to have a problem: its text contains illegal XML characters.

Thanks
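
For anyone who wants to reproduce this kind of report, the following is a minimal sketch of a well-formedness check using Python's standard `xml.etree.ElementTree` parser. It is not the script mentioned above; the function name `check_well_formed` is hypothetical, and the check assumes the file content has already been read into a string.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else (line, column, message).

    Hypothetical helper: it only reports the first error the parser hits.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple for the failure point.
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    # A bare '<<' inside element text is enough to break the parse.
    broken = "<doc><seg>some << quoted >> text</seg></doc>"
    print(check_well_formed(broken))
```

Running such a check on each file before pairing would flag a problem file early, with the line and column where parsing stopped.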

christos-c commented 7 years ago

Thanks for catching that, @rasoolims. I have replaced the few '<<' and '>>' sequences with '"'. I have also removed some stray HTML markup that I found at the end of some verses. Would you mind adding your script to my Corpus Tools package? Right now it only has scripts written in Java, and I know that a lot of researchers are more familiar with Python.
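
As a rough illustration of what a Python pairing script could look like, here is a minimal sketch that extracts verses from two XML documents and keeps only those present in both, one verse per line, as line-based aligners expect. It is not the script discussed in this thread. It assumes that each verse is a `seg` element carrying an `id` attribute, which is an assumption about the corpus layout, and the names `verses_by_id` and `align` are hypothetical.

```python
import xml.etree.ElementTree as ET


def verses_by_id(xml_text):
    """Map each verse id to its whitespace-normalized text.

    Assumes verses are <seg id="..."> elements (an assumption, see above).
    """
    root = ET.fromstring(xml_text)
    verses = {}
    for seg in root.iter("seg"):
        seg_id = seg.get("id")
        # Join nested text and collapse runs of whitespace and newlines.
        text = " ".join("".join(seg.itertext()).split())
        if seg_id and text:
            verses[seg_id] = text
    return verses


def align(source_xml, target_xml):
    """Return (source_lines, target_lines) for verses found in both inputs."""
    source = verses_by_id(source_xml)
    target = verses_by_id(target_xml)
    # Iterating over `source` keeps the verse order of the source document.
    shared = [seg_id for seg_id in source if seg_id in target]
    return [source[i] for i in shared], [target[i] for i in shared]
```

Writing the two returned lists to separate files, one verse per line, gives a sentence-aligned pair; verses missing from either side are skipped so the line counts always match.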