anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing
http://anoopkunchukuttan.github.io/indic_nlp_library/
MIT License
546 stars, 158 forks

Undo wrong Moses tokenization #36

Open anoopkunchukuttan opened 3 years ago

anoopkunchukuttan commented 3 years ago

Some datasets have been pre-processed with the Moses tokenizer (or some other tokenizer) that handles the halant incorrectly: it treats the halant as punctuation and adds spaces around it. Add functionality to the normalizer to undo this behaviour.
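
For illustration, here is a minimal sketch of what such an undo step might look like. It is not the library's implementation: the function name `undo_halant_tokenization` and the `HALANTS` list are hypothetical, and the list covers only the halant code points of nine major Indic scripts. It assumes the tokenizer inserted plain spaces or tabs, and it only joins across a halant when there is whitespace before it, since a halant cannot start a word while a word-final halant followed by a space can be legitimate.

```python
import re

# Halant (virama) code points, assumed for this sketch: Devanagari, Bengali,
# Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam.
HALANTS = "\u094D\u09CD\u0A4D\u0ACD\u0B4D\u0BCD\u0C4D\u0CCD\u0D4D"

# Halant with whitespace on both sides: almost certainly a tokenizer split.
_SPLIT_HALANT = re.compile(r"[ \t]+([" + HALANTS + r"])[ \t]+")
# Halant with whitespace only before it: a halant cannot begin a word.
_LEADING_SPACE = re.compile(r"[ \t]+([" + HALANTS + r"])")


def undo_halant_tokenization(text):
    """Rejoin halants that a tokenizer separated from their consonants.

    Hypothetical helper, not part of indic_nlp_library.
    """
    text = _SPLIT_HALANT.sub(r"\1", text)
    return _LEADING_SPACE.sub(r"\1", text)


if __name__ == "__main__":
    # KA, space, HALANT, space, YA, AA-matra (a wrongly split Devanagari word)
    broken = "\u0915 \u094D \u092F\u093E"
    print(undo_halant_tokenization(broken))
```

A space that follows a halant with no space before it is left alone, so a genuine word boundary after an explicit word-final halant is not merged.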

tathagata-raha commented 3 years ago

Hi @anoopkunchukuttan, could you add an example of this?