anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing
http://anoopkunchukuttan.github.io/indic_nlp_library/
MIT License

Preserve abbreviation punctuation for Tokenization & adding more abbreviations for Sentence Splitting #30

Open rhn19 opened 4 years ago

rhn19 commented 4 years ago

The Marathi corpus has ~1M sentences and the Hindi corpus has ~7M sentences that are incorrectly split because a few language-specific abbreviations are missing. Unfortunately, since the sentences are shuffled, there is no way to recover the original sentences. A few abbreviations I noticed are missing from sentence_tokenize.py: प्रा. (private), जि. (district). Tokenization could also keep the trailing '.' attached to abbreviations so that they do not trigger incorrect sentence splits.
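To illustrate the idea, here is a minimal sketch of abbreviation-aware splitting. It is not the code in sentence_tokenize.py; the function name `split_sentences`, the `ABBREVIATIONS` set, and the sample text are hypothetical, and the set holds only the two abbreviations mentioned above. It assumes whitespace-separated tokens and treats '.', '?', '!' and the danda '।' as sentence terminators.

```python
# Hypothetical abbreviation list for illustration only; a real list
# would be language-specific and much longer.
ABBREVIATIONS = {"प्रा.", "जि."}

# Characters assumed to end a sentence (including the Devanagari danda).
TERMINATORS = (".", "?", "!", "।")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, keeping the '.' on known abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        # A token ends a sentence only if it carries a terminator
        # and is not a known abbreviation.
        if token.endswith(TERMINATORS) and token not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that has no terminator.
        sentences.append(" ".join(current))
    return sentences


# Made-up sample text: the "जि." in the first sentence must not cause a split.
sample = "तो पुणे जि. मध्ये राहतो. ती मुंबईत राहते."
for sentence in split_sentences(sample):
    print(sentence)
```

Because the abbreviation keeps its '.', the token stays intact for downstream tokenization and the sentence boundary falls only on the real terminator.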

A quick workaround is to keep only sentences that are 5-50 words long; most of the sentences outside this range are affected by the problem. I have attached a sample file containing a few of the incorrectly split sentences: mr_errors.txt
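The length-based workaround could look roughly like this. The helper names `within_length_bounds` and `filter_by_length` are hypothetical, words are counted by splitting on whitespace, and the 5 and 50 bounds are assumed to be inclusive.

```python
def within_length_bounds(sentence, min_words=5, max_words=50):
    """Return True if the sentence has between min_words and max_words words."""
    word_count = len(sentence.split())  # whitespace-separated word count
    return min_words <= word_count <= max_words


def filter_by_length(sentences, min_words=5, max_words=50):
    """Keep only sentences whose word count lies within the given bounds."""
    return [s for s in sentences if within_length_bounds(s, min_words, max_words)]
```

This drops fragments produced by a wrong split as well as overly long lines, at the cost of also discarding legitimate short or long sentences.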