Preserve abbreviation punctuation for Tokenization & adding more abbreviations for Sentence Splitting

The Marathi corpus has ~1M sentences and the Hindi corpus ~7M sentences that are incorrectly split because a few language-specific abbreviations are missing. Unfortunately, since the sentences are shuffled, there is no way to recover the original sentences. A few abbreviations I noticed are missing from sentence_tokenize.py: प्रा. (private), जि. (district). The tokenizer could also be changed to keep the trailing '.' attached to abbreviations, which would avoid these incorrect sentence splits.
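To illustrate the idea, here is a minimal sketch of an abbreviation-aware splitter. It is not the code in sentence_tokenize.py: the names `ABBREVIATIONS` and `split_sentences` are hypothetical, the abbreviation set contains only the two entries mentioned above, and whitespace tokenization is assumed for simplicity.

```python
# Hypothetical abbreviation set: only the two entries reported above.
ABBREVIATIONS = {"प्रा.", "जि."}

# Characters treated as sentence terminators in this sketch
# (including the Devanagari danda).
TERMINATORS = (".", "?", "!", "।")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, keeping known abbreviations intact.

    Tokens are obtained by whitespace splitting, so an abbreviation keeps
    its trailing '.' and can be matched against the abbreviation set.
    """
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        # A token ends a sentence only if it carries a terminator and is
        # not a known abbreviation.
        if token.endswith(TERMINATORS) and token not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks a final terminator.
        sentences.append(" ".join(current))
    return sentences


text = "तो पुणे जि. मध्ये राहतो. ती मुंबईत काम करते."
for sentence in split_sentences(text):
    print(sentence)
```

With an empty abbreviation set, the same input would be split after जि., which is the kind of error described here.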

A quick workaround is to keep only sentences that are 5-50 words long, since most sentences outside this range are affected by the problem. I have attached a sample file, mr_errors.txt, which contains a few of the incorrectly split sentences.
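The length-based workaround could look roughly like the sketch below. The function name `filter_by_length` is hypothetical, and words are counted by whitespace splitting, which is an assumption rather than how the corpus was necessarily tokenized.

```python
def filter_by_length(sentences, min_words=5, max_words=50):
    """Keep only sentences whose whitespace-separated word count
    falls within [min_words, max_words], inclusive."""
    return [
        sentence
        for sentence in sentences
        if min_words <= len(sentence.split()) <= max_words
    ]


# Example: the three-word fragment is dropped, the five-word line is kept.
sample = ["a b c", "a b c d e"]
print(filter_by_length(sample))
```

This only discards suspicious lines; it does not repair sentences that were already split incorrectly.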

anoopkunchukuttan / indic_nlp_library

Preserve abbreviation punctuation for Tokenization & adding more abbreviations for Sentence Splitting #30