The tokenize() function shouldn't split sentences on abbreviations like Dr. Fahad, Mr. Wayne etc

burhanharoon / N-Gram-Language-Model

It's a Python-based n-gram language model that calculates bigrams, the probability and smoothed (Laplace) probability of a sentence using bigrams, and the perplexity of the model.

The tokenize() function shouldn't split sentences on abbreviations like Dr. Fahad, Mr. Wayne etc #1

Open burhanharoon opened 2 years ago

burhanharoon commented 2 years ago

Right now the tokenize() function splits whenever a '.' character is found. Most of the time that is the correct way to split a file into sentences, but abbreviations such as Dr., Mr., and Mrs. sometimes appear in the middle of a sentence, and the sentence gets split right there. I want to enhance the regex so that it does not split sentences on abbreviations.
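One possible direction, as a rough sketch only: the current tokenize() code is not shown in this thread, so the function name `split_sentences` and the `ABBREVIATIONS` set below are hypothetical and not part of the repository. The idea is to find each candidate boundary with a regex and skip it when the word just before the period is a known abbreviation.

```python
import re

# Hypothetical abbreviation list (an assumption, not taken from the repo);
# entries are lowercase and written without the trailing period.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace,
    unless the period ends a known abbreviation."""
    sentences = []
    start = 0  # index where the current sentence begins
    for match in re.finditer(r"[.!?]\s+", text):
        # Word immediately before the punctuation mark.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group()[0] == "." and last_word in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        # Keep the punctuation mark with its sentence.
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Fahad met Mr. Wayne. They talked."))
```

This only handles abbreviations that are in the list, and a sentence that genuinely ends with one of them (for example "I saw the Dr. He left.") would not be split, so it may need adjusting to fit how tokenize() is actually written.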

DaudAhmad0303 commented 2 years ago

Please assign this issue to me

burhanharoon commented 2 years ago

@DaudAhmad0303 Do you still want to work on it?