aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Usage of newline character in sentence tokenizer #146

Open devikasondhi opened 6 years ago

devikasondhi commented 6 years ago

Hello,

I'm not sure that treating the newline character as a sentence boundary is the right approach. Consider input like: "Hey,\n I'm still reading."

Shouldn't this qualify as a single sentence? The sentence tokenizer splits this into two, which I feel shouldn't be the case:

Text("Hey,\n I'm still reading.").sentences
[Sentence("Hey,"), Sentence("I'm still reading.")]

Am I missing something?

Thanks.
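
A possible workaround, offered only as a sketch: if single newlines in the input are soft line wraps rather than real boundaries, they could be collapsed into spaces before the text reaches the tokenizer. The helper name join_soft_newlines is hypothetical and not part of polyglot, and the assumption that a blank line marks a real paragraph break may not hold for every input.

```python
import re


def join_soft_newlines(text: str) -> str:
    """Collapse single newlines into spaces; keep blank-line paragraph breaks.

    Hypothetical helper, not part of polyglot. Assumes a lone newline is a
    soft wrap and a blank line is a genuine paragraph boundary.
    """
    # Split on blank lines (two newlines, optionally with whitespace between).
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, replace a newline and any adjacent spaces or tabs
    # with a single space.
    joined = [re.sub(r"[ \t]*\n[ \t]*", " ", p).strip() for p in paragraphs]
    return "\n\n".join(joined)


cleaned = join_soft_newlines("Hey,\n I'm still reading.")
print(cleaned)

# The cleaned string can then be passed to the call from the report above,
# e.g. Text(cleaned).sentences (not executed here).
```

Whether the tokenizer then returns a single sentence depends on its own segmentation rules, so the result is worth checking on real data.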