fnl / syntok

Text tokenization and sentence segmentation (segtok v2)
MIT License

Splitting on single \n for sentence tokenization #14

Closed divyeshlad18 closed 4 years ago

divyeshlad18 commented 4 years ago

Hello Florian,

Thank you for developing such a powerful NLP library. I gotta say, I have tried all the NLP libraries for sentence tokenization and none of them even comes close to your creation.

I was just wondering whether there is a feature/parameter in segmenter.analyze that splits sentences on the occurrence of a single "\n".

Example: 1\n\nDissonance\n\nTuesday, February 2\nBoone Drake awoke before sunup with little recollection of the previous two days.

Right now the output is:

1

Dissonance

Tuesday, February 2\nBoone Drake awoke before sunup with little recollection of the previous two days.

Is there a way I can also split on single "\n", like this:

1

Dissonance

Tuesday, February 2

Boone Drake awoke before sunup with little recollection of the previous two days.

fnl commented 4 years ago

Thanks for the kind words, Divyesh!

The problem with making splitting on single line breaks possible is that it is a bit dangerous: in many cases, the break can fall in the middle of a sentence, whereas if there is an empty line in between, it certainly would be a paragraph break.

May I suggest you simply add a post-processor that splits the output once more on all single newlines, too? I guess that would be a simple lambda, if that really is all you need.
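A minimal sketch of such a post-processor, assuming the segmenter's output has already been rebuilt into one string per sentence (for example by joining each token's spacing and value). The names `split_on_newlines` and `segmented` are hypothetical and not part of syntok:

```python
def split_on_newlines(sentences):
    """Yield each sentence, split once more at every "\n" it contains."""
    for sentence in sentences:
        for part in sentence.split("\n"):
            part = part.strip()
            if part:  # drop empty pieces left by consecutive newlines
                yield part


# The three sentences reported in this thread, as plain strings (assumed input).
segmented = [
    "1",
    "Dissonance",
    "Tuesday, February 2\nBoone Drake awoke before sunup "
    "with little recollection of the previous two days.",
]

for line in split_on_newlines(segmented):
    print(line)
```

This splits on every line break without looking at context, so it will also cut sentences that merely wrap across lines. That is the risk described above, which is why it fits better as an opt-in step after segmentation.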

(In reply to Divyesh Lad's message of Tue, Aug 18, 2020, 13:58.)

divyeshlad18 commented 4 years ago

Thank you for the quick response, Florian.

It does make sense not to break a sentence in the middle just because of a single newline.

Also, thanks to your suggestion, I'm going to add a post-processing script that splits the sentences wherever a "\n" is detected.