fnl / segtok

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
MIT License

An issue about segment #4

Closed: lixiangnlp closed this issue 9 years ago

lixiangnlp commented 9 years ago

split_single("I love you. he said he hated you ! and he is a boy.")

The result is: I love you. he said he hated you ! and he is a boy.

I think the correct result is

I love you. he said he hated you ! and he is a boy.

fnl commented 9 years ago

If a split is ambiguous, as it is here, or the text does not follow correct orthography, which is also the case here, the patterns segtok uses do not indicate a split.

I.e., if you wish to use segtok on "noisy" texts with poor orthography, you might find yourself in trouble.

Either clean up those cases before feeding the data to segtok, or you might be better off with a statistical splitter like Punkt, which tends to oversplit on text with good orthography.
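As a rough illustration of the "clean up first" option, here is a minimal sketch of a pre-processing step for the sentence from this issue. The helper `normalize_orthography` is hypothetical and not part of segtok. It naively assumes that every `.`, `!` or `?` followed by whitespace ends a sentence, so it will wrongly capitalize after abbreviations such as "e.g." and is only meant to show the idea.

```python
import re

# Hypothetical helper, not part of segtok: a naive orthography clean-up.
# Assumption: every ".", "!" or "?" followed by whitespace ends a sentence.
_SPACE_BEFORE_TERMINAL = re.compile(r"\s+([.!?])")
_LOWER_AFTER_TERMINAL = re.compile(r"([.!?]\s+)([a-z])")


def normalize_orthography(text: str) -> str:
    """Remove whitespace before terminal punctuation and capitalize
    the first ASCII letter that follows a sentence terminal."""
    # "you !" becomes "you!"
    text = _SPACE_BEFORE_TERMINAL.sub(r"\1", text)
    # ". he" becomes ". He"
    return _LOWER_AFTER_TERMINAL.sub(
        lambda m: m.group(1) + m.group(2).upper(), text
    )


noisy = "I love you. he said he hated you ! and he is a boy."
cleaned = normalize_orthography(noisy)
print(cleaned)  # I love you. He said he hated you! And he is a boy.
```

The cleaned string could then be passed to `split_single` (imported from `segtok.segmenter`), which should have a better chance of finding the sentence boundaries once the text follows regular orthography.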

I hope this advice helps; I will close the issue.