lixiangnlp closed this issue 9 years ago
If a split is ambiguous, as in this case, or does not follow correct orthography, as is also the case here, the patterns segtok uses do not indicate a split.
In other words, if you use segtok on "noisy" texts with poor orthography, you may run into trouble.
Either clean up those cases before feeding the data to segtok, or you might be better off with a statistical splitter like Punkt, which tends to oversplit on text with good orthography.
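As a rough illustration of the "clean up first" option, here is a minimal sketch that repairs the two orthography problems in the reported sentence: the space before the exclamation mark and the lowercase letters after terminal punctuation. The helper name `normalize_orthography` is hypothetical and not part of segtok, and the two regular expressions are assumptions that only suit simple cases like this one.

```python
import re


def normalize_orthography(text: str) -> str:
    """Hypothetical pre-cleaning helper; not part of segtok."""
    # Remove whitespace before terminal punctuation: "you !" -> "you!"
    text = re.sub(r"\s+([.!?])", r"\1", text)
    # Uppercase a lowercase ASCII letter that follows terminal punctuation
    # and whitespace: "you. he" -> "you. He"
    text = re.sub(
        r"([.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


cleaned = normalize_orthography(
    "I love you. he said he hated you ! and he is a boy."
)
print(cleaned)  # I love you. He said he hated you! And he is a boy.

# The cleaned string could then be passed to split_single() as in the report.
```

Note that rules this blunt will also capitalize the word after an abbreviation such as "e.g." and will mangle spaced ellipses, so they would need tightening for real data.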
I hope this advice helps; I will close the issue.
split_single("I love you. he said he hated you ! and he is a boy.")
The result is: I love you. he said he hated you ! and he is a boy.
I think the correct result is:
I love you. he said he hated you ! and he is a boy.