asyml / forte

Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
Apache License 2.0

StanfordNLP sentence segmenter bug #152

Closed atif93 closed 4 years ago

atif93 commented 4 years ago

While finding sentence boundaries, the technique used to locate the end of a sentence can fail.
We are using find, which returns the first occurrence of a word in a sentence. This fails whenever the word being searched for appears more than once in the sentence.

https://github.com/asyml/forte/blob/master/forte/processors/stanfordnlp_processor.py#L72
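
A minimal sketch of the failure mode, assuming the end offset is computed by searching for the sentence's last word with str.find. It does not reproduce the linked processor code: the function names sentence_span_naive and sentence_span_cursor and the sample text are hypothetical. The second function shows one possible way around the problem, matching every token in order and advancing a cursor, and it assumes each token appears verbatim in the text.

```python
from typing import List, Tuple


def sentence_span_naive(text: str, words: List[str], start: int = 0) -> Tuple[int, int]:
    """Locate a sentence by searching only for its first and last word."""
    begin = text.find(words[0], start)
    last = words[-1]
    # str.find returns the FIRST match at or after `begin`, so a repeated
    # word makes the sentence end too early.
    end = text.find(last, begin) + len(last)
    return begin, end


def sentence_span_cursor(text: str, words: List[str], start: int = 0) -> Tuple[int, int]:
    """Locate a sentence by matching every token in order (hypothetical fix)."""
    cursor = start
    begin = -1
    for word in words:
        pos = text.find(word, cursor)
        if pos == -1:
            # Assumption violated: the token is not a verbatim substring.
            raise ValueError(f"token {word!r} not found after offset {cursor}")
        if begin == -1:
            begin = pos
        # Move past this token so the next search cannot match it again.
        cursor = pos + len(word)
    return begin, cursor


text = "time after time"
words = ["time", "after", "time"]

print(sentence_span_naive(text, words))   # (0, 4): stops at the first "time"
print(sentence_span_cursor(text, words))  # (0, 15): covers the whole sentence
```

The cursor version costs one search per token instead of two per sentence, but it never matches an earlier duplicate because each search starts after the previous token.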

hunterhector commented 4 years ago

Thanks for spotting this. Let's try to add the failure cases to tests, and come up with a universal solution for such cases.

hunterhector commented 4 years ago

Closing this; we will solve these problems with the universal solution tracked in https://github.com/asyml/forte/issues/86