diasks2 / pragmatic_segmenter

Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
MIT License

Unexpected sentence break when parentheses immediately follow abbreviation with period #46

Closed: reczy closed this issue 6 years ago

reczy commented 6 years ago

Hi Kevin - first of all, thanks for your work on this gem.

I'd like to report the following unexpected behavior:

Example 1: Unexpected Result:

Note the period in Inc.

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc.", "(“Company A”), and PragmaticSegmenterExampleCompanyB Inc.", "(“Company B”)."]

Example 2: Expected Result:

No period in Inc

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc (“Company A”), and PragmaticSegmenterExampleCompanyB Inc (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc (“Company A”), and PragmaticSegmenterExampleCompanyB Inc (“Company B”)."]

Example 3: Expected Result:

Note the period in Inc., but now there's text between Inc. and the parentheses

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc., a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc., a fake corporation (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc., a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc., a fake corporation (“Company B”)."]

Example 4: Expected Result:

Same as Example 3 but without the comma after Inc.

PragmaticSegmenter::Segmenter.new(text: 'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. a fake corporation (“Company B”).', clean: false).segment
["The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. a fake corporation (“Company B”)."]

diasks2 commented 6 years ago

@reczy Thanks for reporting! I'll add those failing cases as specs and look into the issue.
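
As a rough illustration of what checking those cases could look like, here is a minimal sketch. It is not the gem's actual spec suite: `split_cases` is a hypothetical helper, and the segmenter is passed in as any callable so the check can run without the gem installed. All four inputs from the report are expected to come back as a single segment.

```ruby
# Inputs from Examples 1-4 of the report; each should remain one sentence.
single_sentence_cases = [
  'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. (“Company B”).',
  'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc (“Company A”), and PragmaticSegmenterExampleCompanyB Inc (“Company B”).',
  'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc., a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc., a fake corporation (“Company B”).',
  'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc. a fake corporation (“Company A”), and PragmaticSegmenterExampleCompanyB Inc. a fake corporation (“Company B”).'
]

# Hypothetical helper (not part of pragmatic_segmenter): returns the inputs
# that the given segmenter does NOT return as exactly one segment.
# `segmenter` is any object that responds to #call and returns an Array.
def split_cases(texts, segmenter)
  texts.reject { |text| segmenter.call(text) == [text] }
end

# With the gem installed, the segmenter could be wrapped like this
# (same call as in the report above):
#   require 'pragmatic_segmenter'
#   gem_segmenter = lambda do |text|
#     PragmaticSegmenter::Segmenter.new(text: text, clean: false).segment
#   end
#   split_cases(single_sentence_cases, gem_segmenter)
```

An empty array from `split_cases` would mean every case stayed in one piece; any returned string is an input that was broken apart.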

diasks2 commented 6 years ago

@reczy this should now be fixed in the latest version: 0.3.19. Let me know if you still experience any issues.
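
For anyone who cannot upgrade yet, one possible stopgap is to post-process the segments. The sketch below is an assumption-laden workaround, not how the gem fixes the issue: `merge_parenthetical_segments` is a hypothetical helper that re-joins any segment starting with `(` onto a preceding segment that ends with a period, using a single space. It would also merge a genuinely separate sentence that happens to begin with a parenthesis, so it only suits text like the examples above.

```ruby
# Hypothetical helper, not part of pragmatic_segmenter.
# Re-joins a segment that starts with "(" onto the previous segment
# when that previous segment ends with a period.
def merge_parenthetical_segments(segments)
  segments.each_with_object([]) do |segment, merged|
    if !merged.empty? && segment.start_with?('(') && merged.last.end_with?('.')
      # Assumes the original text had exactly one space at the split point.
      merged[-1] = "#{merged.last} #{segment}"
    else
      merged << segment
    end
  end
end

# Output reported in Example 1 of this issue.
broken = [
  'The parties to this Agreement are PragmaticSegmenterExampleCompanyA Inc.',
  '(“Company A”), and PragmaticSegmenterExampleCompanyB Inc.',
  '(“Company B”).'
]

puts merge_parenthetical_segments(broken)
```

Applied to the Example 1 output, this yields one segment identical to the original input text.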

reczy commented 6 years ago

Looks like it's working for me as well! Thank you @diasks2