fnl / syntok

Text tokenization and sentence segmentation (segtok v2)
MIT License
201 stars 34 forks

Fix false positive month abbreviation #24

Closed. peter-lang-dealogic closed this 2 years ago

peter-lang-dealogic commented 2 years ago

Fixes #22. Some month abbreviations are also valid English words (e.g. "set", "ago"), which causes false positives if we treat month abbreviations as generic abbreviations, because they would then consume such sentence endings. The code already contains logic to check whether a month abbreviation is preceded by a number; see the test case: "Am 13. Jän. 2006 war es regnerisch."
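To illustrate the idea, here is a minimal standalone sketch. It is not syntok's actual implementation: the names `MONTH_ABBREVIATIONS`, `is_month_abbreviation`, and `split_sentences` are hypothetical, the month list is a small made-up subset, and the whitespace tokenization is deliberately naive. It only shows how a month abbreviation can be restricted to cases where a day number precedes it.

```python
import re

# Hypothetical subset of month abbreviations; "set" and "ago" are also English words.
MONTH_ABBREVIATIONS = {"jan", "jän", "feb", "set", "ago", "dez"}

# A short day number, optionally written as an ordinal with a trailing dot.
DAY_NUMBER = re.compile(r"\d{1,2}\.?")


def is_month_word(token):
    """True if the token, ignoring a trailing dot and case, is in the month list."""
    return token.rstrip(".").lower() in MONTH_ABBREVIATIONS


def is_month_abbreviation(tokens, i):
    """Treat tokens[i] as a month abbreviation only if a day number precedes it."""
    if i == 0 or not is_month_word(tokens[i]):
        return False
    return DAY_NUMBER.fullmatch(tokens[i - 1]) is not None


def split_sentences(text):
    """Naive splitter: break after every token ending in '.', except inside dates."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        # "Jän." after "13." is part of a date, so the dot does not end the sentence.
        if is_month_abbreviation(tokens, i):
            continue
        # "13." directly before a month word is an ordinal day number, not an ending.
        next_is_month = i + 1 < len(tokens) and is_month_word(tokens[i + 1])
        if DAY_NUMBER.fullmatch(token) and next_is_month:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With this sketch, the date in the German test case stays inside one sentence, while "set." and "ago." without a preceding number are treated as ordinary sentence endings. A number that happens to precede such a word (e.g. "3 set.") would still be misread, which is the kind of edge case a real tokenizer has to handle more carefully.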

fnl commented 2 years ago

Needed one more update to support short numbers (two digits or fewer), but now the month handling still works as long as there is some kind of digit context after the month. Thank you for the fix!