JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks
Other
96 stars 25 forks source link

Fix sentence splitter: sentences ending with acronyms #24

Closed nickto closed 5 years ago

nickto commented 5 years ago

A sentence ending with an acronym was dot distinguished from an initial followed by a period in the middle of the sentence. E.g., "Adamson is not from USA. They are from Europe" was considered as a single sentence because "USA." was treated as an initial.

  1. Added a requirement for the initial to be only one letter long: "A." in "A. Adamson" is still treated as an initial, while "USA" in "from USA. They" is not.
  2. Added a test for such case.
codecov-io commented 5 years ago

Codecov Report

Merging #24 into master will not change coverage. The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #24   +/-   ##
=======================================
  Coverage   80.86%   80.86%           
=======================================
  Files           9        9           
  Lines         277      277           
=======================================
  Hits          224      224           
  Misses         53       53
Impacted Files Coverage Δ
src/sentences/sentence_splitting.jl 88.13% <100%> (ø) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4d877dc...e3f8d93. Read the comment docs.

oxinabox commented 5 years ago

Good catch, do you think it would be better use use \b rather than \s ? \b is word-boundry, where as \s is white-space. I think \b would work better if at the start of the string,

nickto commented 5 years ago

Good catch, do you think it would be better use use \b rather than \s ? \b is word-boundry, where as \s is white-space. I think \b would work better if at the start of the string,

Changed it to \b and added a test that fails when using \s, thanks!

oxinabox commented 5 years ago

LGTM, will merge when tests pass

oxinabox commented 5 years ago

thank!