fnl / segtok

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
MIT License

Over-splitting on quotes with names. #16

Open jakepoz opened 6 years ago

jakepoz commented 6 years ago

We are seeing a few issues with segtok being over-eager to split quoted sentences with names directly after the quoted section.

Ex.

"Good morning," said Harry.
"Good morning?" asked Harry.
"Good morning!" exclaimed Harry.

All of those split correctly. However, consider the following variations:

"Good morning," Harry said. -> Correct
"Good morning?" Harry asked. -> Splits into two.
"Good morning!" Harry exclaimed. -> Splits into two.

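For anyone who wants to check a splitter against these examples, here is a minimal sketch of a test harness. It uses only the Python standard library. `naive_split` is a deliberately simple stand-in that happens to reproduce the pattern reported above (it splits after sentence-final punctuation only when the next word is capitalized); it is not segtok's actual rule set, and the names `CASES`, `naive_split`, and `oversplit_cases` are made up for illustration.

```python
import re

# The six examples from this report, each of which should stay one sentence.
CASES = [
    ('"Good morning," said Harry.', 1),
    ('"Good morning?" asked Harry.', 1),
    ('"Good morning!" exclaimed Harry.', 1),
    ('"Good morning," Harry said.', 1),
    ('"Good morning?" Harry asked.', 1),
    ('"Good morning!" Harry exclaimed.', 1),
]

# Stand-in splitter (an assumption, NOT segtok): break on whitespace that
# follows . ? ! (optionally plus a closing quote) and precedes a capital
# letter or an opening quote.
_BOUNDARY = re.compile(r'(?:(?<=[.?!])|(?<=[.?!]"))\s+(?=[A-Z"])')


def naive_split(text):
    """Split text into sentences with the naive boundary rule above."""
    return [part for part in _BOUNDARY.split(text) if part]


def oversplit_cases(split):
    """Return the examples for which `split` yields too many sentences."""
    return [text for text, expected in CASES
            if len(list(split(text))) > expected]


if __name__ == "__main__":
    for text in oversplit_cases(naive_split):
        print("oversplit:", text)
```

With segtok installed, the same harness should accept its splitter in place of `naive_split`, assuming `split_single` from `segtok.segmenter` is the entry point you use; `oversplit_cases` only needs a callable that takes a string and returns an iterable of sentences.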
fnl commented 6 years ago

Yes, I agree: sentences in quotes that end in an exclamation or question mark and are followed by a so-called speech tag should never be split, even if the speech tag begins with a proper noun, as in your two examples with Harry that segtok currently oversplits.

On Mon, Aug 13, 2018, at 19:28, Jake Poznanski wrote:

> We are seeing a few issues with segtok being over-eager to split quoted sentences with names directly after the quoted section.
>
> Ex.
>
> "Good morning," said Harry.
> "Good morning?" asked Harry.
> "Good morning!" exclaimed Harry.
>
> All of those split correctly. However consider the following variations:
>
> "Good morning," Harry said. -> Correct
> "Good morning?" Harry asked. -> Splits into two.
> "Good morning!" Harry exclaimed. -> Splits into two.

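Until the segmenter handles this case itself, the rule described in this reply could be approximated as a post-processing step. The sketch below is only an illustration under stated assumptions, not segtok's implementation: it rejoins a segment that ends in `?"` or `!"` with the next segment when that next segment looks like a short speech tag. The verb list `SPEECH_VERBS`, the four-word limit, and the function names are all hypothetical.

```python
import re

# Hypothetical, incomplete list of reporting verbs (not taken from segtok).
SPEECH_VERBS = {"said", "asked", "exclaimed", "replied", "shouted", "whispered"}

# Matches a segment ending in ? or ! directly before a closing quote.
QUOTE_END = re.compile(r'[?!]["\u201d]$')


def looks_like_speech_tag(segment):
    """True for short segments such as 'Harry asked.' that end in a speech verb."""
    words = segment.rstrip(".").split()
    return 1 < len(words) <= 4 and words[-1].lower() in SPEECH_VERBS


def merge_speech_tags(segments):
    """Rejoin quoted questions/exclamations with a speech tag that follows them."""
    merged = []
    for segment in segments:
        if merged and QUOTE_END.search(merged[-1]) and looks_like_speech_tag(segment):
            merged[-1] = merged[-1] + " " + segment
        else:
            merged.append(segment)
    return merged


if __name__ == "__main__":
    print(merge_speech_tags(['"Good morning?"', "Harry asked."]))
    print(merge_speech_tags(['"Good morning!"', "Harry left the room."]))
```

A heuristic like this only covers tags of the form "name(s) + verb" and will miss tags such as "Harry asked quietly."; a real fix inside the segmenter would more likely adjust the split rules than patch the output afterwards.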