diasks2 / pragmatic_segmenter

Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
MIT License

Take advantage of non-breaking spaces #53

Open hftf opened 6 years ago

I have a corpus of text that often uses explicit non-breaking spaces (NBSP, U+00A0). They are mainly used to keep words of the same sentence together. They often appear after sentence-medial terminal punctuation (.!?) and before short sentence-final words (as in G. F. Handel composed Water Music for George I.). They were inserted to improve both document layout and parsing.

Consider the following four cases:

1.  'Peter Pan is a J. M. Barrie play.'   # no NBSP
2.  'Peter Pan is a J. M. Barrie play.'   # NBSP after J.
3.  'Peter Pan is a J. M. Barrie play.'   # NBSP after M.
4.  'Peter Pan is a J. M. Barrie play.'   # NBSP after J. and M.
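Because an NBSP and an ordinary space look identical in a listing like the one above, here is a minimal sketch that builds the four inputs with explicit U+00A0 characters. The names `NBSP`, `CASES`, and `show_nbsp` are illustrative only and do not come from the gem.

```ruby
# Illustrative only: the four inputs with explicit U+00A0 characters.
NBSP = "\u00A0"

CASES = {
  1 => "Peter Pan is a J. M. Barrie play.",             # no NBSP
  2 => "Peter Pan is a J.#{NBSP}M. Barrie play.",       # NBSP after J.
  3 => "Peter Pan is a J. M.#{NBSP}Barrie play.",       # NBSP after M.
  4 => "Peter Pan is a J.#{NBSP}M.#{NBSP}Barrie play."  # NBSP after J. and M.
}.freeze

# Replace each NBSP with a visible marker, e.g. when debugging a corpus.
def show_nbsp(text)
  text.gsub(NBSP, "[NBSP]")
end

CASES.each { |number, text| puts "#{number}. #{show_nbsp(text)}" }
```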

I was surprised that only the first case was segmented correctly out of the box (incorrect results are marked with *):

1.  ['Peter Pan is a J. M. Barrie play.']
2. *['Peter Pan is a J.', ' M.', 'Barrie play.']
3. *['Peter Pan is a J. M.', ' Barrie play.']
4. *['Peter Pan is a J.', ' M.', ' Barrie play.']

Of course, it would be easy to just translate all of them to normal spaces and call it a day.
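For reference, that workaround is a one-liner in Ruby. This is a sketch; `normalize_nbsp` is a hypothetical helper name, not something the gem provides.

```ruby
NBSP = "\u00A0"

# Replace every NBSP with an ordinary space before segmenting.
def normalize_nbsp(text)
  text.tr(NBSP, " ")
end

normalize_nbsp("Peter Pan is a J.#{NBSP}M.#{NBSP}Barrie play.")
# => "Peter Pan is a J. M. Barrie play."
```

The normalized string can then be passed to the segmenter as usual, for example via `PragmaticSegmenter::Segmenter.new(text: ...).segment`, at the cost of discarding whatever information the NBSPs carried.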


But non-breaking spaces can serve as useful disambiguation to a pragmatic sentence segmenter.

Not every sentence can be accurately segmented using a limited number of rules, so taking advantage of non-breaking spaces can improve the results in some trickier cases without making the checks much more complicated.
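To make the idea concrete, here is a rough sketch of one way to honor an NBSP outside the gem: a post-processing step that glues a segment back onto the previous one whenever it starts with an NBSP. It rests on the assumption that an NBSP always means "no sentence break here". `merge_nbsp_segments` is a hypothetical helper, not part of Pragmatic Segmenter, and the input below is the output reported for case 4 above.

```ruby
NBSP = "\u00A0"

# Hypothetical post-processing step, not part of the gem: append a segment
# to the previous one whenever it begins with an NBSP, assuming the author
# used the NBSP to mean "no sentence break here".
def merge_nbsp_segments(segments)
  segments.each_with_object([]) do |segment, merged|
    if segment.start_with?(NBSP) && !merged.empty?
      merged[-1] = merged[-1] + segment
    else
      merged << segment
    end
  end
end

merge_nbsp_segments(["Peter Pan is a J.", "#{NBSP}M.", "#{NBSP}Barrie play."])
# => ["Peter Pan is a J.\u00A0M.\u00A0Barrie play."]
```

This only covers breaks that fall directly on an NBSP. It does not repair case 2, where the stray break after M. falls on an ordinary space, and it makes no use of an NBSP placed before a sentence-final word, so handling this inside the segmenter's own rules would still be preferable.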

(Disclaimer: I have not looked at the rules and do not claim to understand the internals of the program.)

I will provide a few examples to demonstrate how this may be useful. You are welcome to incorporate them as new test cases, even if you ultimately decide not to bother with non-breaking spaces.


Cases 5–7 are extremely similar, but case 7 surprisingly produces a different result. In case 8, adding an NBSP seems to fix the problem, presumably without making This behave like He and They internally.

5.  'Sri Lanka was conquered by the Cholas and Raja Raja I. They moved the capital.'   # no NBSP
6.  'Sri Lanka was conquered by Raja Raja I. He moved the capital to Polonnaruwa.'     # no NBSP
7.  'Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.'   # no NBSP
8.  'Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.'   # NBSP before I.

5.  ['Sri Lanka was conquered by the Cholas and Raja Raja I.', 'They moved the capital.']
6.  ['Sri Lanka was conquered by Raja Raja I.', 'He moved the capital to Polonnaruwa.']
7. *['Sri Lanka was conquered by Raja Raja I. This moved the capital to Polonnaruwa.']
8.  ['Sri Lanka was conquered by Raja Raja I.', 'This moved the capital to Polonnaruwa.']

Cases 9–10 show the same sort of “fix”:

9.   'Lu Xun wrote The True Story of Ah Q. He was a Chinese author.'   # no NBSP
10.  'Lu Xun wrote The True Story of Ah Q. He was a Chinese author.'   # NBSP before Q.

9.  *['Lu Xun wrote The True Story of Ah Q. He was a Chinese author.']
10.  ['Lu Xun wrote The True Story of Ah Q.', 'He was a Chinese author.']

Cases 11–14 show that the problem is much harder than it looks, because Feng S. He is a plausible Chinese name:

11.  'Feng S. He was a Chinese diplomat who secretly saved 3,000 Austrian Jews.'  # no NBSP
12.  'He said the story of Feng S. He was a secret. He was a Chinese diplomat.'   # no NBSP
13.  'He said the story of Feng S. He was a secret. He was a Chinese diplomat.'   # NBSP after S.
14.  'I learned the son of Feng S. He was a Chinese-American microbiologist.'     # no NBSP

11.  ['Feng S. He was a Chinese diplomat who secretly saved 3,000 Austrian Jews.']
12.  ['He said the story of Feng S. He was a secret.', 'He was a Chinese diplomat.']
13. *['He said the story of Feng S.', ' He was a secret.', 'He was a Chinese diplomat.']
14.  ['I learned the son of Feng S. He was a Chinese-American microbiologist.']

The non-breaking space currently has no effect in cases 15 and 16. However, such cases could easily be segmented correctly by honoring the NBSP, without adding more rules or adding rare words like .NET to an explicit list.

15.  'I want to learn Microsoft’s .NET framework.'   # no NBSP
16.  'I want to learn Microsoft’s .NET framework.'   # NBSP before .NET

15. *['I want to learn Microsoft’s .', 'NET framework.']
16. *['I want to learn Microsoft’s .', 'NET framework.']

Resolving this issue could help with the following:

  1. Improve accuracy for corpora that (at least partially) follow an existing convention of keeping words together with non-breaking spaces.
  2. Improve the existing rules by scrutinizing issues with the cases provided.

I favor this program for my use case because of its stance on embedded quotations.