Improper segmentation with proper names where middle initial is abbreviated

christian-storm commented 7 years ago

I've been trying out segtok and comparing it against the segmentations produced by CoreNLP, spaCy, and Punkt on the English Wikipedia.

Segtok works well in some cases where the others fail. However, there is one recurring case where segtok always splits prematurely, which is particularly noticeable given Wikipedia's encyclopedic nature. In particular, any sentence containing a name with an abbreviated middle initial (e.g. "First M. Last") seems to always split on the period after the middle initial.

Examples that segtok splits:

By 1968, Leger had returned to Canada's capital and was appointed as under-secretary of state, providing the administrative basis for Prime Minister Lester B. Pearson's foreign policy, and the policies on bilingualism and multiculturalism developed by the Cabinet chaired by Pearson's successor, Pierre Trudeau.

This simple model is commonly known as the adjacency list model, and was introduced by Dr. Edgar F. Codd after initial criticisms surfaced that the relational model could not model hierarchical data.

Seems like a bug but perhaps it is by design.

fnl commented 7 years ago

Thank you for pointing this out; it certainly seems like something that should be improved. I assume it should be possible to add a rule that does not split when the surrounding text looks like "Aaaa A. Aaaa". All it would take is adding a few test cases and then fixing them without breaking any of the existing ones. Feel free to submit a PR; otherwise I will look into it myself once I find some free time.
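To make the "Aaaa A. Aaaa" idea concrete, here is a minimal sketch in plain Python. It is not segtok's actual code: the naive baseline splitter, the pattern names, and `split_sentences` are all hypothetical and only illustrate a rule that rejoins a segment ending in "Aaaa A." with a following segment that starts with a capitalized word.

```python
import re

# Deliberately naive baseline (an assumption, not segtok's logic):
# split after ., ! or ? when followed by whitespace and an uppercase letter.
NAIVE_SPLIT = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Hypothetical rule: the previous segment ends in "Aaaa A." ...
ENDS_WITH_NAME_AND_INITIAL = re.compile(r"\b[A-Z][a-z]+ [A-Z]\.$")
# ... and the next segment starts with a capitalized word "Aaaa".
STARTS_WITH_CAPITALIZED_WORD = re.compile(r"[A-Z][a-z]+")


def split_sentences(text):
    """Split naively, then undo splits that fall inside "Aaaa A. Aaaa"."""
    segments = []
    for piece in NAIVE_SPLIT.split(text):
        if (
            segments
            and ENDS_WITH_NAME_AND_INITIAL.search(segments[-1])
            and STARTS_WITH_CAPITALIZED_WORD.match(piece)
        ):
            # Rejoin with a single space; the original whitespace is not kept.
            segments[-1] += " " + piece
        else:
            segments.append(piece)
    return segments


text = "The policy was shaped by Lester B. Pearson's cabinet. It was later revised."
for sentence in split_sentences(text):
    print(sentence)
```

A rule like this will also wrongly join a sentence that really does end in a capitalized word plus a single letter (for example "Plan B. Then ..."), so it would need test cases covering both directions.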

fnl commented 7 years ago

Done (69dc258), the feature is added. Christian, in case you find any other significant cases, please let me know. Thank you.

dpmccabe commented 7 years ago

Here are some examples of similar issues I've come across in a particular document:

Read and digest the booklet by D. Seyfort Ruegg and the introductory matter of A. MacDonald's tomes, both of which are on file. Further, the following books are highly recommended: G.B.J. Dreyfus and S.L. McClintock, The Svātantrika-Prāsaṅgika Distinction (Boston: Wisdom Publications, 2003). D. Seyfort Ruegg, The Buddhist Philosophy of the Middle (Boston: Wisdom Publications, 2010). K.A. Vose, Resurrecting Candrakīrti (Boston: Wisdom Publications, 2009). These provide essential background reading for this course.

segments as

Read and digest the booklet by D. Seyfort Ruegg and the introductory matter of A.
MacDonald's tomes, both of which are on file.
Further, the following books are highly recommended: G.B.J.
Dreyfus and S.L. McClintock, The Svātantrika-Prāsaṅgika Distinction (Boston: Wisdom Publications, 2003).
D. Seyfort Ruegg, The Buddhist Philosophy of the Middle (Boston: Wisdom Publications, 2010).
K.A.
Vose, Resurrecting Candrakīrti (Boston: Wisdom Publications, 2009).
These provide essential background reading for this course.

fnl commented 7 years ago

Thanks for reporting these three cases, Devin. However, they are not really similar to the original issue. Overall, I think these cases should be moved to one (or two, see below) new tickets, because none of the cases are directly related to the earlier issue, which was "First M. Second", so I will leave this issue closed.

The first one is an entirely different kind of issue; it is not even a middle name ("A. MacDonald").

The other two cases ("G.B.J. Dreyfus" and "K.A. Vose") are a bit more tricky and are not the same class of problem that Christian brought up. There is no consensus on whether two dots should be used at the end of a sentence after an abbreviation or not, so cases like "...in the U.S.A. However, ..." would not be properly split if those names were universally enforced to remain joined. For the second case (only) I see a good chance that, given it is right at the beginning of the sentence, the algorithm should preferentially not split such cases. But the Dreyfus case is something you'd probably need more context information to split correctly, like PoS tagging - which is beyond the scope of this regex-patterns-only splitter. I.e., those cases will take some research, I guess...
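The trade-off can be shown with a small sketch in plain Python. Again, this is not segtok's implementation: the naive baseline, the pattern, and `split_keeping_initials` are hypothetical names used only to show that a blanket "never split after dotted initials" rule fixes the name case and breaks the "U.S.A." case at the same time.

```python
import re

# Deliberately naive baseline (an assumption, not segtok's logic):
# split after ., ! or ? when followed by whitespace and an uppercase letter.
NAIVE_SPLIT = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Hypothetical blanket rule: a segment ending in two or more dotted
# capitals ("G.B.J.", "K.A.", "U.S.A.") is never treated as a sentence end.
ENDS_WITH_DOTTED_INITIALS = re.compile(r"(?:\b[A-Z]\.){2,}$")


def split_keeping_initials(text):
    """Split naively, then rejoin anything that follows dotted initials."""
    segments = []
    for piece in NAIVE_SPLIT.split(text):
        if segments and ENDS_WITH_DOTTED_INITIALS.search(segments[-1]):
            segments[-1] += " " + piece
        else:
            segments.append(piece)
    return segments


# Desired: stays one sentence, and it does.
print(split_keeping_initials("Recommended: G.B.J. Dreyfus wrote it."))

# Desired: two sentences, but the same rule glues them together.
print(split_keeping_initials("He lives in the U.S.A. However, he travels."))
```

Both inputs come back as a single segment, so a pattern-only rule cannot tell the two apart without further context.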

fnl commented 7 years ago

OK, opened issue #12 for the first case and #13 for the (two) second case(s).

fnl / segtok, issue #10