fnl / segtok

Segtok v2 is here: https://github.com/fnl/syntok -- A rule-based sentence segmenter (splitter) and a word tokenizer using orthographic features.
http://fnl.es/segtok-a-segmentation-and-tokenization-library.html
MIT License

Failure to split on abbreviations #9

Closed Klim314 closed 9 years ago

Klim314 commented 9 years ago

Noticed a case of a missed split of a seemingly simple structure.

'colonic colonization of clostridium spp. is associated with accumulation of tregs, which inhibits development of inflammatory lesions. to investigate whether infection with the clostridium leptum sp. can specifically induce tregs and/or tdcs bone marrow-derived dendritic cells were cultured in the presence or absence of c. leptum then co-cultured with cd4(+)cd25(-) t cells or not.'

The above text should have been split as follows

'colonic colonization of clostridium spp. is associated with accumulation of tregs, which inhibits development of inflammatory lesions.

to investigate whether infection with the clostridium leptum sp. can specifically induce tregs and/or tdcs bone marrow-derived dendritic cells were cultured in the presence or absence of c. leptum then co-cultured with cd4(+)cd25(-) t cells or not.'

fnl commented 9 years ago

Sadly, we can't have it both ways: we either avoid oversplitting on abbreviations or avoid undersplitting on poor or hard-to-deduce orthography. segtok chooses the former: it uses continuations (segtok/segmenter.py, lines 85ff) to detect words that typically continue a sentence and rarely occur at the beginning of one. "to" is such a case, as are a number of others.

So the issue you raise here implies that it might be interesting to investigate if it is worth it to shorten this list of continuations or not. In general, I have seen that using these continuations is very helpful, so it would be a mistake to remove them altogether. However, it would be great to find some empirical evidence if all continuations in use are indeed not frequently found at the beginning of sentences.

I would suggest to investigate something similar to the following snippet on a rather large corpus:

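import re

# CONTINUATIONS is assumed to be a list of the lower-case continuation words.
corpus = open("corpus.txt").read()

for cont in CONTINUATIONS:
    cont_cap = r"[" + cont[0] + cont[0].upper() + r"]" + cont[1:]
    # \b keeps e.g. "as" from also matching the start of "assume"
    starts = len(re.findall(r"\.\s+" + cont_cap + r"\b", corpus))
    inside = len(re.findall(r"[^.]\s+" + cont + r"\b", corpus))
    val = starts / inside if inside else float("nan")  # no division by zero
    print(val, cont)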

for each continuation in use (lines 85ff). Then, one would order them by this value, and see if there are any that have a ratio that is significantly larger than most others (kind of "outliers"). Those continuations then indeed could be good candidates for removal, as that means they probably (!) are more frequently used at the beginning of sentences than the others.

Note that given words follow Pareto distributions, it might be helpful to take logs of the above counts to more clearly spot outliers.
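A minimal sketch of the log-scale comparison suggested above. The counts are invented purely for illustration, and `log_ratios` is a hypothetical helper, not part of segtok; the add-one smoothing is likewise an assumption made only to avoid taking the log of zero.

```python
import math


def log_ratios(counts):
    """Return (log10 ratio, word) pairs, largest ratio first.

    counts maps a word to (n_after_period, n_elsewhere).
    Add-one smoothing avoids log(0) for unseen cases.
    """
    scored = []
    for word, (n_start, n_inside) in counts.items():
        ratio = (n_start + 1) / (n_inside + 1)
        scored.append((math.log10(ratio), word))
    return sorted(scored, reverse=True)


# Hypothetical counts, for illustration only.
example = {"but": (300, 700), "to": (90, 9900), "than": (1, 1800)}
for score, word in log_ratios(example):
    print(f"{score:6.2f} {word}")
```

On a log scale, words whose ratio is an order of magnitude above the rest stand out as a visible gap in the sorted list.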

Klim314 commented 9 years ago

Thanks, I'll take a look at it in a bit.

fnl commented 9 years ago

Ok, I've built a counting tool to look into the issue. The final REs I've used are slightly different from the snippet above. [d040eb6a675279c180]

If I count the tokens in the Brown corpus using the continuations used by segtok (at after and are as at between but by during for from has in into is nor of on or than that though to upon via was were whereas whether while with within) I get the following result, where Freq. SS is the fraction of occurrences in which the word was used as a sentence starter:

Freq. SS Total Word
0.297 4279 but
0.225 40 whereas
0.222 582 during
0.221 1070 after
0.176 193 nor
0.171 677 while
0.144 439 though
0.105 363 within
0.092 284 whether
0.072 7239 as
0.060 9590 for
0.056 18 via
0.046 6754 on
0.036 495 upon
0.035 7283 with
0.034 4376 from
0.031 6104 by
0.031 10657 that
0.027 29009 and
0.019 4234 or
0.015 728 between
0.012 144194 in
0.009 4376 are
0.009 41273 to
0.006 10090 is
0.004 36804 of
0.004 104562 at
0.003 9810 was
0.003 2434 has
0.003 1788 into
0.001 3284 were
0.000 1796 than

The top words in particular are good candidates to remove from the list of continuations. Overall, there seems to be strong evidence that "during", "after", and "but" were not good choices for continuations, because there is a more than 20% base chance that they get used as a sentence starter, and each is backed by more than 500 cases. However, the word that sparked this discussion, "to", clearly is a good continuation word (<1% base chance, >40k examples).
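As a rough illustration of the two criteria used here (a sentence-starter frequency above 20% and more than 500 observations), the following sketch filters a few rows copied from the Brown table. The function and parameter names are hypothetical and not part of the counting tool.

```python
# (freq_ss, total, word) rows copied from the Brown corpus table.
BROWN = [
    (0.297, 4279, "but"),
    (0.225, 40, "whereas"),
    (0.222, 582, "during"),
    (0.221, 1070, "after"),
    (0.176, 193, "nor"),
    (0.009, 41273, "to"),
]


def removal_candidates(rows, min_freq=0.20, min_total=500):
    """Words that start sentences often AND have enough observations."""
    return [
        word
        for freq, total, word in rows
        if freq > min_freq and total > min_total
    ]


print(removal_candidates(BROWN))
```

Note that "whereas" passes the frequency threshold but is filtered out because it only occurs 40 times in this corpus.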

I will look at this on a large scientific corpus (PMC OA), too, and see if that would change anything.

Klim314 commented 9 years ago

Huh, I'm actually surprised that certain terms like whereas showed up as sentence starters as frequently as they did.

(Sent by email in reply to Florian Leitner's comment of 17 Jul 2015, 21:08: https://github.com/fnl/segtok/issues/9#issuecomment-122270291)

fnl commented 9 years ago

OK, here are the numbers from the PMC OA subset, with far more evidence than the Brown corpus.

Freq. SS Total Word
0.235 1806900 while
0.171 5151523 after
0.163 315016 though
0.104 579718 upon
0.086 90593460 in
0.078 3595872 during
0.071 19132841 as
0.065 828406 whereas
0.062 35933256 for
0.048 1139268 whether
0.047 13139815 at
0.045 2253047 within
0.035 16130066 on
0.030 59080048 to
0.030 191333 nor
0.028 4087338 but
0.018 15989250 from
0.013 23146113 by
0.009 33763397 with
0.007 6624178 between
0.005 877936 via
0.004 23118239 that
0.003 25999120 is
0.002 14766534 are
0.002 140872211 of
0.001 4951922 has
0.001 110277270 and
0.000 4565321 than
0.000 3129831 into
0.000 23297602 was
0.000 22533646 were
0.000 13331906 or

"while", "after", and, interestingly, "though" seem good candidates here. "during" is still a good runner-up; "but", however, seems far less so. This could be a bias, as using "but" in scientific texts is sometimes considered poor style.

Overall, I suggest removing "while", "after", "though", and "but", plus possibly "during", from the current set of continuations. However, again, "to" is a no-fix: only 3% sentence-starter usage is probably too little to justify its removal.

One final test I want to make before removing any words is trying to see if we can measure sufficient amounts of usages of those words right after abbreviations (i.e., where the lower-case continuation is preceded by a dot). If there were, we could make an even more informed decision; if not, it probably is impossible to discern the cases where this was an orthographic error from the cases where it actually was used after an abbreviation.

fnl commented 9 years ago

In the Brown corpus, only "are", "for", and "and" are used in such cases (after abbreviations), and as those are not removal candidates, this is not very relevant. Let's see what I can get from PMC, which takes a bit longer...

fnl commented 9 years ago

Here we go again, the proportion of cases where a continuation was used after an abbreviation marker, like the example that raised this issue. These last numbers are pretty astonishing/sobering and indicate that a number of continuations possibly were a Bad Idea.

The first column is the fraction of cases where the word is used after an abbreviation marker, relative to its combined use after an abbreviation marker or a sentence end marker (i.e., column 2 divided by the sum of columns 2 and 3). In other words, the top words in this list are excellent continuation markers; the further down the list we get, the less so.

Likelih. N. abbrev. N. starters Word
0.807 18950 4536 was
0.732 2281 833 into
0.713 11689 4699 were
0.676 125059 60007 and
0.556 6776 5406 has
0.512 5887 5620 or
0.358 439 786 than
0.275 24804 65387 is
0.257 11022 31855 are
0.228 100770 341394 of
0.191 1118 4740 via
0.071 6242 81955 that
0.057 18510 304337 with
0.054 16472 290662 from
0.048 2263 44610 between
0.034 10653 303719 by
0.026 1446 54675 whether
0.018 105 5733 nor
0.017 10446 616634 at
0.016 9255 561735 on
0.014 1445 100729 within
0.012 1391 112907 but
0.011 25779 2230428 for
0.008 65529 7826016 in
0.007 2105 279543 during
0.007 12926 1793203 to
0.005 6544 1365077 as
0.004 3416 880313 after
0.004 194 53599 whereas
0.003 179 60536 upon
0.001 558 424063 while
0.001 32 51213 though
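For reference, the first column can be recomputed from the other two as described before the table. This is only a sketch: `abbrev_likelihood` is a hypothetical name, and the three rows are copied from the table.

```python
def abbrev_likelihood(n_abbrev, n_starters):
    """Share of post-period uses that follow an abbreviation marker."""
    total = n_abbrev + n_starters
    return n_abbrev / total if total else 0.0


# (N. abbrev., N. starters) copied from the PMC table.
rows = {
    "was": (18950, 4536),
    "to": (12926, 1793203),
    "though": (32, 51213),
}
for word, (n_abbrev, n_starters) in rows.items():
    print(f"{abbrev_likelihood(n_abbrev, n_starters):.3f} {word}")
```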

I would say that everything below the 1% mark ("in", "during", "to", "as", "after", "whereas", "upon", "while", and "though") really must be removed from the continuation list, because these words are hardly ever used inside a sentence after an abbreviation marker compared to their use after a sentence end marker.

This largely coincides with (or rather, inverts) the lists reported earlier, and adds "in", "to", "as", and "upon" to the candidates for removal. With the exception of "to", all of these words are fine to remove a priori, as they are quite high up in the other list(s), too. Regarding "to", we have: 13k uses after an abbreviation, 1.8M uses as a sentence starter, and 57M uses inside sentences. According to these statistics, if we remove the word from the list of continuations, we will wrongly convert less than 1% of all "to"s found after a period into sentence starters (naively assuming all orthography is correct). That is, it was (likely correctly) used after an abbreviation in 1% of these cases, compared to the remaining 99% where it was (likely correctly) used as a sentence starter. So @Klim314, I hope you are happy: "to" made it off the list! :-)
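The figures quoted for "to" can be checked with a few lines of arithmetic. The counts are taken from the two PMC OA tables; treating the Total column as including both the after-abbreviation and the sentence-starter counts is an assumption.

```python
# Counts for "to" from the two PMC OA tables.
n_abbrev = 12_926       # lower-case "to" after an abbreviation marker
n_starters = 1_793_203  # "to" after a sentence end marker
n_total = 59_080_048    # all occurrences (assumed to include both of the above)

n_inside = n_total - n_abbrev - n_starters
wrong_splits = n_abbrev / (n_abbrev + n_starters)

print(f"inside sentences: {n_inside / 1e6:.1f}M")
print(f"post-period uses that follow an abbreviation: {wrong_splits:.1%}")
```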

What bothers me a bit is the long list of words between 10% and 1% ("that" through "for"). They are all at least 10 times more likely to be used at a sentence start than after an abbreviation. That means they might not warrant their use as continuation markers, but at least according to this (pretty large) corpus, there is a substantial number of cases where the lower-case use did indicate a (likely correct) use as a continuation. So my judgement call for now is to leave them in, but it would be nice if somebody could add a second opinion by running my count_continuations.py Python 3 script on a different corpus. Here is how to use the script, particularly if you have your corpus split over many small files:

find ~/work/corpora/pmc/* -name "*.txt" \
| xargs cat \
| ./count_continuations.py and are at after as at \
  between but by during for from has in into is nor \
  of on or than that though through to upon via \
  was were whereas whether while with within yet \
| sort -rn

If you have only one large file or a few small ones, just pipe it into count_continuations.py. I would really love to hear a second opinion and get another measurement from someone else, but for now, I will just remove the <1% abbreviation continuations after observing these "sobering" numbers.
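For readers who want to reproduce this kind of measurement without the script, here is a minimal sketch of the three counts per word. It is not the actual count_continuations.py, whose regular expressions are not shown in this thread; `count_uses` and its simple period-plus-whitespace heuristic are assumptions.

```python
import re
from collections import Counter


def count_uses(text, words):
    """Count, per word: lower-case uses after a period, capitalized uses
    after a period, and all other uses.

    Words at the very start of the text have no preceding period and are
    therefore counted as "inside".
    """
    after_abbrev, starters, inside = Counter(), Counter(), Counter()
    wanted = set(words)
    # Group 1 is set only if a period plus whitespace precedes the token.
    for match in re.finditer(r"(\.\s+)?\b([A-Za-z]+)\b", text):
        after_period, token = match.group(1), match.group(2)
        word = token.lower()
        if word not in wanted:
            continue
        if after_period is None:
            inside[word] += 1
        elif token[0].isupper():
            starters[word] += 1
        else:
            after_abbrev[word] += 1
    return after_abbrev, starters, inside


sample = (
    "infection with the clostridium leptum sp. can induce tregs. "
    "To test this, cells were added to the culture. to confirm it, "
    "we repeated the test."
)
abbrev, starts, within = count_uses(sample, ["to", "can"])
print(dict(abbrev), dict(starts), dict(within))
```

The last "to" in the sample shows the ambiguity discussed in this issue: a lower-case word after a period is counted as following an abbreviation, even though here it is really a badly capitalized sentence start.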

fnl commented 9 years ago

While I'm done with my long rant above, I've detected one problem: I forgot to count "through" and "yet". So I'm running the counting script once more on PMC with only those two words and will update the tables ASAP. Once I have those last two figures and can decide whether either of the two words can stay or needs to go, I will update segtok and publish a new minor version shortly.

Here is the current state of affairs, after figuring out a way to separate the words into valid and invalid continuations:

PMC OA corpus statistics

Words likely used as sentence starters (poor continuations, >10%):

Words hardly used after abbreviations vs. sentence starters (poor continuations, <2%):

Words hardly ever used as sentence starters (excellent continuations, <2%):

Words frequently used after abbreviations (excellent continuations, >10%):

Grey zone: undecidable words -> leave in to bias towards under-splitting

fnl commented 9 years ago

Here are the missing two words:

Freq.SS Likelih. N.abbrev. N.starters N.inside Word
0.019 0.042 1632 37067 1956701 through
0.140 0.001 31 51092 314745 yet

So "yet" is another candidate that will be dropped; "through", on the other hand, is OK.
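The decision for these two words can be sketched with the thresholds from the category list in the earlier comment (>10% and <2% on either measure). The function name, the labels, and in particular the choice to test the "poor" criteria before the "excellent" ones are assumptions, not necessarily how segtok v1.5.0 was decided.

```python
def classify(freq_ss, likelihood):
    """Label a continuation word; checking 'poor' first is an assumption."""
    # Often a sentence starter, or hardly ever seen after an abbreviation.
    if freq_ss > 0.10 or likelihood < 0.02:
        return "poor"
    # Hardly ever a sentence starter, or frequent after abbreviations.
    if freq_ss < 0.02 or likelihood > 0.10:
        return "excellent"
    # Undecidable: leave in to bias towards under-splitting.
    return "grey zone"


# (Freq.SS, Likelih.) copied from the table for the two missing words.
stats = {"through": (0.019, 0.042), "yet": (0.140, 0.001)}
for word, (freq_ss, likelihood) in stats.items():
    print(word, classify(freq_ss, likelihood))
```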

fnl commented 9 years ago

Here we go, v1.5.0 implements the improvement and is live on PyPI.