Closed Klim314 closed 9 years ago
Sadly, we can't have it both ways: avoiding over-splitting on abbreviations and avoiding under-splitting on poor or hard-to-deduce orthography. segtok chooses the former: it uses a list of continuations (segtok/segmenter.py, lines 85ff), that is, words that typically continue a sentence and rarely occur at the beginning of one. "to" is such a case, as are a number of others.
So the issue you raise here suggests it is worth investigating whether this list of continuations should be shortened. In general, I have seen that using these continuations is very helpful, so it would be a mistake to remove them altogether. However, it would be great to find some empirical evidence on whether all continuations in use are indeed rarely found at the beginning of sentences.
I would suggest running something similar to the following snippet on a rather large corpus:
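import re
# Assumption: CONTINUATIONS is a plain list of the lower-case words to check.
CONTINUATIONS = ["and", "but", "to"]
corpus = open("corpus.txt").read()
for cont in CONTINUATIONS:
    # match the word with a lower- or upper-case first letter; \b avoids "to" matching "today"
    cont_cap = r"[" + cont[0] + cont[0].upper() + r"]" + cont[1:] + r"\b"
    after_dot = len(re.findall(r"\.\s+" + cont_cap, corpus))
    inside = len(re.findall(r"[^.]\s+" + cont + r"\b", corpus))
    # max() guards against a word that never occurs inside a sentence
    print(after_dot / max(inside, 1), cont)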
for each continuation in use (lines 85ff). Then, one would order them by this value, and see if there are any that have a ratio that is significantly larger than most others (kind of "outliers"). Those continuations then indeed could be good candidates for removal, as that means they probably (!) are more frequently used at the beginning of sentences than the others.
Note that, because word frequencies follow a Pareto distribution, it might be helpful to take logs of the above counts to spot outliers more clearly.
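A minimal sketch of that log-scaled ordering, assuming the two counts per word have already been collected, for example with a loop like the one above. The function name, the add-one smoothing, and the toy counts are illustrative assumptions, not part of segtok:

```python
import math

def rank_by_log_ratio(counts):
    """Return (log10 ratio, word) pairs, largest ratio first.

    counts maps a word to (n_after_dot, n_inside); both numbers are
    hypothetical tallies supplied by the caller.
    """
    ranked = []
    for word, (n_after_dot, n_inside) in counts.items():
        # add-one smoothing keeps the logarithm defined when a count is zero
        ratio = (n_after_dot + 1) / (n_inside + 1)
        ranked.append((math.log10(ratio), word))
    return sorted(ranked, reverse=True)

# made-up counts, for illustration only
toy = {"but": (300, 700), "to": (90, 9910), "than": (0, 1800)}
for log_ratio, word in rank_by_log_ratio(toy):
    print(f"{log_ratio:6.2f}  {word}")
```

Words whose log ratio sits well above the rest of the list would be the "outliers" worth a closer look.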
Thanks, I'll take a look at it in a bit.
Ok, I've built a counting tool to look into the issue. The final REs I've used are slightly different. [d040eb6a675279c180]
If I count the tokens in the Brown corpus using the continuations used by segtok (at after and are as at between but by during for from has in into is nor of on or than that though to upon via was were whereas whether while with within) I get the following result, where Freq. SS is the fraction of times the word was used as a sentence starter:
Freq. SS | Total | Word |
---|---|---|
0.297 | 4279 | but |
0.225 | 40 | whereas |
0.222 | 582 | during |
0.221 | 1070 | after |
0.176 | 193 | nor |
0.171 | 677 | while |
0.144 | 439 | though |
0.105 | 363 | within |
0.092 | 284 | whether |
0.072 | 7239 | as |
0.060 | 9590 | for |
0.056 | 18 | via |
0.046 | 6754 | on |
0.036 | 495 | upon |
0.035 | 7283 | with |
0.034 | 4376 | from |
0.031 | 6104 | by |
0.031 | 10657 | that |
0.027 | 29009 | and |
0.019 | 4234 | or |
0.015 | 728 | between |
0.012 | 144194 | in |
0.009 | 4376 | are |
0.009 | 41273 | to |
0.006 | 10090 | is |
0.004 | 36804 | of |
0.004 | 104562 | at |
0.003 | 9810 | was |
0.003 | 2434 | has |
0.003 | 1788 | into |
0.001 | 3284 | were |
0.000 | 1796 | than |
The top words in particular are good candidates to remove from the list of continuations. Overall, "during", "after", and "but" were not good choices for continuations, because there is a base chance of more than 20% that they are used as a sentence starter, and the evidence for this is relatively strong (more than 500 cases each). However, the word that sparked this discussion, "to", clearly is a good continuation word (<1% base chance, >40k examples).
I will look at this on a large scientific corpus (PMC OA), too, and see if that would change anything.
Huh, I'm actually surprised that certain terms like whereas showed up as sentence starters as frequently as they did.
OK, here are the numbers from the PMC OA subset, with far more evidence than the Brown corpus.
Freq. SS | Total | Word |
---|---|---|
0.235 | 1806900 | while |
0.171 | 5151523 | after |
0.163 | 315016 | though |
0.104 | 579718 | upon |
0.086 | 90593460 | in |
0.078 | 3595872 | during |
0.071 | 19132841 | as |
0.065 | 828406 | whereas |
0.062 | 35933256 | for |
0.048 | 1139268 | whether |
0.047 | 13139815 | at |
0.045 | 2253047 | within |
0.035 | 16130066 | on |
0.030 | 59080048 | to |
0.030 | 191333 | nor |
0.028 | 4087338 | but |
0.018 | 15989250 | from |
0.013 | 23146113 | by |
0.009 | 33763397 | with |
0.007 | 6624178 | between |
0.005 | 877936 | via |
0.004 | 23118239 | that |
0.003 | 25999120 | is |
0.002 | 14766534 | are |
0.002 | 140872211 | of |
0.001 | 4951922 | has |
0.001 | 110277270 | and |
0.000 | 4565321 | than |
0.000 | 3129831 | into |
0.000 | 23297602 | was |
0.000 | 22533646 | were |
0.000 | 13331906 | or |
"while", "after", and, interestingly, "though" seem good candidates here. "during" is still a good runner-up; "but", however, seems far less so. This could be a bias, as using "but" in scientific texts is sometimes considered poor style.
Overall, I suggest removing "while", "after", "though", and "but", plus possibly "during", from the current set of continuations. However, again, "to" is a no-fix: only 3% sentence-starter usage is probably too little to justify its removal.
One final test I want to make before removing any words is to see if we can find enough uses of those words right after abbreviations (i.e., where the lower-case continuation is preceded by a dot). If we can, we could make an even more informed decision; if not, it is probably impossible to discern the cases where this was an orthographic error from the cases where the word actually was used after an abbreviation.
In the Brown corpus, only "are", "for", and "and" are used in such cases (after abbreviations), and as those are not removal candidates, this is not very relevant. Let's see what I can get from PMC, which takes a bit longer...
Here we go again: the proportion of cases where a continuation was used after an abbreviation marker, like the example that raised this issue. These last numbers are pretty astonishing/sobering and indicate that a number of continuations possibly were a Bad Idea.
The first column measures the fraction of cases where the word is used after an abbreviation marker over its combined use after a sentence end marker (i.e., column 2 divided by column 2+3). In other words, the top words in this list are excellent continuation markers; the further down the list we get, the less so.
Likelih. | N. abbrev. | N. starters | Word |
---|---|---|---|
0.807 | 18950 | 4536 | was |
0.732 | 2281 | 833 | into |
0.713 | 11689 | 4699 | were |
0.676 | 125059 | 60007 | and |
0.556 | 6776 | 5406 | has |
0.512 | 5887 | 5620 | or |
0.358 | 439 | 786 | than |
0.275 | 24804 | 65387 | is |
0.257 | 11022 | 31855 | are |
0.228 | 100770 | 341394 | of |
0.191 | 1118 | 4740 | via |
0.071 | 6242 | 81955 | that |
0.057 | 18510 | 304337 | with |
0.054 | 16472 | 290662 | from |
0.048 | 2263 | 44610 | between |
0.034 | 10653 | 303719 | by |
0.026 | 1446 | 54675 | whether |
0.018 | 105 | 5733 | nor |
0.017 | 10446 | 616634 | at |
0.016 | 9255 | 561735 | on |
0.014 | 1445 | 100729 | within |
0.012 | 1391 | 112907 | but |
0.011 | 25779 | 2230428 | for |
0.008 | 65529 | 7826016 | in |
0.007 | 2105 | 279543 | during |
0.007 | 12926 | 1793203 | to |
0.005 | 6544 | 1365077 | as |
0.004 | 3416 | 880313 | after |
0.004 | 194 | 53599 | whereas |
0.003 | 179 | 60536 | upon |
0.001 | 558 | 424063 | while |
0.001 | 32 | 51213 | though |
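As a quick check of how the first column follows from the other two, here is a small sketch that recomputes it for a few rows copied from the table above. The function name is an illustrative choice:

```python
def abbrev_likelihood(n_abbrev, n_starters):
    """Share of after-dot occurrences that follow an abbreviation."""
    return n_abbrev / (n_abbrev + n_starters)

# (N. abbrev., N. starters) copied from four rows of the table above
rows = {
    "was": (18950, 4536),
    "of": (100770, 341394),
    "to": (12926, 1793203),
    "though": (32, 51213),
}
for word, (n_abbrev, n_starters) in rows.items():
    print(f"{abbrev_likelihood(n_abbrev, n_starters):.3f}  {word}")
```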
I would say that all words below the 1% mark ("in", "during", "to", "as", "after", "whereas", "upon", "while", and "though") are candidates that really must be removed from the continuation list, because they are hardly ever used inside a sentence after an abbreviation marker in comparison to their use after a sentence end marker.
This largely coincides with/inverts the lists reported earlier, and adds "in", "to", "as", and "upon" to the candidates for removal. With the exception of "to", none of these removals is surprising, as the words are quite high up in the other list(s), too. Regarding "to", we have: 13k times used after an abbreviation, 1.8M times as a sentence starter, and 57M times inside sentences. According to the statistic, if we remove the word from the list of continuations, we will wrongly convert less than 1% of all "to"s that follow a dot into sentence starters (naively assuming all orthography is correct). That is, it was (likely correctly) used after an abbreviation in 1% of those cases, in comparison to the remaining 99% where it was (likely correctly) used as a sentence starter. So @Klim314, I hope you are happy - "to" made it off the list! :-)
What bothers me a bit is the long list of words between 10 and 1% ("that" through "for"). They are all at least 10 times more likely to be used at a sentence start than after an abbreviation. That means they might not warrant their use as continuation markers, but at least according to this (pretty large) corpus, there are a substantial number of cases where the lower-case use did indicate a (likely correct) use as a continuation. So my judgement call for now is to leave them in, but it would be nice if somebody could add a second opinion by running my count_continuations.py Python3 script on a different corpus. Here is how to use the script, particularly if you have your corpus split over many small files:
find ~/work/corpora/pmc/* -name "*.txt" \
| xargs cat \
| ./count_continuations.py and are at after as at \
between but by during for from has in into is nor \
of on or than that though through to upon via \
was were whereas whether while with within yet \
| sort -rn
If you have only one large file or a few small ones, just pipe it into count_continuations.py.
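For readers who want to see what kind of tally is involved, here is a rough sketch of the three counts discussed in this thread. It is not the actual count_continuations.py: the function name, the regular expressions, and the sample text are assumptions made for illustration, and only a dot is treated as a sentence end marker.

```python
import re

def count_usages(text, words):
    """Tally three kinds of occurrences for each word.

    Returns {word: (n_abbrev, n_starters, n_inside)}:
      n_abbrev    lower-case word right after a dot ("e.g. to")
      n_starters  capitalized word right after a dot (". To")
      n_inside    lower-case word not preceded by a dot
    """
    result = {}
    for word in words:
        # \b stops "to" from matching the start of "today"
        lower = re.escape(word) + r"\b"
        upper = re.escape(word.capitalize()) + r"\b"
        n_abbrev = len(re.findall(r"\.\s+" + lower, text))
        n_starters = len(re.findall(r"\.\s+" + upper, text))
        n_inside = len(re.findall(r"[^.\s]\s+" + lower, text))
        result[word] = (n_abbrev, n_starters, n_inside)
    return result

# made-up sample text, for illustration only
sample = "He went e.g. to Rome. To be fair, he went to Paris. Today is fine."
for word, counts in count_usages(sample, ["to"]).items():
    print(*counts, word, sep="\t")
```

To use something like this in a pipeline as shown above, one could pass `sys.stdin.read()` as the text and `sys.argv[1:]` as the word list.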
I would really love to hear a second opinion and get another measurement from someone else, but for now, I will just remove the <1% abbreviation continuations after observing these "sobering" numbers.
While I'm done with my long rant above, I've detected one problem: I forgot to count "through" and "yet". So I'm running the counting script once more on PMC with only those two words and will update the tables ASAP. Once I get those last two figures and can decide if either of the two words can stay or needs to go, I will update segtok and publish a new minor version shortly.
Here is the current state of affairs, after figuring out a way to separate the words into valid and invalid continuations:
Words likely used as sentence starters (poor continuations, >10%):
Words hardly used after abbreviations vs. sentence starters (poor continuations, <2%):
Words hardly ever used as sentence starters (excellent continuations, <2%):
Words frequently used after abbreviations (excellent continuations, >10%):
Grey zone: undecidable words -> leave in to bias towards under-splitting
Here are the missing two words:
Freq.SS | Likelih. | N.abbrev. | N.starters | N.inside | Word |
---|---|---|---|---|---|
0.019 | 0.042 | 1632 | 37067 | 1956701 | through |
0.140 | 0.001 | 31 | 51092 | 314745 | yet |
So "yet" is another candidate that will be dropped; "through", on the other hand, is OK.
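The two derived columns and the resulting verdict can be recomputed from the three raw counts. The sketch below assumes that Freq.SS is N.starters divided by the sum of all three counts (an inference that reproduces the reported values) and applies the 1% cut-off on the abbreviation likelihood used earlier in this thread; the names are illustrative:

```python
def continuation_stats(n_abbrev, n_starters, n_inside):
    """Return (freq_ss, likelihood) for one word."""
    total = n_abbrev + n_starters + n_inside
    # assumed definition: share of all occurrences that start a sentence
    freq_ss = n_starters / total
    # share of after-dot occurrences that follow an abbreviation
    likelihood = n_abbrev / (n_abbrev + n_starters)
    return freq_ss, likelihood

MIN_LIKELIHOOD = 0.01  # the 1% cut-off discussed above

# (N.abbrev., N.starters, N.inside) copied from the table above
counts = {"through": (1632, 37067, 1956701), "yet": (31, 51092, 314745)}
for word, c in counts.items():
    freq_ss, likelihood = continuation_stats(*c)
    verdict = "keep" if likelihood >= MIN_LIKELIHOOD else "drop"
    print(f"{freq_ss:.3f}  {likelihood:.3f}  {verdict}  {word}")
```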
Here we go, v1.5.0 implements the improvement and is live on PyPI.
Noticed a case of a missed split of a seemingly simple structure.
The above text should have been split as follows