Failure to split on abbreviations

Klim314 commented 9 years ago

Noticed a case of a missed split of a seemingly simple structure.

'colonic colonization of clostridium spp. is associated with accumulation of tregs, which inhibits development of inflammatory lesions. to investigate whether infection with the clostridium leptum sp. can specifically induce tregs and/or tdcs bone marrow-derived dendritic cells were cultured in the presence or absence of c. leptum then co-cultured with cd4(+)cd25(-) t cells or not.'

The above text should have been split as follows

'colonic colonization of clostridium spp. is associated with accumulation of tregs, which inhibits development of inflammatory lesions.

to investigate whether infection with the clostridium leptum sp. can specifically induce tregs and/or tdcs bone marrow-derived dendritic cells were cultured in the presence or absence of c. leptum then co-cultured with cd4(+)cd25(-) t cells or not.'

fnl commented 9 years ago

Sadly, we can't have it both ways: avoiding oversplitting on abbreviations and avoiding undersplitting on poor or hard-to-deduce orthography. segtok chooses the former: it uses a list of continuations (segtok/segmenter.py, lines 85ff), i.e., words that typically continue a sentence and rarely occur at the beginning of one. "to" is such a case, as are a number of others.

So the issue you raise here implies that it might be worth investigating whether to shorten this list of continuations. In general, I have seen that using these continuations is very helpful, so it would be a mistake to remove them altogether. However, it would be great to have some empirical evidence on whether all continuations in use are indeed rarely found at the beginning of sentences.

I would suggest running something similar to the following snippet on a rather large corpus:

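import re

# Placeholder list for illustration; the real continuations are in segtok/segmenter.py.
CONTINUATIONS = ["to", "of", "but"]

with open("corpus.txt") as handle:
    corpus = handle.read()

for cont in CONTINUATIONS:
    # match the word in lower case or capitalized, e.g. "[tT]o"
    cont_cap = "[" + cont[0] + cont[0].upper() + "]" + cont[1:]
    starts = len(re.findall(r"\.\s+" + cont_cap + r"\b", corpus))
    inside = len(re.findall(r"[^.\s]\s+" + cont + r"\b", corpus))
    if inside:  # skip words absent from the corpus (avoids division by zero)
        print(starts / inside, cont)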

for each continuation in use (lines 85ff). Then, one would order them by this value and see whether any have a ratio that is significantly larger than most others (kind of "outliers"). Those continuations could then be good candidates for removal, as they probably (!) are used at the beginning of sentences more frequently than the others.

Note that, given that word frequencies follow Pareto-like distributions, it might be helpful to take logs of the above counts to spot outliers more clearly.

Klim314 commented 9 years ago

Thanks, I'll take a look at it in a bit.

fnl commented 9 years ago

OK, I've built a counting tool to look into the issue; the final REs I've used are slightly different. [d040eb6a675279c180]

If I count the tokens in the Brown corpus using the continuations used by segtok (after and are as at between but by during for from has in into is nor of on or than that though to upon via was were whereas whether while with within) I get the following result, where Freq. SS is the fraction of times the word was used as a sentence starter:

Freq. SS	Total	Word
0.297	4279	but
0.225	40	whereas
0.222	582	during
0.221	1070	after
0.176	193	nor
0.171	677	while
0.144	439	though
0.105	363	within
0.092	284	whether
0.072	7239	as
0.060	9590	for
0.056	18	via
0.046	6754	on
0.036	495	upon
0.035	7283	with
0.034	4376	from
0.031	6104	by
0.031	10657	that
0.027	29009	and
0.019	4234	or
0.015	728	between
0.012	144194	in
0.009	4376	are
0.009	41273	to
0.006	10090	is
0.004	36804	of
0.004	104562	at
0.003	9810	was
0.003	2434	has
0.003	1788	into
0.001	3284	were
0.000	1796	than

The top words in particular are good candidates to remove from the list of continuations. Overall, there seems to be strong evidence that during, after, and but were not good choices for continuations: each has a base chance of more than 20% of being used as a sentence starter, backed by more than 500 cases each. However, the word that sparked this discussion - to - clearly is a good continuation word (<1% base chance, >40k examples).

I will look at this on a large scientific corpus (PMC OA), too, and see if that would change anything.

Klim314 commented 9 years ago

Huh, I'm actually surprised that certain terms like whereas showed up as sentence starters as frequently as they did.

fnl commented 9 years ago

OK, here are the numbers from the PMC OA subset, with far more evidence than the Brown corpus.

Freq. SS	Total	Word
0.235	1806900	while
0.171	5151523	after
0.163	315016	though
0.104	579718	upon
0.086	90593460	in
0.078	3595872	during
0.071	19132841	as
0.065	828406	whereas
0.062	35933256	for
0.048	1139268	whether
0.047	13139815	at
0.045	2253047	within
0.035	16130066	on
0.030	59080048	to
0.030	191333	nor
0.028	4087338	but
0.018	15989250	from
0.013	23146113	by
0.009	33763397	with
0.007	6624178	between
0.005	877936	via
0.004	23118239	that
0.003	25999120	is
0.002	14766534	are
0.002	140872211	of
0.001	4951922	has
0.001	110277270	and
0.000	4565321	than
0.000	3129831	into
0.000	23297602	was
0.000	22533646	were
0.000	13331906	or

"while", "after", and, interestingly, "though" seem good candidates here. "during" is still a good runner-up; "but", however, seems far less so. This could be a bias, as using "but" in scientific texts is sometimes considered poor style.

Overall, I suggest removing "while", "after", "though", and "but", plus possibly "during", from the current set of continuations. However, again, "to" is a no-fix: only 3% sentence-starter usage is probably too little to justify its removal.

One final test I want to make before removing any words is to see whether we can measure a sufficient number of uses of those words right after abbreviations (i.e., where the lower-case continuation is preceded by a dot). If so, we could make an even more informed decision; if not, it is probably impossible to tell the cases where this was an orthographic error from the cases where the word actually was used after an abbreviation.

fnl commented 9 years ago

In the Brown corpus, only "are", "for", and "and" are used in such cases (after abbreviations), and as those are not removal candidates, this is not very relevant. Let's see what I can get from PMC, which takes a bit longer...

fnl commented 9 years ago

Here we go again: the proportion of cases where a continuation was used after an abbreviation marker, as in the example that raised this issue. These last numbers are pretty astonishing/sobering and indicate that a number of continuations possibly were a Bad Idea.

The first column is the fraction of cases where the word is used after an abbreviation marker, relative to its combined use after an abbreviation marker or a sentence end marker (i.e., column 2 divided by the sum of columns 2 and 3). In other words, the top words in this list are excellent continuation markers; the further down the list we get, the less so.

Likelih.	N. abbrev.	N. starters	Word
0.807	18950	4536	was
0.732	2281	833	into
0.713	11689	4699	were
0.676	125059	60007	and
0.556	6776	5406	has
0.512	5887	5620	or
0.358	439	786	than
0.275	24804	65387	is
0.257	11022	31855	are
0.228	100770	341394	of
0.191	1118	4740	via
0.071	6242	81955	that
0.057	18510	304337	with
0.054	16472	290662	from
0.048	2263	44610	between
0.034	10653	303719	by
0.026	1446	54675	whether
0.018	105	5733	nor
0.017	10446	616634	at
0.016	9255	561735	on
0.014	1445	100729	within
0.012	1391	112907	but
0.011	25779	2230428	for
0.008	65529	7826016	in
0.007	2105	279543	during
0.007	12926	1793203	to
0.005	6544	1365077	as
0.004	3416	880313	after
0.004	194	53599	whereas
0.003	179	60536	upon
0.001	558	424063	while
0.001	32	51213	though
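To make the first column concrete, here is a minimal sketch that recomputes it from the two count columns for three rows copied from the table above. The function name `abbrev_likelihood` is made up for illustration and is not taken from the counting script.

```python
def abbrev_likelihood(n_abbrev, n_starters):
    """Column 2 divided by the sum of columns 2 and 3."""
    return n_abbrev / (n_abbrev + n_starters)


# (N. abbrev., N. starters) copied from the PMC OA table above
rows = {
    "was": (18950, 4536),
    "to": (12926, 1793203),
    "though": (32, 51213),
}

for word, (n_abbrev, n_starters) in rows.items():
    print(f"{abbrev_likelihood(n_abbrev, n_starters):.3f}\t{word}")
```

Rounded to three digits, these reproduce the values listed for was, to, and though.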

I would say that all words below the 1% mark (in, during, to, as, after, whereas, upon, while, and though) are candidates that really must be removed from the continuation list, because they are hardly ever used inside a sentence after an abbreviation marker compared to their use after a sentence end marker.

This largely coincides with/inverts the lists reported earlier, and adds in, to, as, and upon to the candidates for removal. With the exception of to, all these words are fine to remove a priori, as they are quite high up in the other list(s), too. Regarding to, we have: 13k uses after an abbreviation, 1.8M uses as a sentence starter, and 57M uses inside sentences. According to these statistics, if we remove the word from the list of continuations, we will wrongly convert less than 1% of all "tos" into sentence starters (naively assuming all orthography is correct). That is, it was (likely correctly) used after an abbreviation in about 1% of all cases, compared to the remaining 99% where it was (likely correctly) used as a sentence starter. So @Klim314, I hope you are happy - to made it off the list! :-)

What bothers me a bit is the long list of words between 10% and 1% (from that down to for). They are all at least 10 times more likely to be used at a sentence start than after an abbreviation. That means they might not warrant their use as continuation markers, but at least according to this (pretty large) corpus, there is a substantial number of cases where the lower-case use did indicate a (likely correct) use as a continuation. So my judgement call for now is to leave them in, but it would be nice if somebody could add a second opinion by running my count_continuations.py Python 3 script on a different corpus. Here is how to use the script, particularly if you have your corpus split over many small files:

find ~/work/corpora/pmc/* -name "*.txt" \
| xargs cat \
| ./count_continuations.py and are at after as at \
  between but by during for from has in into is nor \
  of on or than that though through to upon via \
  was were whereas whether while with within yet \
| sort -rn

If you have only one large file or a few small ones, just pipe it into count_continuations.py. I would really love to hear a second opinion and get another measurement from someone else, but for now, I will just remove the <1% abbreviation continuations after observing these "sobering" numbers.
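For anyone who wants a rough cross-check without the script, the three-way count can be sketched as below. This is not count_continuations.py: the function name `count_usage` and the regular expression are assumptions, and the sketch only treats a dot as a marker (it ignores "?" and "!", and counts a word at the very start of the text as "inside").

```python
import re
from collections import Counter


def count_usage(text, words):
    """Per word, count lower-case uses after a dot ("abbrev"),
    capitalized uses after a dot ("starter"), and all other uses ("inside")."""
    counts = {word: Counter() for word in words}
    alternation = "|".join(re.escape(word) for word in words)
    # optional "dot plus whitespace" prefix, then one of the words as a whole token
    pattern = re.compile(r"(\.\s+)?\b(" + alternation + r")\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        after_dot, token = match.group(1), match.group(2)
        word = token.lower()
        if after_dot is None:
            counts[word]["inside"] += 1
        elif token[0].isupper():
            counts[word]["starter"] += 1
        else:
            counts[word]["abbrev"] += 1
    return counts


sample = "We went to sp. to see. To be sure, into the lab."
usage = count_usage(sample, ["to", "into", "in"])
print(dict(usage["to"]))
```

The whole-token match keeps "into" from being counted as "in" or "to", which matters because several continuations are prefixes or suffixes of others.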

fnl commented 9 years ago

While I'm done with my long rant above, I've detected one problem: I forgot to count through and yet. So I'm running the counting script once more on PMC with only those two words and will update the tables ASAP. Once I get those last two figures and can decide if either of the two words can stay or needs to go, I will update segtok and publish a new minor version shortly.

Here is the current state of affairs, after figuring out how to separate the words into valid and invalid continuations:

PMC OA corpus statistics

Words likely used as sentence starters (poor continuations, >10%):

after, though, upon, while, yet

Words hardly used after abbreviations vs. sentence starters (poor continuations, <2%):

[after], as, at, but, during, for, in, nor, on, to, [though], [upon], whereas, [while], within

Words hardly ever used as sentence starters (excellent continuations, <2%):

and, are, between, by, from, has, into, is, of, or, that, than, through, via, was, were, with

Words frequently used after abbreviations (excellent continuations, >10%):

[and, are, has, into, is, of, or, than, via, was, were]

Grey zone: undecidable words -> leave in to bias towards under-splitting

whether
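One way to read the four lists above is as two thresholds applied to the two PMC statistics. The sketch below is only my reading of the summary, not code from segtok; the function name `classify` and the order of the checks are assumptions, and the sample values are copied from the PMC OA tables.

```python
def classify(freq_ss, likelihood):
    """Apply the >10% / <2% cut-offs from the summary above.

    freq_ss: fraction of uses as a sentence starter (Freq. SS)
    likelihood: fraction of after-dot uses that follow an abbreviation (Likelih.)
    """
    if freq_ss > 0.10 or likelihood < 0.02:
        return "poor"
    if freq_ss < 0.02 or likelihood > 0.10:
        return "excellent"
    return "grey"


# (Freq. SS, Likelih.) copied from the PMC OA tables
pmc = {
    "while": (0.235, 0.001),
    "to": (0.030, 0.007),
    "whether": (0.048, 0.026),
    "from": (0.018, 0.054),
    "and": (0.001, 0.676),
}

for word, (freq_ss, likelihood) in pmc.items():
    print(word, classify(freq_ss, likelihood))
```

With these cut-offs, whether falls between all four thresholds, which is why it ends up in the grey zone.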

fnl commented 9 years ago

Here are the missing two words:

Freq.SS	Likelih.	N.abbrev.	N.starters	N.inside	Word
0.019	0.042	1632	37067	1956701	through
0.140	0.001	31	51092	314745	yet

So yet is another candidate that will be dropped; through, on the other hand, is OK.
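Since this last table lists all three counts, Freq.SS can be recomputed from them. The sketch below assumes the total is simply the sum of the three count columns; that assumption is mine, but it reproduces the printed values for both rows. The function name `starter_frequency` is made up for illustration.

```python
def starter_frequency(n_abbrev, n_starters, n_inside):
    """Fraction of all uses that are sentence starters.

    Assumes the total is the sum of the three count columns.
    """
    return n_starters / (n_abbrev + n_starters + n_inside)


# (N.abbrev., N.starters, N.inside) copied from the table above
rows = {
    "through": (1632, 37067, 1956701),
    "yet": (31, 51092, 314745),
}

for word, counts in rows.items():
    print(f"{starter_frequency(*counts):.3f}\t{word}")
```

Only yet lands above the 10% sentence-starter mark, in line with dropping it and keeping through.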

fnl commented 9 years ago

Here we go, v1.5.0 implements the improvement and is live on PyPI.

fnl / segtok

Failure to split on abbreviations #9