kosloot closed 6 years ago
At least in the Python library I think I have a retaintokenisation
parameter (which indeed defaults to False); I guess libfolia has the same.
For FoLiA-stats it makes sense to not detokenise by default, I'd say.
So @proycon suggests a --detokenize option, with the default being NOT to do that. Fine with me, but... that would break the current default of FoLiA-stats, which is to detokenize. I am unsure whether that would affect a lot of current work (I don't think so). @martinreynaert, please comment.
I am fine with proycon's suggestion too.
Ko, I do not actually see that the FoLiA-stats default is to detokenize. I have recently run it on a few hundred years of KB newspapers, so the following is definitely not a statistically valid sample...
reynaert@red:/reddata/TEST$ grep '<t class="OCR"' ddd.010131457.mpeg21.a0016.folia.xml |grep -C 2 'offset="5171"'
<t class="OCR" offset="5167">kan</t>
<t class="OCR" offset="5171">toevloeyen</t>
<t class="OCR" offset="5182">;</t>
<t class="OCR" offset="5184">behoeft</t>
Produces the following ngrams:
reynaert@red:/reddata/TEST$ cat /reddata/Nederlab/KBkranten/FOLIAnottarred/FRQ/FOLIAstats.KBkranten1861.wordfreqlist.3-gram.tsv |grep 'toevloeyen ;'
kan toevloeyen ; 1 38936848 81.0443
toevloeyen ; behoeft 1 43734946 91.0312
So: no sign of detokenization there!
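As a cross-check of the listing above, here is a minimal Python sketch (not the FoLiA-stats implementation; the `ngrams` helper is a hypothetical name). It slides a window of three over the four OCR strings from the grep output and yields the same two 3-grams that show up in the frequency list:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The four <t class="OCR"> strings from the grep output, in document order.
tokens = ["kan", "toevloeyen", ";", "behoeft"]

trigrams = ngrams(tokens, 3)
for gram in trigrams:
    print(gram)
```

The ";" stays a token of its own here, which is what one would expect if the strings are counted as they are stored, without joining them back into running text first.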
Fake News! That FoLiA is not tokenized (by ucto) at all! It doesn't even have \<s> or \<w> nodes. It DOES have \<str> nodes, which are counted by FoLiA-stats. These nodes seem to have some form of tokenization, probably (and luckily!) introduced in the original Alto files, but that is NOT guaranteed, I think. Bottom line: adding a --detokenize option to FoLiA-stats, and reversing the behavior, seems feasible. And easy to implement.
implemented.
When performing FoLiA-stats on a sentence like this:
the folia::text() function is used to extract the sentence text, delivering:
†Stomper (knoeier)
which FoLiA-stats sees as a bigram. Maybe this is NOT what was intended! A reasonable thing to do would be to keep the tokenization, giving the 5-gram:
† Stomper ( knoeier )
I suggest adding a parameter to FoLiA-stats:
--keep-tokens
that does so, if desired.
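To make the difference concrete, here is a rough Python sketch. It is not the libfolia or FoLiA-stats code. It assumes each token carries a flag saying whether a space follows it (similar in spirit to FoLiA's no-space marking); the `Token` class and the `detokenize` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    space_after: bool = True  # assumption: False means "glued to the next token"


def detokenize(tokens):
    """Join tokens, inserting a space only after tokens that allow one."""
    parts = []
    for tok in tokens:
        parts.append(tok.text)
        if tok.space_after:
            parts.append(" ")
    return "".join(parts).strip()


# The example sentence as five tokens, with assumed spacing flags.
sentence = [
    Token("†", space_after=False),
    Token("Stomper"),
    Token("(", space_after=False),
    Token("knoeier", space_after=False),
    Token(")"),
]

# Splitting the detokenized text on whitespace: two units, i.e. a bigram.
detokenized = detokenize(sentence).split()
# Keeping the tokens as stored: five units, i.e. a 5-gram.
kept = [tok.text for tok in sentence]

print(detokenized)
print(kept)
```

Under these assumptions the detokenized route sees two units and the token-preserving route sees five, which is the gap the proposed option is meant to close.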