FoLiA-stats discards tokenization

LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)

https://proycon.github.io/folia

GNU General Public License v3.0

FoLiA-stats discards tokenization #26

Closed kosloot closed 6 years ago

kosloot commented 6 years ago

When performing FoLiA-stats on a sentence like this:

      <s xml:id="VanGinniken.p.1.s.487">
         <w xml:id="VanGinniken.p.1.s.487.w.1" class="PUNCTUATION" space="no">
           <t>†</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.2" class="WORD">
           <t>Stomper</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.3" class="PUNCTUATION" space="no">
           <t>(</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.4" class="WORD" space="no">
           <t>knoeier</t>
         </w>
         <w xml:id="VanGinniken.p.1.s.487.w.5" class="WORD">
           <t>)</t>
         </w>
      </s>

FoLiA-stats uses the folia::text() function to extract the sentence text, which delivers: †Stomper (knoeier). Split on whitespace, that is two tokens, so FoLiA-stats sees it as a bigram.

Maybe this is NOT what was intended! A reasonable thing to do would be to keep the tokenization, giving the 5-gram: † Stomper ( knoeier )

I suggest adding a parameter to FoLiA-stats, --keep-tokens, that does so if desired.
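
To make the difference concrete, here is a minimal Python sketch (not FoLiA-stats or libfolia code) that reads the sentence above with the standard library's xml.etree.ElementTree and builds both readings. It assumes a closing </s> tag and that a word without a space attribute is followed by a space; the helper names words, detokenized and tokenized are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sentence from this issue, with an assumed closing </s> tag.
SENTENCE = """<s xml:id="VanGinniken.p.1.s.487">
  <w xml:id="VanGinniken.p.1.s.487.w.1" class="PUNCTUATION" space="no"><t>†</t></w>
  <w xml:id="VanGinniken.p.1.s.487.w.2" class="WORD"><t>Stomper</t></w>
  <w xml:id="VanGinniken.p.1.s.487.w.3" class="PUNCTUATION" space="no"><t>(</t></w>
  <w xml:id="VanGinniken.p.1.s.487.w.4" class="WORD" space="no"><t>knoeier</t></w>
  <w xml:id="VanGinniken.p.1.s.487.w.5" class="WORD"><t>)</t></w>
</s>"""


def words(sentence):
    """Return (text, followed_by_space) pairs for every <w> element."""
    # Assumption: a missing space attribute means "followed by a space".
    return [(w.findtext("t"), w.get("space", "yes") != "no")
            for w in sentence.iter("w")]


def detokenized(pairs):
    """Join words and honour space="no", like the text described above."""
    return "".join(text + (" " if space else "") for text, space in pairs).strip()


def tokenized(pairs):
    """Keep every <w> as a separate token, ignoring the space attribute."""
    return " ".join(text for text, _ in pairs)


pairs = words(ET.fromstring(SENTENCE))
for label, text in (("detokenized", detokenized(pairs)),
                    ("tokenized", tokenized(pairs))):
    # Count whitespace-separated tokens in each reading.
    print(f"{label}: {text!r} has {len(text.split())} tokens")
```

The detokenized reading splits into two whitespace-separated tokens, the tokenized reading into five, which is the bigram versus 5-gram difference described above.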

proycon commented 6 years ago

At least in the Python library, I think I have a retaintokenisation parameter (which indeed defaults to False); I guess libfolia has the same.

For FoLiA-stats, I'd say it makes sense not to detokenise by default.

kosloot commented 6 years ago

So @proycon suggests a --detokenize option, with the default being to NOT detokenize. Fine with me, but that would break the current default of FoLiA-stats, which is to detokenize. I am unsure whether that would affect a lot of current work (I don't think so). @martinreynaert, please comment.

martinreynaert commented 6 years ago

I am fine with proycon's suggestion too.

Ko, I do not actually see that the FoLiA-stats default is to detokenize. I have recently run it on a few hundred years of KB newspapers, so the following is definitely not a statistically valid sample...

reynaert@red:/reddata/TEST$ grep '<t class="OCR"' ddd.010131457.mpeg21.a0016.folia.xml |grep -C 2 'offset="5171"'

hand.

    <t class="OCR" offset="5167">kan</t>
    <t class="OCR" offset="5171">toevloeyen</t>
    <t class="OCR" offset="5182">;</t>
    <t class="OCR" offset="5184">behoeft</t>

Produces the following ngrams:

reynaert@red:/reddata/TEST$ cat /reddata/Nederlab/KBkranten/FOLIAnottarred/FRQ/FOLIAstats.KBkranten1861.wordfreqlist.3-gram.tsv |grep 'toevloeyen ;'
kan toevloeyen ; 1 38936848 81.0443
toevloeyen ; behoeft 1 43734946 91.0312

So: no sign of detokenization there!
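
For illustration, here is a minimal Python sketch of how word trigrams over the five text items in the grep output above include the two n-grams found in the frequency list. This is an assumption about the counting step, not the FoLiA-stats implementation, and the ngrams helper is hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The five text items shown in the grep output, in document order.
tokens = ["hand.", "kan", "toevloeyen", ";", "behoeft"]

trigrams = Counter(ngrams(tokens, 3))
for gram, count in trigrams.items():
    print(gram, count)
```

Because ";" is a separate item in the input, it also shows up as a separate token in the trigrams.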

kosloot commented 6 years ago

Fake news! That FoLiA is not tokenized (by ucto) at all! It doesn't even have <s> or <w> nodes. It DOES have <str> nodes, which FoLiA-stats counts. These nodes seem to carry some form of tokenization, probably (and luckily!) introduced in the original ALTO files, but that is NOT guaranteed, I think. Bottom line: adding a --detokenize option to FoLiA-stats, and reversing the default behavior, seems feasible and easy to implement.
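
A rough Python sketch of what the reversed default could look like: tokens are kept unless --detokenize is given. FoLiA-stats itself is written in C++, so this only illustrates the proposed behavior; the argparse setup, the sentence_text helper and the PAIRS data are assumptions made for the example.

```python
import argparse

# (text, followed_by_space) pairs for the example sentence from this issue.
PAIRS = [("†", False), ("Stomper", True), ("(", False),
         ("knoeier", False), (")", True)]


def build_parser():
    """Hypothetical command line: tokens are kept unless --detokenize is set."""
    parser = argparse.ArgumentParser(description="illustrative n-gram counter")
    parser.add_argument("--detokenize", action="store_true",
                        help="glue tokens back together before counting")
    return parser


def sentence_text(pairs, detokenize=False):
    """Return the text that n-grams would be counted on."""
    if not detokenize:
        # Proposed default: one token per word element.
        return " ".join(text for text, _ in pairs)
    # Optional behavior: honour the spacing information of each word.
    return "".join(text + (" " if space else "") for text, space in pairs).strip()


# Explicit argument lists are used so the demo does not depend on sys.argv.
for argv in ([], ["--detokenize"]):
    args = build_parser().parse_args(argv)
    print(argv, "->", sentence_text(PAIRS, detokenize=args.detokenize))
```

Using a store_true flag means existing invocations without the option get the tokenized text, which is the change of default discussed above.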

kosloot commented 6 years ago

implemented.