DoReCo / doreco

Collaborative data curation for DoReCo
Creative Commons Zero v1.0 Universal

How are word tokens counted? #1

Closed xrotwang closed 2 years ago

xrotwang commented 2 years ago

I have a problem reproducing the word token counts for the Cashinahua corpus given at http://concat.huma-num.fr/languages/cash1254. I'd expect to find 10,930 words in the CSV data (4217+1021+5692), but couldn't get exactly that number in my attempts to count. The closest I get is by taking (file, wd_ID) as the unique word identifier and removing all words that are only a pause:

$ csvgrep -c wd -m"<p:>" -i doreco_cash1254_dataset/doreco_cash1254_ph.csv | csvcut -c file,wd_ID | sort | uniq| wc -l
10957

Removing <<fp>> words as well, I already drop below the expected count:

$ csvgrep -c wd -m"<p:>" -i doreco_cash1254_dataset/doreco_cash1254_ph.csv | csvgrep -c wd -m"<<fp>>" -i | csvcut -c file,wd_ID | sort | uniq| wc -l
10862

Btw.: It may be worth spelling out that wd_ID is only unique within single files of a corpus, i.e. that (file, wd_ID) identifies words within a corpus.
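
For reference, the same kind of count can be sketched in Python with only the standard library. This is a minimal sketch, not DoReCo's own counting method: the function name `count_word_tokens` and the inline sample rows are made up for illustration, and only the column names `file`, `wd_ID` and `wd` and the `<p:>` label come from the data above.

```python
import csv
import io


def count_word_tokens(rows, skip=("<p:>",)):
    """Count distinct (file, wd_ID) pairs, ignoring rows whose wd contains a skipped label."""
    seen = set()
    for row in rows:
        # Substring test, mirroring what `csvgrep -m ... -i` does.
        if any(label in row["wd"] for label in skip):
            continue
        seen.add((row["file"], row["wd_ID"]))
    return len(seen)


# Made-up sample shaped like a *_ph.csv file: one row per phone,
# so a word that spans several phones repeats its wd_ID.
SAMPLE = """file,wd_ID,wd,ph
a,w1,aaa,a
a,w1,aaa,a
a,w2,<p:>,<p:>
b,w1,bbb,b
b,w1,bbb,b
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
n_tokens = count_word_tokens(rows)
print(n_tokens)
```

Keying on the pair rather than on `wd_ID` alone keeps `w1` in file `a` and `w1` in file `b` apart, which is the point made above about `wd_ID` being unique only within a file.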

matt-stave commented 2 years ago

Thanks. Turns out file-level word counts were not updated. We've detected some other issues and will be updating the word counts once they're resolved.

xrotwang commented 1 year ago

I'm still struggling with this. So, I tried to reproduce the word token counts for Cashinahua in DoReCo 1.2.

So I downloaded doreco_cash1254_wd.csv from https://api.nakala.fr/data/10.34847/nkl.a8f9q2f1/f299777c5c0ccaa2083d014f79071d37fcfba39f via the link from the language metadata https://nakala.fr/10.34847/nkl.a8f9q2f1

First thing I noticed is that the CSV seems to be ill-formatted:

$ csvstat cash_wd.csv
ValueError: Row 0 has 20 values, but Table only has 18 columns.

There seem to be two trailing columns in the data (but not in the header). Adding header columns confirms this:

 19. "x"
    Type of data:          Boolean
    Contains null values:  True (excluded from calculations)
    Unique values:         1
    Most common values:    None (13958x)

 20. "y"
    Type of data:          Boolean
    Contains null values:  True (excluded from calculations)
    Unique values:         1
    Most common values:    None (13958x)

Row count: 13958
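
A quick way to check for this kind of mismatch without patching the header is to compare the header width with the widths of the data rows. The following is a minimal sketch with the standard `csv` module; the function name `field_count_profile` and the three-column sample are hypothetical stand-ins, not the actual DoReCo file.

```python
import csv
import io
from collections import Counter


def field_count_profile(lines):
    """Return (header width, Counter mapping data-row width to number of rows)."""
    reader = csv.reader(lines)
    header = next(reader)
    widths = Counter(len(row) for row in reader)
    return len(header), widths


# Made-up sample: three header columns, but each data row ends in two empty extra fields.
SAMPLE = "a,b,c\n1,2,3,,\n4,5,6,,\n"

header_width, widths = field_count_profile(io.StringIO(SAMPLE))
print(header_width, dict(widths))
```

If every data row is wider than the header by the same amount, as in this sample, the extra fields are most likely trailing delimiters rather than shifted values.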

Now trying to reproduce the word token count (per file) as given on https://doreco.huma-num.fr/languages/cash1254

>>> 3959 + 639 + 5445
10043

I split the wd.csv into three files grouped by file and get:

$ csvgrep -c wd -m "<p:>" -i cash_GN_wd.csv | csvgrep -c wd -r "\<\<" -i | csvstat --count
3964
$ csvgrep -c wd -m "<p:>" -i cash_JC_wd.csv | csvgrep -c wd -r "\<\<" -i | csvstat --count
639
$ csvgrep -c wd -m "<p:>" -i cash_MB_wd.csv | csvgrep -c wd -r "\<\<" -i | csvstat --count
5447

Why do I only get the correct numbers for one file? After stripping out silent pauses and labels, there are no < left in the wd column - so no incomplete labels or similar.
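
The per-file counts can also be computed in one pass, without splitting the CSV first. This is a minimal sketch of the two filters used above (silent pauses and `<<` labels); the function name `tokens_per_file`, the short file names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tokens_per_file(rows):
    """Count rows per file, skipping silent pauses and <<...> labels."""
    counts = Counter()
    for row in rows:
        wd = row["wd"]
        if "<p:>" in wd or "<<" in wd:
            continue
        counts[row["file"]] += 1
    return counts


# Made-up sample shaped like a *_wd.csv file: one row per word.
SAMPLE = """file,wd_ID,wd
GN,w1,aaa
GN,w2,<p:>
GN,w3,<<fp>eee>
JC,w1,bbb
JC,w2,ccc
"""

per_file = tokens_per_file(csv.DictReader(io.StringIO(SAMPLE)))
print(dict(per_file))
```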

The difference of 2 for the third file might be explained by not counting rows with non-ASCII wd values (because there are two instances of this):

$ csvgrep -c wd -m "<p:>" -i cash_wd.csv | csvgrep -c wd -r "\<\<" -i | csvcut -c file,wd | csvgrep -c wd -r "^[a-zA-Z\-]+$" -i
file,wd
doreco_cash1254_MB_Autobiography,añunan
doreco_cash1254_MB_Autobiography,xaxún

But this would still leave the difference of 5 for the first file unexplained.
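
The non-ASCII check can be sketched in Python with the same regular expression. This only illustrates the filter; it is an assumption on my side that such words are what the official count skips. The two accented words are the ones found above, the other two inputs are made up.

```python
import re

# Same pattern as in the csvgrep call: ASCII letters and hyphens only.
ASCII_WORD = re.compile(r"^[a-zA-Z\-]+$")


def non_ascii_words(words):
    """Return the words that are not made up solely of ASCII letters and hyphens."""
    return [w for w in words if not ASCII_WORD.search(w)]


found = non_ascii_words(["xaxu", "añunan", "xaxún", "a-b"])
print(found)
```

Note that an empty `wd` value would be reported by this check as well, since the pattern requires at least one character.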

xrotwang commented 1 year ago

Another data point: For the first transcription file for Anal I count 1526 word tokens (as opposed to the 1340 indicated on the web site). There are a lot more non-ASCII words in this file (326) - alas too many to fit with my suspicion above.

xrotwang commented 1 year ago

FWIW, for Yongning I'm getting

$ csvstat yong_wd.csv -c wd
  6. "wd"

    Type of data:          Text
    Contains null values:  False
    Unique values:         2494
    Longest value:         23 characters
    Most common values:    <p:> (2039x)
                           tʰi (481x)
                           mv̩ (292x)
                           <<fp>əəə> (162x)
                           wɤ (138x)

Row count: 9816

I.e. 9816 different wd_IDs are listed for 9816 rows in wd.csv, 2039 of which are silent pauses. But the language metadata file (and the website) say there are 8877 word tokens.

xrotwang commented 1 year ago

@LuPaschen sorry, I didn't include all info you gave regarding how words are counted. Still, skipping "****" (fillers) and non-core speakers doesn't change anything for Cashinahua (there are no fillers and all 3 speakers are core speakers).

For Anal, the tokens I count go down a bit:

$ csvgrep -c wd -m "<p:>" -i anal_wd.csv | csvgrep -c wd -r "\<\<" -i | csvgrep -c wd -m"****" -i | csvgrep -c core_extended -m"core" | csvgrep -c speaker -m"unknown" -i | csvgrep -c speaker -m"UNK" -i | csvstat --count
14312

but are still too high.

For Yongning I undercounted, so no change here, either.

xrotwang commented 1 year ago

Computing the word token counts based on the *_ph.csv files rather than the *_wd.csv files seems to get me closer:

$ csvgrep -c wd -m "<p:>" -i raw/anal1239_wd.csv | csvgrep -c wd -r "\<\<" -i | csvgrep -c wd -m"****" -i | csvgrep -c core_extended -m"core" | csvgrep -c speaker -m"unknown" -i | csvgrep -c speaker -m"UNK" -i | csvcut -c wd_ID | sort | uniq | wc -l
14313
$ csvgrep -c wd -m "<p:>" -i raw/anal1239_ph.csv | csvgrep -c wd -r "\<\<" -i | csvgrep -c wd -m"****" -i | csvgrep -c core_extended -m"core" | csvgrep -c speaker -m"unknown" -i | csvgrep -c speaker -m"UNK" -i | csvcut -c wd_ID | sort | uniq | wc -l
13280

Considering that I'm already skipping pauses, labels, fillers and non-core speakers in the above query, it isn't clear to me why *_ph.csv is missing more than 1,000 words.
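
For anyone who wants to re-run this without csvkit, the filter chain can be sketched as a single Python predicate. The sketch also shows one side effect of the shell pipeline: `csvcut` writes a header line, so `sort | uniq | wc -l` counts one line more than there are distinct `wd_ID` values, which would be consistent with 14313 here versus 14312 from `csvstat --count` earlier. The function name `is_counted`, the sample rows and the value `extended` in `core_extended` are assumptions for illustration.

```python
import csv
import io


def is_counted(row):
    """Mirror the csvgrep chain: drop pauses, labels, fillers, non-core texts, unknown speakers."""
    wd = row["wd"]
    if "<p:>" in wd or "<<" in wd or "****" in wd:
        return False
    if "core" not in row["core_extended"]:
        return False
    return "unknown" not in row["speaker"] and "UNK" not in row["speaker"]


# Made-up sample rows; only the column names come from the commands above.
SAMPLE = """wd_ID,wd,core_extended,speaker
w1,aaa,core,S1
w2,<p:>,core,S1
w3,****,core,S1
w4,bbb,extended,S2
w5,ccc,core,unknown
w6,ddd,core,S1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
word_ids = {row["wd_ID"] for row in rows if is_counted(row)}

n_words = len(word_ids)              # distinct wd_ID values only
n_lines = len(word_ids | {"wd_ID"})  # what `sort | uniq | wc -l` sees, header line included
print(n_words, n_lines)
```

The header line accounts for a difference of one at most, so it does not explain the gap between the wd and ph counts.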

LuPaschen commented 1 year ago

@xrotwang Thanks for bringing this up. Word counts are a bit messy at the moment, I have to admit. I believe they were not consistently updated on the website after the DoReCo 1.2 release, which may explain some of the confusion.

One thing to note, though, is that "core" texts only count core speakers (present in _ph csv's) while "extended" texts count all speakers, regardless of their appearance in _ph csv's. So in short, to count "core" words, use the ph CSV, and to count "extended" words, use the wd CSV.

So if you find discrepancies between your counts and the numbers displayed on the website or in our own metadata files, it's probably safer to use your own calculations.

Note that in the future (i.e. DoReCo 1.3), the formula will be minimally different: