Thanks. Turns out file-level word counts were not updated. We've detected some other issues and will be updating the word counts once they're resolved.
I'm still struggling with this. I tried to reproduce the word token counts for Cashinahua in DoReCo 1.2, so I downloaded `doreco_cash1254_wd.csv` from https://api.nakala.fr/data/10.34847/nkl.a8f9q2f1/f299777c5c0ccaa2083d014f79071d37fcfba39f via the link from the language metadata at https://nakala.fr/10.34847/nkl.a8f9q2f1
First thing I noticed is that the CSV seems to be ill-formatted:
```
$ csvstat cash_wd.csv
ValueError: Row 0 has 20 values, but Table only has 18 columns.
```
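For anyone without csvkit at hand, the mismatch can also be spotted with a few lines of plain Python (just a sketch; `cash_wd.csv` is the locally renamed download from above):

```python
import csv

# Report rows whose field count differs from the header of the downloaded CSV.
with open("cash_wd.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for i, row in enumerate(reader, start=1):
        if len(row) != len(header):
            print(f"row {i}: {len(row)} fields vs. {len(header)} header columns")
```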
There seem to be two trailing columns in the data (but not in the header). Adding header columns confirms this:
```
19. "x"
        Type of data: Boolean
        Contains null values: True (excluded from calculations)
        Unique values: 1
        Most common values: None (13958x)
20. "y"
        Type of data: Boolean
        Contains null values: True (excluded from calculations)
        Unique values: 1
        Most common values: None (13958x)

Row count: 13958
```
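To get the file through csvkit at all, I just padded the header (the extra names, `x` and `y` above, are arbitrary placeholders, as is the output file name in this sketch):

```python
import csv

# Pad the header to the width of the widest row so the CSV parses cleanly.
with open("cash_wd.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

width = max(len(row) for row in rows)
rows[0] += [f"extra_{i}" for i in range(1, width - len(rows[0]) + 1)]

with open("cash_wd_padded.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```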
Now trying to reproduce the word token count (per file) as given on https://doreco.huma-num.fr/languages/cash1254
```
>>> 3959 + 639 + 5445
10043
```
I split the `wd.csv` into three files grouped by file and get:
```
$ csvgrep -c wd -m "<p:>" -i cash_GN_wd.csv | csvgrep -c wd -r "\<\<" -i | csvstat --count
3964
$ csvgrep -c wd -m "<p:>" -i cash_JC_wd.csv | csvgrep -c wd -r "\<\<" -i | csvstat --count
639
$ csvgrep -c wd -m "<p:>" -i cash_MB_wd.csv | csvgrep -c wd -r "\<\<" -i | csvstat --count
5447
```
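For reference, the same filter in Python, grouping by the `file` column instead of pre-splitting the CSV (this is only my reading of the counting rule, i.e. drop silent pauses and `<<...>` labels, certainly not the official DoReCo formula):

```python
import csv
from collections import Counter

# Count word rows per transcription file, skipping silent pauses (<p:>)
# and <<...> labels -- mirroring the csvgrep pipeline above.
counts = Counter()
with open("cash_wd.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wd = row["wd"]
        if "<p:>" in wd or "<<" in wd:
            continue
        counts[row["file"]] += 1

for name, n in sorted(counts.items()):
    print(name, n)
```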
Why do I only get the correct numbers for one file? After stripping out silent pauses and labels, there are no `<` left in the `wd` column, so there are no incomplete labels or similar.
The difference of 2 for the third file might be explained by rows with non-ASCII `wd` values not being counted (there are exactly two instances of this):
```
$ csvgrep -c wd -m "<p:>" -i cash_wd.csv | csvgrep -c wd -r "\<\<" -i | csvcut -c file,wd | csvgrep -c wd -r "^[a-zA-Z\-]+$" -i
file,wd
doreco_cash1254_MB_Autobiography,añunan
doreco_cash1254_MB_Autobiography,xaxún
```
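Or in Python, in case the regex above is the issue (here "non-ASCII" just means anything outside `[a-zA-Z-]`, exactly as in the csvgrep call):

```python
import csv
import re

# List words that survive the pause/label filter but contain characters
# outside plain ASCII letters and hyphens.
ascii_word = re.compile(r"^[a-zA-Z\-]+$")

with open("cash_wd.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        wd = row["wd"]
        if "<p:>" in wd or "<<" in wd:
            continue
        if not ascii_word.match(wd):
            print(row["file"], wd)
```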
But this would still leave the difference of 5 for the first file unexplained.
Another data point: For the first transcription file for Anal I count 1526 word tokens (as opposed to the 1340 indicated on the web site). There are a lot more non-ASCII words in this file (326) - alas too many to fit with my suspicion above.
FWIW, for Yongning I'm getting
```
$ csvstat yong_wd.csv -c wd
  6. "wd"
        Type of data: Text
        Contains null values: False
        Unique values: 2494
        Longest value: 23 characters
        Most common values: <p:> (2039x)
                            tʰi (481x)
                            mv̩ (292x)
                            <<fp>əəə> (162x)
                            wɤ (138x)

Row count: 9816
```
I.e. 9816 different wd_IDs are listed for the 9816 rows in `wd.csv`, 2039 of which are silent pauses (leaving at most 9816 - 2039 = 7777 non-pause tokens). But the language metadata file (and the website) say there are 8877 word tokens.
@LuPaschen sorry, I didn't include all info you gave regarding how words are counted. Still, skipping "****" (fillers) and non-core speakers doesn't change anything for Cashinahua (there are no fillers and all 3 speakers are core speakers).
For Anal, the tokens I count go down a bit:
```
$ csvgrep -c wd -m "<p:>" -i anal_wd.csv | csvgrep -c wd -r "\<\<" -i | csvgrep -c wd -m"****" -i | csvgrep -c core_extended -m"core" | csvgrep -c speaker -m"unknown" -i | csvgrep -c speaker -m"UNK" -i | csvstat --count
14312
```
but are still too high.
For Yongning I undercounted, so no change here, either.
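For transparency, this is the complete exclusion rule I'm applying now, as a single predicate (my interpretation of the criteria mentioned in this thread, not the official counting script):

```python
import csv

def is_countable_word(row):
    """Skip silent pauses, <<...> labels, **** fillers, non-core speakers
    and unknown speakers -- mirroring the csvgrep pipeline above."""
    wd = row["wd"]
    if "<p:>" in wd or "<<" in wd or "****" in wd:
        return False
    if "core" not in row["core_extended"]:
        return False
    if "unknown" in row["speaker"] or "UNK" in row["speaker"]:
        return False
    return True

with open("anal_wd.csv", newline="", encoding="utf-8") as f:
    print(sum(is_countable_word(row) for row in csv.DictReader(f)))
```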
Computing the word token counts based on the `*_ph.csv` files rather than the `*_wd.csv` files seems to get me closer:
```
$ csvgrep -c wd -m "<p:>" -i raw/anal1239_wd.csv | csvgrep -c wd -r "\<\<" -i | csvgrep -c wd -m"****" -i | csvgrep -c core_extended -m"core" | csvgrep -c speaker -m"unknown" -i | csvgrep -c speaker -m"UNK" -i | csvcut -c wd_ID | sort | uniq | wc -l
14313
$ csvgrep -c wd -m "<p:>" -i raw/anal1239_ph.csv | csvgrep -c wd -r "\<\<" -i | csvgrep -c wd -m"****" -i | csvgrep -c core_extended -m"core" | csvgrep -c speaker -m"unknown" -i | csvgrep -c speaker -m"UNK" -i | csvcut -c wd_ID | sort | uniq | wc -l
13280
```
Considering that I'm already skipping pauses, labels, fillers, and non-core speakers in the above query, it isn't clear to me why `*_ph.csv` is missing more than 1,000 words.
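For completeness, here is the comparison of the two files via distinct wd_ID values as a self-contained sketch (same exclusions as above; in the `*_ph.csv` each word is repeated once per phone, so deduplicating on `wd_ID` collapses it back to word level):

```python
import csv

def countable_word_ids(path):
    """Distinct wd_ID values after skipping pauses, labels, fillers,
    non-core speakers and unknown speakers."""
    keep = set()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            wd = row["wd"]
            if "<p:>" in wd or "<<" in wd or "****" in wd:
                continue
            if "core" not in row["core_extended"]:
                continue
            if "unknown" in row["speaker"] or "UNK" in row["speaker"]:
                continue
            keep.add(row["wd_ID"])
    return keep

print(len(countable_word_ids("raw/anal1239_wd.csv")))  # word-level file
print(len(countable_word_ids("raw/anal1239_ph.csv")))  # phone-level file
```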
@xrotwang Thanks for bringing this up. Word counts are a bit messy at the moment, I have to admit. I believe they were not consistently updated on the website after the DoReCo 1.2 release, which may explain some of the confusion.
One thing to note, though, is that "core" texts only count core speakers (present in the _ph CSVs), while "extended" texts count all speakers, regardless of their appearance in the _ph CSVs. So in short: to count "core" words, use the ph CSV, and to count "extended" words, use the wd CSV.
So if you find discrepancies between your counts and the numbers displayed on the website or in our own metadata files, it's probably safer to use your own calculations.
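In code, the rule amounts to roughly the following (just a sketch of the principle, reusing the kind of filtering shown earlier in this thread; this is not our actual counting script):

```python
import csv

def keep(wd):
    # same surface-level exclusions as in the snippets above:
    # silent pauses, <<...> labels and **** fillers
    return "<p:>" not in wd and "<<" not in wd and "****" not in wd

def core_word_count(ph_path):
    # "core" words: core speakers only, which is what *_ph.csv contains;
    # deduplicate on wd_ID because each word appears once per phone
    with open(ph_path, newline="", encoding="utf-8") as f:
        return len({r["wd_ID"] for r in csv.DictReader(f) if keep(r["wd"])})

def extended_word_count(wd_path):
    # "extended" words: all speakers, taken from *_wd.csv, one row per word
    with open(wd_path, newline="", encoding="utf-8") as f:
        return sum(keep(r["wd"]) for r in csv.DictReader(f))
```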
Note that in the future (i.e. DoReCo 1.3), the formula will be minimally different:
I have a problem reproducing the word token counts for the Cashinahua corpus given at http://concat.huma-num.fr/languages/cash1254. I'd expect to find 10,930 words in the CSV data (4217+1021+5692), but couldn't get exactly that number in my attempts to count. The closest I get is by taking `(file, wd_ID)` as the unique word identifier and removing all words which are only a pause. Removing `<<fp>>` words as well, I already drop below that number.

Btw.: It may be worth spelling out that `wd_ID` is only unique within single files of a corpus, i.e. that `(file, wd_ID)` identifies words within a corpus.
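To make that concrete, counting word tokens across a whole corpus would then look something like this (an illustration only; column names as in the DoReCo CSVs discussed above):

```python
import csv

def count_word_tokens(path):
    """Count distinct (file, wd_ID) pairs, since wd_ID alone is only
    unique within a single transcription file of a corpus."""
    pairs = set()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if "<p:>" in row["wd"]:  # skip silent pauses
                continue
            pairs.add((row["file"], row["wd_ID"]))
    return len(pairs)

print(count_word_tokens("cash_wd.csv"))
```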