Helsinki-NLP / OPUS

The Open Parallel Corpus

GNOME Catalan #15

Open jorgtied opened 4 months ago


I think there might be something wrong with one file from the GNOME corpus. These links are from "legacy" OPUS, but I think the problem might be the same when obtaining the file with a more current method.

The file is the Catalan (ca) monolingual plain text file from the GNOME corpus:
https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.txt.gz

According to the stats on the website, these are the expected stats for the file:

| language | files | tokens | sentences |
|----------|-------|--------|-----------|
| ca       | 2,071 | 6.4M   | 0.9M      |

However, the downloaded file "ca.txt.gz" has far fewer tokens and sentences (the `wc` columns are lines, words, bytes):

    zcat GNOME_v1_mono_ca.txt.gz | wc
    1422 13808 87751

In contrast, the corresponding ca.tok.gz (from https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.tok.gz) is a much larger file which actually has the expected number of lines:

    zcat GNOME_v1_mono_ca.tok.gz | wc
    668727 6386997 33416861

Could you check whether the ca.txt.gz is wrong?
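For anyone who wants to repeat the comparison without `zcat`, here is a minimal Python sketch of the same check. The helper name `gzip_wc` is hypothetical (not part of OPUS or any OPUS tooling), and it assumes the two files have already been downloaded under the local names used above. It counts newline characters, whitespace-separated words, and decompressed bytes, which approximates the three `wc` columns; word counts may differ slightly from `wc` depending on locale.

```python
import gzip


def gzip_wc(path):
    """Return (lines, words, bytes) of the decompressed file, similar to `zcat path | wc`.

    Hypothetical helper for illustration only.
    """
    lines = words = nbytes = 0
    with gzip.open(path, "rb") as fh:
        for raw in fh:
            # wc counts newline characters, so a final line without "\n" adds 0
            lines += raw.count(b"\n")
            # bytes.split() with no argument splits on runs of ASCII whitespace
            words += len(raw.split())
            nbytes += len(raw)
    return lines, words, nbytes
```

Calling `gzip_wc("GNOME_v1_mono_ca.txt.gz")` and `gzip_wc("GNOME_v1_mono_ca.tok.gz")` and comparing the first two values of each result against the sentence and token counts on the website would show whether the plain text file is truncated relative to the tokenized one.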