common-voice / cv-dataset

Metadata and versioning details for the Common Voice dataset
https://commonvoice.mozilla.org/datasets
Mozilla Public License 2.0

Minor Bug in Text Corpus calculations #30

Open HarikalarKutusu opened 7 months ago

HarikalarKutusu commented 7 months ago

This occurs in the v17.0 data, and only for the cnh locale. Somewhere a 1 is subtracted from the line count (apparently to drop the header line), which yields a negative value when the file has no data at all (and therefore no header line). The Laiholh (Hakha) (Hakha Chin) locale has no unvalidated sentences, and its unvalidated_sentences.tsv file is completely empty.

    "cnh": {
      ...
      "validatedSentences": 5218,
      "unvalidatedSentences": -1,
      ...
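
The behaviour described above can be reproduced with a minimal sketch. The actual cv-dataset counting code is not shown in this thread, so the function names `naive_count` and `safe_count` and the sample header `sentence_id\tsentence` are hypothetical; the sketch only assumes the count is derived as "number of lines minus one header line".

```python
def naive_count(tsv_text: str) -> int:
    """Hypothetical reconstruction of the bug: always subtract one for the header."""
    return len(tsv_text.splitlines()) - 1


def safe_count(tsv_text: str) -> int:
    """Count data rows, clamping at zero so an empty file never goes negative."""
    # Ignore blank lines so a trailing newline does not inflate the count.
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


empty_file = ""                              # e.g. a completely empty TSV file
header_only = "sentence_id\tsentence\n"      # made-up header, no data rows
two_rows = header_only + "1\tfoo\n2\tbar\n"  # made-up header plus two data rows

print(naive_count(empty_file))  # negative for an empty file
print(safe_count(empty_file), safe_count(header_only), safe_count(two_rows))
```

Clamping with `max(..., 0)` is only one possible fix; checking whether the file is empty before subtracting the header would work equally well.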

KathyReid commented 7 months ago

Nice catch, Bülent!