common-voice / cv-dataset

Metadata and versioning details for the Common Voice dataset
https://commonvoice.mozilla.org/datasets
Mozilla Public License 2.0

Wrong checksums for Common Voice Corpus 13.0 #21

Open paniedziela opened 1 year ago

paniedziela commented 1 year ago

Hello, I usually verify the checksum after each download. Up to Common Voice Corpus 12.0 this worked without problems, but for Common Voice Corpus 13.0 I suspect the published checksums are wrong: my downloads complete without errors, yet the checksums don't match. I have neither the resources nor the time to check more datasets, but I can provide a few (I suppose all checksums for this version were calculated incorrectly):

HarikalarKutusu commented 1 year ago

There was a similar report on delta segments here:

https://discourse.mozilla.org/t/sha256-checksum-seems-to-be-wrong-for-common-voice-delta-segment-xx-x-and-what-is-delta-segment/111765/5

A possible cause: the original release of v13.0 lacked the default splits, but the splits were generated before the records here were calculated. If you downloaded these datasets more than a couple of days ago, your archives would not contain them. As of a couple of days ago the splits are included in the downloads, but this time the "reported.tsv" files were taken from up-to-date versions. So whether the download is old or new, its checksum should differ from the records here.

I think the only way to correct them is to recalculate the checksums and update the values here.
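
For anyone who wants to recalculate a value locally, here is a minimal sketch in Python. It assumes the recorded checksums are SHA-256 hex digests (as the linked Discourse thread suggests); the function names `sha256_of_file` and `matches_recorded` are my own and not part of any Common Voice tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte archives.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_checksum):
    """Compare a file's digest with a recorded value (hypothetical helper).

    Whitespace and letter case in the recorded value are ignored.
    """
    return sha256_of_file(path) == recorded_checksum.strip().lower()
```

Calling `sha256_of_file` on a downloaded archive gives the value to compare with (or to put into) the records here; `matches_recorded` returns `False` for any archive whose content differs from the one the record was computed from, which is what the repackaging described above would produce.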