common-voice / cv-dataset

Metadata and versioning details for the Common Voice dataset
https://commonvoice.mozilla.org/datasets
Mozilla Public License 2.0

German 12.0 Segment missing train dev test TSV files #20

Open LozramA opened 1 year ago

LozramA commented 1 year ago

The newest German segment, "cv-corpus-12.0-delta-2022-12-07-de.tar", does not include train.tsv, dev.tsv, and test.tsv.

HarikalarKutusu commented 1 year ago

Hey @LozramA, I don't know your workflow, but if you have the validated.tsv file, you could merge it with the v11.0 validated.tsv and use CorporaCreator to generate a new train/dev/test split.
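A minimal sketch of the merge step, not an official tool: it assumes both validated.tsv files are plain tab-separated text with a header row and a `path` column that uniquely identifies each clip. The function names and file locations are hypothetical. Rows from the delta file replace rows with the same `path` in the older file, and columns that exist in only one file are filled with empty strings.

```python
def read_tsv(path):
    """Read a tab-separated file into (header, list of row dicts)."""
    with open(path, encoding="utf-8") as f:
        header = next(f).rstrip("\n").split("\t")
        rows = []
        for line in f:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                rows.append(dict(zip(header, line.split("\t"))))
    return header, rows


def merge_validated(base_path, delta_path, out_path, key="path"):
    """Merge two validated.tsv files, de-duplicating on `key`.

    Assumption: `key` (here the clip file name column) is present in
    both files. Returns the number of rows written.
    """
    base_header, base_rows = read_tsv(base_path)
    delta_header, delta_rows = read_tsv(delta_path)

    # Union of columns, keeping the base file's column order first.
    header = base_header + [c for c in delta_header if c not in base_header]

    merged = {}
    for row in base_rows + delta_rows:
        merged[row[key]] = row  # a later (delta) row wins on duplicates

    with open(out_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\t".join(header) + "\n")
        for row in merged.values():
            f.write("\t".join(row.get(c, "") for c in header) + "\n")
    return len(merged)
```

With hypothetical paths, `merge_validated("cv-11.0/de/validated.tsv", "cv-12.0-delta/de/validated.tsv", "merged/validated.tsv")` would produce a combined file that could then be given to CorporaCreator to regenerate the splits.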

LozramA commented 1 year ago

Thanks @HarikalarKutusu, but the German v11 segment is missing all of the TSV files, so it is completely unusable. I used the full CV12 release, but that is now getting too big for privately available computer/GPU power. Training ran for many days and did not get through many epochs (RTX 2060, Intel i7).

HarikalarKutusu commented 1 year ago

Yes, if you do not already have v11 in full, you need to download it, unfortunately... And I know, it is a painful process.

Btw, if you already have the mp3 files, I can share full .tsv files with you for any version... I had to extract them for the cv-dataset-analyzer project I implemented.

And secondly, the correct repo for the issue is https://github.com/common-voice/common-voice-bundler