common-voice / common-voice-bundler

Script for bundling Common Voice (https://commonvoice.mozilla.org/) clips by language

Bundle link for dataset common voice 7 #15

Open patrickvonplaten opened 3 years ago

patrickvonplaten commented 3 years ago

Hey Common-Voice team!

Thanks a lot for releasing the common voice 7 dataset - it's great to see so many new languages!

At Hugging Face, we have worked a lot with the common voice 6.1 dataset and trained speech models in almost every language of the common voice 6 dataset. In total, we open-sourced 240 speech models trained on the common voice dataset, see here.

For the common voice 6.1 dataset it is possible to directly download a language specific dataset via this bundle link:

https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/{lang}.tar.gz

This is super convenient and allows us to provide the following simple commands to the community to download and process the dataset:

from datasets import load_dataset

ds = load_dataset("common_voice", "ab")
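To illustrate how the 6.1 bundle link template quoted above resolves to a per-language download, here is a minimal sketch. The helper name `bundle_url` and its basic validation are assumptions made for illustration and are not part of the `datasets` library or the bundler; only the URL template itself comes from this issue.

```python
# URL template for the Common Voice 6.1 bundles, as quoted in this issue.
BUNDLE_URL_TEMPLATE = (
    "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4"
    ".s3.amazonaws.com/cv-corpus-6.1-2020-12-11/{lang}.tar.gz"
)


def bundle_url(lang: str) -> str:
    """Return the 6.1 bundle URL for a language code (hypothetical helper)."""
    # Reject empty codes and path separators so the result stays one object key.
    if not lang or "/" in lang:
        raise ValueError(f"invalid language code: {lang!r}")
    return BUNDLE_URL_TEMPLATE.format(lang=lang)


if __name__ == "__main__":
    # "ab" is the language code used in the load_dataset example above.
    print(bundle_url("ab"))
```

A stable template like this is what would let a loader fetch one archive per language without any interactive step.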

Do you guys think there is a chance that you could also provide a bundled link for the common voice 7 dataset?

Best, the Datasets team @ Hugging Face

ftyers commented 3 years ago

Hi @patrickvonplaten,

Thanks for your message. Does the load_dataset() function give the ability to collect an email address from the person loading the data? You might also be interested in the following issue: https://github.com/common-voice/common-voice/issues/3262

patrickvonplaten commented 3 years ago

Hey @ftyers,

Thanks a lot for your great write-up at https://github.com/common-voice/common-voice/issues/3262 - I very much agree with your points!

@lhoestq @thomwolf - could we maybe provide an optional "email address" field that is required for Common Voice 7?