common-voice / cv-dataset

Metadata and versioning details for the Common Voice dataset
https://commonvoice.mozilla.org/datasets
Mozilla Public License 2.0

Feature request: Datasets with only validated recordings #27

Open soliviantar opened 1 year ago

I've already posted this in the main repo, but seeing #26 here makes me think this might be the more appropriate place to request it.

When downloading a dataset, one must download the whole set (or a delta), including all sentences and recordings, validated or not, even if only the validated data is needed. This consumes a lot of bandwidth, time and disk space, and it is not environmentally friendly either.

Offering the option to download only the validated recordings would save a lot of time and make the data accessible to more people. Being able to download only the tsv files would also be a good addition, but this is already addressed in #26.
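
To illustrate the subset I mean, here is a rough Python sketch. It assumes the extracted archive contains a `validated.tsv` file with a `path` column and a `clips/` directory holding the audio; that layout is my assumption, and `copy_validated_clips` is a hypothetical helper, not part of any Common Voice tooling.

```python
import csv
import shutil
from pathlib import Path


def copy_validated_clips(dataset_dir, output_dir):
    """Copy only the clips listed in validated.tsv into output_dir.

    Assumed layout (not confirmed in this issue):
      dataset_dir/validated.tsv  - tab-separated, with a "path" column
      dataset_dir/clips/         - one audio file per row
    Returns the list of file names that were copied.
    """
    dataset_dir = Path(dataset_dir)
    output_dir = Path(output_dir)
    (output_dir / "clips").mkdir(parents=True, exist_ok=True)

    copied = []
    with open(dataset_dir / "validated.tsv", newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            source = dataset_dir / "clips" / row["path"]
            if source.is_file():  # skip rows whose audio file is missing
                shutil.copy2(source, output_dir / "clips" / source.name)
                copied.append(source.name)

    # Keep the metadata next to the audio it describes.
    shutil.copy2(dataset_dir / "validated.tsv", output_dir / "validated.tsv")
    return copied
```

Of course, this only works after the full download has finished, which is exactly the cost I'd like to avoid. The request is for the download itself to be limited to this subset.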

I don't know how complex this would be to implement, but I feel it would be a very useful quality-of-life feature, so I hope it is taken into consideration.

Thanks for your work on this amazing project in any case!