cfoster0 / CLAP

Contrastive Language-Audio Pretraining
BSD 3-Clause "New" or "Revised" License

Add Dataset Downloading, Info, and Checksums #1

Open cfoster0 opened 3 years ago

cfoster0 commented 3 years ago

We want to go for the largest datasets we can for this. They are listed in a Google doc. Not all of them will be downloadable via public links, so we want to provide checksums in this repo so that folks know they're working with the same data once they acquire it. Would also be nice to give the dataset info in the repo.

EDIT: Google doc is linked here.

louisgv commented 3 years ago

Sounds good, haven't heard back from Spotify yet. Mozilla Common Voice checksum:

https://commonvoice.mozilla.org/en/datasets

sha256 checksum: 0f8fdfc4fe715738be94ee49c4fb63d5f1608d2e6a43a2bed80f6cb871171c36
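
For anyone checking their download against that value, a minimal verification sketch in Python (the archive filename below is just a placeholder for whatever Mozilla serves you):

```python
import hashlib

EXPECTED = "0f8fdfc4fe715738be94ee49c4fb63d5f1608d2e6a43a2bed80f6cb871171c36"

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file in chunks so large archives don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Placeholder filename; substitute the archive you actually downloaded.
assert sha256sum("cv-corpus-en.tar.gz") == EXPECTED
```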

cfoster0 commented 3 years ago

Some stats on Common Voice English version 6.1.

1,224,864 validated clips, of which 1,224,858 have UTF-8 captions. 596,665 unique sentences, averaging 52 characters (as counted by Python, i.e. Unicode code points) and 52.9 bytes.

Quantiles of byte lengths:

| Percentile | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 95% | 98% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bytes | 23 | 32 | 39 | 45 | 52 | 60 | 67 | 74 | 83 | 90 | 97 |

Max byte length of 210.

From a quick sample of the audio data, the average clip length is just under 6 seconds.
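
For reproducibility, here is a rough sketch of how stats like these could be recomputed from the release metadata (the validated.tsv path and sentence column are assumptions about the Common Voice layout):

```python
import csv
import statistics

# Assumed layout: a validated.tsv with a "sentence" column, per Common Voice releases.
byte_lengths = []
with open("validated.tsv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        byte_lengths.append(len(row["sentence"].encode("utf-8")))

print(f"{len(byte_lengths)} captions, "
      f"mean {sum(byte_lengths) / len(byte_lengths):.1f} bytes")
# Deciles of caption byte length (roughly the quantiles quoted above).
print([round(q) for q in statistics.quantiles(byte_lengths, n=10)])
```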

afiaka87 commented 3 years ago

@cfoster0 what's the expected format here? Images of spectrograms?

cfoster0 commented 3 years ago

@afiaka87 Good question. For now, the plan is to preprocess the data in two steps (a rough sketch follows the list):

  1. Trimmed 15-second .wav files, padded with silence if the original audio clip was shorter, plus a dataframe mapping filenames to their text captions.
  2. Mel spectrograms of the audio saved as .pt files, plus an lm_dataformat archive of the captions. (This step may change to TFRecords in the future.)
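
As an illustration only, a minimal sketch of that pipeline assuming torchaudio and PyTorch; the 16 kHz sample rate and 80 mel bins are placeholder choices, not project decisions:

```python
import torch
import torch.nn.functional as F
import torchaudio

TARGET_SECONDS = 15
SAMPLE_RATE = 16_000   # placeholder; the project may settle on a different rate
N_MELS = 80            # placeholder mel bin count

def preprocess_clip(in_path, out_path):
    # Step 1: load, resample, mix down to mono, and trim/pad to 15 seconds.
    wav, sr = torchaudio.load(in_path)
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav.mean(dim=0, keepdim=True)
    target_len = TARGET_SECONDS * SAMPLE_RATE
    if wav.shape[-1] >= target_len:
        wav = wav[..., :target_len]
    else:
        wav = F.pad(wav, (0, target_len - wav.shape[-1]))  # pad with silence
    # Step 2: compute a mel spectrogram and save it as a .pt file.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_mels=N_MELS
    )(wav)
    torch.save(mel, out_path)
```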
afiaka87 commented 3 years ago

Why not just the images? We've got working code for that over in DALLE-pytorch right now, which has loaded well in excess of 5 million image-text pairs without bottlenecking. We could get started a bit faster that way and leave this issue open to implement a more efficient storage solution if the dataloader becomes a bottleneck.

afiaka87 commented 3 years ago

> We want to go for the largest datasets we can for this. They are listed in a Google doc. Not all of them will be downloadable via public links, so we want to provide checksums in this repo so that folks know they're working with the same data once they acquire it. Would also be nice to give the dataset info in the repo.

Can you at least list them without the download links? Or share a link to said Google document?

cfoster0 commented 3 years ago

For sure. Give me a minute and I'll list them here for starters.

cfoster0 commented 3 years ago

And I don't quite know what you mean by images. Spectrograms aren't really images even though you can look at them as if they were. For small scale tests I don't think the current code will bottleneck us.

cfoster0 commented 3 years ago

Largest English speech datasets:

* Spotify Podcasts Dataset https://podcastsdataset.byspotify.com/
* MLS http://www.openslr.org/94/
* Common Voice https://commonvoice.mozilla.org/en
* SPGISpeech https://datasets.kensho.com/datasets/spgispeech

The code within this repo should be agnostic to language and speech vs. non-speech audio. For a larger list of datasets, see the Google doc here.

afiaka87 commented 3 years ago

> And I don't quite know what you mean by images. Spectrograms aren't really images even though you can look at them as if they were. For small scale tests I don't think the current code will bottleneck us.

Ah my apologies - you're correct, there's no reason to store them as a visual representation.

afiaka87 commented 3 years ago

> Largest English speech datasets:
>
> * Spotify Podcasts Dataset https://podcastsdataset.byspotify.com/
> * MLS http://www.openslr.org/94/
> * Common Voice https://commonvoice.mozilla.org/en
> * SPGISpeech https://datasets.kensho.com/datasets/spgispeech
>
> The code within this repo should be agnostic to language and speech vs. non-speech audio. For a larger list of datasets, see the Google doc here.

Fantastic, that's quite the list! Thanks!