Add 'small' subset

andimarafioti commented 5 years ago

Hi! Thanks for the dataset. It would be useful for me if you provided a 'small' subset like FMA does: 8,000 tracks of 30 s across 8 balanced genres (GTZAN-like), 7.2 GiB in total. I know I could make a subset myself with the script cited in the README, but I would need to download 100x the amount of data I want and then process it. If you think it's worth it and are willing to host it, I can also make the subset myself and upload it somewhere. Thanks!

dbogdanov commented 5 years ago

Hi @andimarafioti, yes we are working on that ;-) Will update soon.

abugler commented 3 years ago

Hi! Is there an update on the small subset? Thank you so much.

dbogdanov commented 1 year ago

Note that we have included lower-bitrate mono audio downloads that significantly reduce the download size (the full dataset goes from 508 GB to 156 GB). I assume this is still not small enough for a "small" dataset...

We lack a specific proposal for what the small subset should include. Should it cover all tags in MTG-Jamendo or a subset of tags?

Another alternative is to create a version of the full dataset with audio fragments instead of full tracks. Using 2-minute or 30-second fragments for each track reduces the total dataset duration from ~3778 hours to ~1856.7 or ~464 hours, respectively. The low-bitrate mono 30-second fragment version would take ~19 GB, which is very reasonable.
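For anyone who wants to check these figures, here is a rough sketch of the arithmetic. It uses only the numbers quoted above and makes two assumptions that are not stated in the thread: that every track contributes a full-length fragment (which is how the track count is backed out of the 2-minute figure), and that file size scales linearly with duration at a constant bitrate.

```python
# Figures quoted in this thread.
FULL_HOURS = 3778.0          # approximate total duration of the full dataset
FULL_LOWBITRATE_GB = 156.0   # size of the low-bitrate mono full dataset
TWO_MIN_HOURS = 1856.7       # quoted duration of the 2-minute fragment version

# Assumption: every track yields a full 2-minute fragment, so the quoted
# duration implies the number of tracks.
n_tracks = round(TWO_MIN_HOURS * 3600 / 120)


def fragment_hours(track_count: int, fragment_seconds: float) -> float:
    """Total duration in hours when each track is cut to a fixed-length fragment."""
    return track_count * fragment_seconds / 3600


def scaled_size_gb(hours: float) -> float:
    """Estimated size, assuming size is proportional to duration (constant bitrate)."""
    return FULL_LOWBITRATE_GB * hours / FULL_HOURS


hours_30s = fragment_hours(n_tracks, 30)
print(f"implied tracks: {n_tracks}")
print(f"30 s fragments: {hours_30s:.1f} h, about {scaled_size_gb(hours_30s):.1f} GB")
```

Under these assumptions the 30-second version comes out at roughly 464 hours and a little over 19 GB, in line with the estimate above.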

dbogdanov commented 1 year ago

Related to this, @philtgun has previously created subsets of MTG-Jamendo with one random track per artist (5 random trials) and one random track per album, to see the statistics (autotagging_toy_0..4 and autotagging_toy_album_0). Leaving this here for reference.
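For illustration only, here is a minimal sketch of how a "one random track per artist" (or per album) subset could be drawn. This is not the script used for the autotagging_toy splits; the function name `one_track_per_group` and the `(track_id, group_id)` input format are hypothetical.

```python
import random
from collections import defaultdict


def one_track_per_group(tracks, seed):
    """Pick one random track per group (e.g. per artist or per album).

    tracks: iterable of (track_id, group_id) pairs (hypothetical format).
    seed:   integer seed, so that each trial is reproducible.
    Returns a list of chosen track ids, ordered by group id.
    """
    by_group = defaultdict(list)
    for track_id, group_id in tracks:
        by_group[group_id].append(track_id)

    rng = random.Random(seed)
    # Sort groups and candidates so the result depends only on the seed,
    # not on the order of the input.
    return [rng.choice(sorted(ids)) for _, ids in sorted(by_group.items())]


# Toy data: three artists with 2, 1 and 3 tracks.
toy_tracks = [
    ("t1", "a1"), ("t2", "a1"),
    ("t3", "a2"),
    ("t4", "a3"), ("t5", "a3"), ("t6", "a3"),
]

# Five random trials, mirroring the five per-artist trials mentioned above.
trials = [one_track_per_group(toy_tracks, seed) for seed in range(5)]
```

Each trial contains exactly one track per artist; passing album ids instead of artist ids as the group gives the per-album variant.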

MTG / mtg-jamendo-dataset

Add 'small' subset #13