facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License
20.23k stars 2.03k forks

Advice to preprocess small custom dataset #324

Open r03922123 opened 9 months ago

r03922123 commented 9 months ago

Thank you all for making this project public!

When splitting my custom dataset into train and test sets (it is small, fewer than 2,000 samples), I am wondering whether I should put all samples (segments) from the same song into the same split to prevent data leakage.

I am asking because a song usually contains multiple repeated loops, so segments from the same song can sound nearly identical, and the splitting strategy may need to differ from the one used for speech or images.
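To make the question concrete, here is a minimal sketch of the song-level split I have in mind. It is not taken from AudioCraft: the function name `split_by_song`, the `(segment_path, song_id)` pair format, and the `valid_fraction` parameter are all hypothetical, and it assumes each segment can be mapped back to the song it was cut from.

```python
import random
from collections import defaultdict


def split_by_song(segments, valid_fraction=0.1, seed=0):
    """Split segments so that every song lands in exactly one split.

    segments: list of (segment_path, song_id) pairs (hypothetical format).
    Returns (train_paths, valid_paths).
    """
    # Group segment paths by the song they were cut from.
    by_song = defaultdict(list)
    for path, song_id in segments:
        by_song[song_id].append(path)

    # Shuffle songs, not segments, so no song is shared between splits.
    song_ids = sorted(by_song)
    random.Random(seed).shuffle(song_ids)

    # Hold out at least one whole song for validation.
    n_valid = max(1, round(len(song_ids) * valid_fraction))
    valid_songs = set(song_ids[:n_valid])

    train = [p for s in song_ids if s not in valid_songs for p in by_song[s]]
    valid = [p for s in song_ids if s in valid_songs for p in by_song[s]]
    return train, valid


# Toy example: 10 songs with 5 segments each (made-up file names).
segments = [(f"song{s:02d}_seg{i}.wav", f"song{s:02d}")
            for s in range(10) for i in range(5)]
train, valid = split_by_song(segments, valid_fraction=0.2, seed=0)
```

The fraction here applies to songs rather than segments, so the split sizes can drift from the target when songs have very different numbers of segments.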

Could you share your strategy for distributing segments across splits? And what would you advise when a custom dataset is much smaller than the one used for the pretrained model?

If anyone can share advice or references, I would really appreciate it. Thank you!