Advice on preprocessing a small custom dataset

Thank you all for making this project public!

When splitting my custom dataset with train_test_split (it is small, fewer than 2000 samples), I am wondering whether all samples (segments) from the same song should go into the same split, to prevent data leakage.

The reason I ask is that a song usually contains multiple repeated loops, so segments from one song can be very similar to each other, and the strategy may not be the same as for speech or images.
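To make the question concrete, here is a minimal sketch of the song-level split I have in mind. It uses only the Python standard library; the function name `split_by_song`, the file names, and the song IDs are all hypothetical and just for illustration. I believe scikit-learn's `GroupShuffleSplit` could do something similar by passing the song IDs as groups, but I have not verified that it is the right fit here.

```python
import random
from collections import defaultdict


def split_by_song(segments, test_fraction=0.2, seed=0):
    """Split (segment_path, song_id) pairs so that no song spans both splits.

    Note: test_fraction is applied to the number of songs, not segments,
    so the segment-level ratio is only approximate when songs differ in length.
    """
    # Collect the segments that belong to each song.
    by_song = defaultdict(list)
    for path, song_id in segments:
        by_song[song_id].append(path)

    # Shuffle the song IDs (not the segments) with a fixed seed.
    song_ids = sorted(by_song)
    random.Random(seed).shuffle(song_ids)

    # Hold out at least one whole song for the test split.
    n_test = max(1, round(len(song_ids) * test_fraction))
    test_songs = set(song_ids[:n_test])

    train = [p for s in song_ids if s not in test_songs for p in by_song[s]]
    test = [p for s in song_ids if s in test_songs for p in by_song[s]]
    return train, test, test_songs


# Hypothetical data: 5 songs with 4 segments each.
segments = [(f"song{s}_seg{i}.wav", f"song{s}") for s in range(5) for i in range(4)]
train, test, test_songs = split_by_song(segments, test_fraction=0.2, seed=0)
```

With this approach every segment of a held-out song lands in the test split, so repeated loops from one song cannot appear on both sides.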

Could you share your strategy for distributing segments across splits? And what would you advise when the custom dataset is much smaller than the data the pretrained model was trained on?

If anyone can share advice or references, I would really appreciate it. Thank you!

facebookresearch/audiocraft, issue #324