Quality problems with fine-tuning the musicgen model

I would like to fine-tune the small MusicGen model. My dataset consists of various short sounds, and it is not large. I have already trained successfully on single-instrument melodies, where the dataset was extremely small (only 3 hours), and I got interesting results. With short sounds, however, things are different: in 80% of cases or more, the generated audio contains a crackling noise right after the main attack of the sound.

I also noticed that writing the same word in the prompt in lowercase or with a capital letter gives different results, and which variant works depends on the word. One variant produces the expected result, while the other produces a jumble of random sounds, as if the model had not been trained at all. (I checked the original models too, and they behave the same way for uppercase and lowercase versions of the same word.) I would also like to know more about how the text tags stored in each sample's JSON file are merged. I'm new to this model. Thanks in advance for your answer and help!
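To make the tag and case questions concrete, here is a minimal sketch of what I mean. It is not audiocraft's actual merging logic: the helper `build_prompt`, the field names (`description`, `genre`, `instrument`, `keywords`) and the `field: value` output format are all assumptions for illustration. It only shows how tags from one sample's JSON could be joined into a single prompt and lowercased, so that training text and inference prompts use the same casing.

```python
import json


def build_prompt(sample_json: str, lowercase: bool = True) -> str:
    """Merge one sample's text tags into a single prompt (hypothetical helper)."""
    meta = json.loads(sample_json)
    parts = []

    # Free-text description goes first, if the sample has one.
    description = meta.get("description")
    if description:
        parts.append(str(description).strip())

    # Assumed tag fields; missing or empty fields are skipped.
    for field in ("genre", "instrument", "keywords"):
        value = meta.get(field)
        if not value:
            continue
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{field}: {value}")

    prompt = ". ".join(parts)
    # Lowercasing removes case as a variable between training and inference.
    return prompt.lower() if lowercase else prompt


sample = '{"description": "Short Snare hit", "genre": "Percussion", "keywords": ["Dry", "tight"]}'
prompt = build_prompt(sample)
print(prompt)  # short snare hit. genre: percussion. keywords: dry, tight
```

If the same normalization is applied to the prompt at generation time, "Snare" and "snare" reach the model as the same text, which at least rules out casing as the cause of the random-sounding outputs.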

facebookresearch / audiocraft

Quality problems with fine-tuning the musicgen model #447