facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Quality problems with fine-tuning the musicgen model #447


MajaSoure commented 2 months ago

I would like to fine-tune the small MusicGen model. My dataset consists of various short sounds and is not large. I previously trained successfully on the melodies of a single instrument, where the dataset was extremely small (only 3 hours), and got interesting results. With short sounds, however, things are different: in 80% of cases or more, the generated audio has a crackling sound after the main attack.

I've tried various ways to optimize training, such as reducing the learning rate or dropout, but so far nothing has clearly helped. My logs also look very suspicious from the very beginning of training:

Train Summary | Epoch 1 | lr=1.00E+00 | grad_norm=INF | grad_scale=45645.824 | ce=0.962 | ppl=2.650 | duration=2472.758
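To make summary lines like the one above easier to check across epochs, here is a minimal sketch that parses one into a dictionary and flags non-finite values such as `grad_norm=INF`. The helper names `parse_summary` and `non_finite_metrics` are hypothetical and not part of audiocraft; the line format is assumed to match the single example quoted in this post.

```python
import math


def parse_summary(line: str) -> dict:
    """Parse 'Train Summary | Epoch 1 | key=value | ...' into a dict (hypothetical helper)."""
    metrics = {}
    for part in line.split("|"):
        part = part.strip()
        if "=" in part:
            key, value = part.split("=", 1)
            # float() accepts 'INF' and scientific notation such as '1.00E+00'.
            metrics[key.strip()] = float(value)
        elif part.lower().startswith("epoch"):
            metrics["epoch"] = int(part.split()[1])
    return metrics


def non_finite_metrics(metrics: dict) -> list:
    """Return the names of metrics that are inf or NaN."""
    return sorted(
        name for name, value in metrics.items()
        if isinstance(value, float) and not math.isfinite(value)
    )


line = ("Train Summary | Epoch 1 | lr=1.00E+00 | grad_norm=INF | "
        "grad_scale=45645.824 | ce=0.962 | ppl=2.650 | duration=2472.758")
metrics = parse_summary(line)
print(non_finite_metrics(metrics))
```

If the run uses mixed precision with a PyTorch gradient scaler, an occasional infinite gradient norm usually just means that step was skipped and the scale was lowered, so what matters is whether the value stays non-finite over many epochs. The logged `lr=1.00E+00` is also worth comparing against the learning rate set in the config.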

It's also interesting that I get different results depending on whether a word in the prompt starts with a lowercase or an uppercase letter, and which form works depends on the word. In one case the result is as expected, but in the other I get a jumble of random sounds, as if the model had not been trained. (By the way, I checked the original models, and they behave similarly for uppercase and lowercase versions of the same word during generation.) I'd also be very interested to learn how the text tags stored in JSON format for each sample are merged. I'm new to this model. Thanks in advance for your answer and help!
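One way to rule out casing as a variable is to normalize text the same way in the training descriptions and in the generation prompts. The sketch below is only an illustration of that idea: `normalize_prompt` is a hypothetical helper, not an audiocraft function, and lowercasing everything is an assumption that may not suit every dataset.

```python
def normalize_prompt(text: str) -> str:
    """Lowercase the text and collapse repeated whitespace (hypothetical helper)."""
    return " ".join(text.lower().split())


# The same normalization would be applied to dataset descriptions and prompts.
prompts = ["Snare", "snare", "  Snare   Drum "]
print([normalize_prompt(p) for p in prompts])
```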

yocontra commented 2 months ago

These parameters might be useful to look at within the musicgen codebase, if you want to understand how it goes from json -> text:

dataset.train.merge_text_p
dataset.train.drop_desc_p
dataset.train.drop_other_p
conditioners.description.t5.word_dropout
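As a rough illustration of what parameters with these names could control, the sketch below merges metadata fields into the description with some probability, optionally drops the original description or individual fields, and then drops individual words. It is a simplified assumption, not the audiocraft implementation: the field list, the `merge_text` and `word_dropout` helpers, and the exact probability semantics are hypothetical and should be checked against the actual source.

```python
import random
from typing import Optional

# Assumed metadata keys in each sample's JSON file (hypothetical list).
METADATA_FIELDS = ["key", "bpm", "genre", "moods", "instrument", "keywords"]


def merge_text(sample: dict, merge_text_p: float, drop_desc_p: float,
               drop_other_p: float, rng: random.Random) -> Optional[str]:
    """Build the conditioning text for one sample (simplified sketch)."""
    description = sample.get("description")
    # With probability 1 - merge_text_p, use the description unchanged.
    if rng.random() >= merge_text_p:
        return description
    pairs = []
    for name in METADATA_FIELDS:
        value = sample.get(name)
        if value is None or value == "" or value == []:
            continue
        # Assumed semantics: drop each metadata field with probability drop_other_p.
        if rng.random() < drop_other_p:
            continue
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        pairs.append(f"{name}: {value}")
    rng.shuffle(pairs)
    metadata_text = ". ".join(pairs)
    # Assumed semantics: drop the original description with probability drop_desc_p.
    if rng.random() < drop_desc_p:
        description = None
    parts = [description.rstrip(".") if description else None, metadata_text]
    parts = [p for p in parts if p]
    return ". ".join(parts) if parts else None


def word_dropout(text: str, p: float, rng: random.Random) -> str:
    """Drop each space-separated word independently with probability p."""
    return " ".join(w for w in text.split(" ") if rng.random() >= p)


rng = random.Random(0)
sample = {"description": "Short snare hit.", "genre": "percussion",
          "bpm": 120, "moods": ["dry", "punchy"]}
print(merge_text(sample, merge_text_p=1.0, drop_desc_p=0.0, drop_other_p=0.0, rng=rng))
```

Under these assumptions, a model trained with merging enabled sees prompts like "description. genre: ... bpm: ..." in random field order, which is one reason the wording and formatting of generation prompts can matter.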

Check out this project for a working example of fine-tuning: https://github.com/sakemin/cog-musicgen-fine-tuner