chavinlo / musicgen_trainer

simple trainer for musicgen/audiocraft
GNU Affero General Public License v3.0

full code for usage after finetuning? #4

[Open] karen-pal opened this issue 1 year ago

karen-pal commented 1 year ago

Hello, I was able to finetune a model following your instructions. Thank you for your repo. I'm currently stuck trying to use the output model.

This is what I've got so far:

from audiocraft.models import MusicGen
import torch

# Using small model, better results would be obtained with `medium` or `large`.
model = MusicGen.get_pretrained('small')

model.lm.load_state_dict(torch.load('models/lm_final.pt'))

When I run this I get something that looks like a success message:

<All keys matched successfully>

but when I try to generate:

from audiocraft.utils.notebook import display_audio

output = model.generate(
    descriptions=[
        PROMPT_IN_TRAIN_DATASET_1,
        PROMPT_IN_TRAIN_DATASET_2,
    ],
    progress=True
)
display_audio(output, sample_rate=32000)
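
In case it helps, the generated batch can also be written to disk with audiocraft's audio_write helper; a sketch (the file stem is a placeholder):

from audiocraft.data.audio import audio_write

# `output` is a (batch, channels, samples) tensor at the model's sample rate
for idx, one_wav in enumerate(output):
    # strategy="loudness" normalizes loudness before writing <stem>.wav
    audio_write(f'finetuned_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")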

I don't get anything similar to the training dataset. Can anyone help me out? Am I doing something wrong? I have the feeling I'm not loading my local finetuned model correctly.

Thanks

chavinlo commented 1 year ago

Currently only overfit works. I was only able to test it on overfit twice at 3am before I pushed most of the code here, so I assumed it was going to work on longer datasets. Although I think the main reason is data: [image]

chavinlo commented 1 year ago

Note that this was with 2h of audio

FeelTheFonk commented 1 year ago

Hey there, was your dataset well prepared? The sounds must be correctly cut into 35-second parts, as mono 16-bit .wav files at 32,000 Hz. I was able to fine-tune from the small model with my own music and had relatively good results (hyperparameters of my training: lr 0.0000001, 10 epochs, use_wandb 1).
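
For anyone unsure how to do that preparation, a rough sketch with torchaudio following the numbers above (35-second chunks, mono, 16-bit PCM, 32 kHz); the file paths are placeholders:

import torchaudio
import torchaudio.functional as AF

TARGET_SR = 32000      # 32,000 Hz
CHUNK_SECONDS = 35

wav, sr = torchaudio.load('input_song.wav')   # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)           # downmix to mono
if sr != TARGET_SR:
    wav = AF.resample(wav, sr, TARGET_SR)     # resample to 32 kHz

chunk = TARGET_SR * CHUNK_SECONDS
for i in range(0, wav.shape[1] - chunk + 1, chunk):   # drops the short tail
    torchaudio.save(f'dataset/input_song_{i // chunk}.wav',
                    wav[:, i:i + chunk], TARGET_SR,
                    encoding='PCM_S', bits_per_sample=16)   # 16-bit signed PCM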

chavinlo commented 1 year ago

> Hey there, was your dataset well prepared? The sounds must be correctly cut into 35-second parts, as mono 16-bit .wav files at 32,000 Hz. I was able to fine-tune from the small model with my own music and had relatively good results (hyperparameters of my training: lr 0.0000001, 10 epochs, use_wandb 1).

Mind linking the wandb logs?

FeelTheFonk commented 1 year ago

> > Hey there, was your dataset well prepared? The sounds must be correctly cut into 35-second parts, as mono 16-bit .wav files at 32,000 Hz. I was able to fine-tune from the small model with my own music and had relatively good results (hyperparameters of my training: lr 0.0000001, 10 epochs, use_wandb 1).

> Mind linking the wandb logs?

I'm sorry, I wrote the argument but I did not even use it for this training. I'll share logs for the next one.

jidanhuang commented 1 year ago

Recently I tried to train musicgen on a TTS task and failed. I found that if I used a small dataset, val_loss would plateau at a still-high minimum value. If I used a larger dataset, about 500h, val_loss and train_loss would drop at a very low rate.

karen-pal commented 1 year ago

I tried again with a small but more compact dataset (many examples with the same annotation) and was able to train it successfully by overfitting: I trained it for 30 epochs, all other hyperparameters being the default. Of course it resulted in a complete collapse of the model, though: afterwards it was unable to follow any other prompt.

chavinlo commented 1 year ago

@jidanhuang @karen-pal I think the solution could be to just use the original shape (1, 4, 1500) rather than the probs (1, 4, 1500, 2048). I could try later but I don't have time right now.
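
For context on those shapes: EnCodec tokenizes the audio into 4 codebooks of ~1500 frames, so the targets are integer codes of shape (1, 4, 1500), while the LM outputs logits over a 2048-entry vocabulary, (1, 4, 1500, 2048). A toy sketch of the cross-entropy between the two (illustrative only, not this repo's exact training code):

import torch
import torch.nn.functional as F

B, K, T, CARD = 1, 4, 1500, 2048
logits = torch.randn(B, K, T, CARD)        # LM output: the "probs" shape
codes = torch.randint(0, CARD, (B, K, T))  # EnCodec tokens: the "original shape"

# cross_entropy expects (N, C) logits against (N,) integer targets,
# so flatten batch/codebook/time and keep the vocab axis last
loss = F.cross_entropy(logits.reshape(-1, CARD), codes.reshape(-1))
print(loss.item())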

jidanhuang commented 1 year ago

Sorry, but I don't understand which part of the code should be changed to use the original shape.

jbmaxwell commented 1 year ago

> Currently only overfit works. I was only able to test it on overfit twice at 3am before I pushed most of the code here, so I assumed it was going to work on longer datasets. Although I think the main reason is data: [image]

How does the model behave with overfitting? Is there some obvious failure mode? Like, does everything sound similar, or does output just significantly degrade in audio quality? I guess I'm just curious what you mean when you say that overfit "works"...?

karen-pal commented 1 year ago

> > Currently only overfit works. I was only able to test it on overfit twice at 3am before I pushed most of the code here, so I assumed it was going to work on longer datasets. Although I think the main reason is data: [image]

> How does the model behave with overfitting? Is there some obvious failure mode? Like, does everything sound similar, or does output just significantly degrade in audio quality? I guess I'm just curious what you mean when you say that overfit "works"...?

I also overfitted the model... The loss got very low and the model couldn't make anything but the fine-tuning dataset... It stopped being able to make any other sounds, no more lofi beats or techno, nothing! I trained it on my voice to see if I could add background music and so on, but it became impossible for the model to produce any other type of sound or modify a sound according to an instruction. You can hear the result here: https://www.instagram.com/reel/CuDvXSetzKX/

jbmaxwell commented 1 year ago

Ah, okay, so it loses any capacity for generalization. Makes sense. Just out of curiosity, what learning rate did you use? Also, did you use a learning rate scheduler, and if so, which one?

EDIT: Sorry, yes, I see you used the default params.

Liujingxiu23 commented 1 year ago

@karen-pal Hi, I want to train the model too and want some help. When you use "lr 0.0000001", does the model work? That lr is really small. Does the model really update the parameters on your dataset, or does it actually keep the state of the pretrained model without updating?

And two weeks ago you said you "overfitted the model." With lr 0.0000001, your model overfitted?

How many hours of training data did you use?
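
One way to check the lr question empirically is to diff the finetuned checkpoint against the stock weights and see whether anything actually moved; a rough sketch ('models/lm_final.pt' is the trainer output path from above):

import torch
from audiocraft.models import MusicGen

base = MusicGen.get_pretrained('small').lm.state_dict()
tuned = torch.load('models/lm_final.pt', map_location='cpu')

# largest absolute weight change across shared parameters;
# a value near 0 means the finetune barely touched the pretrained model
max_delta = max((tuned[k].float().cpu() - base[k].float().cpu()).abs().max().item()
                for k in base if k in tuned)
print(f'max weight change: {max_delta:.3e}')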

Liujingxiu23 commented 1 year ago

> Recently I tried to train musicgen on a TTS task and failed. I found that if I used a small dataset, val_loss would plateau at a still-high minimum value. If I used a larger dataset, about 500h, val_loss and train_loss would drop at a very low rate.

@jidanhuang Hi, how did your TTS training with the 500h dataset go? Did the model converge? Can clear audio be synthesized?

chavinlo commented 1 year ago

@Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

Liujingxiu23 commented 1 year ago

> @Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

Thank you for your reply! I will update my training status in a few days.

Liujingxiu23 commented 1 year ago

I used https://github.com/neverix/musicgen_trainer for training: 35 hours of music data to finetune the pretrained small model, on 7 GPUs with batchsize=10 per GPU. The training loss keeps declining, down to about 3.5, but the valid loss keeps rising, to 6.0+; the generated waves seem good.

I'm wondering whether the valid loss makes sense, since we only use unconditional training mode? Or does the training make sense if the valid loss keeps getting larger and larger? @chavinlo

karen-pal commented 11 months ago

> @Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

I'm also going to be trying this fork.

I already trained a first model with only 5 min of transcribed speech to text and the results are promising... I was able to add spoken text with my voice and also add some musicality. I'll keep you updated with my next results!

jidanhuang commented 11 months ago

> > @Liujingxiu23 @karen-pal @jbmaxwell There's a fork of this repo that works: https://github.com/neverix/musicgen_trainer

> I'm also going to be trying this fork.

> I already trained a first model with only 5 min of transcribed speech to text and the results are promising... I was able to add spoken text with my voice and also add some musicality. I'll keep you updated with my next results!

This fork says that removing the gradient scaler, increasing the batch size, and only training on conditional samples makes training work. So I just removed the gradient scaler and only trained on conditional samples in this chavinlo/musicgen_trainer project with a training batch size of 16. However, it sounds like there's no difference in training effect: I can only reduce the val loss to 3.88 with a 500-hour music training dataset, and the generated music sounds neither good nor bad. I'm not sure if there's anything wrong with my code; maybe I will try the official training code and report my results.
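
In code, that change amounts to something like the following training step (an illustrative sketch, not this repo's or the fork's actual code; the model and dataloader interfaces here are hypothetical):

import torch.nn.functional as F

def train_conditional_only(model, dataloader, optimizer):
    # plain fp32 steps: no torch.cuda.amp autocast or GradScaler
    for text, codes in dataloader:
        if not text:      # skip unconditional (empty-prompt) samples;
            continue      # train on conditional examples only
        optimizer.zero_grad()
        logits = model(codes, text)  # hypothetical forward: (B, K, T, card)
        loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                               codes.reshape(-1))
        loss.backward()
        optimizer.step()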

chavinlo commented 11 months ago

Yeah, I heavily suggest using the official repository, as it now has training code. I don't plan on updating this repo right now either; too busy on other things.