facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

grad_norm and grad_scale when continuing training of MusicGen #212

Closed Liujingxiu23 closed 1 year ago

Liujingxiu23 commented 1 year ago

I am continuing training from the pretrained MusicGen small model on my own dataset, using AdamW with lr=1e-5, ema=true, batch_size=64, and all other parameters at their default settings. Both train_loss_ce and valid_loss_ce decline (train_loss_ce went from 3.51968 to 3.419 after about 12 hours of training), and the generated audio sounds OK. But grad_norm and grad_scale look like the values below. What should I do to improve the training? Set the lr even smaller?

  "grad_norm": Infinity,
  "grad_norm": NaN,
  "grad_norm": 1.315651535987854,
  "grad_norm": NaN,
  "grad_norm": 1.2862744331359863,
  "grad_norm": NaN,
  "grad_norm": NaN,
  "grad_norm": 1.2918434143066406,
  "grad_norm": NaN,
  "grad_norm": 1.2914090156555176,
  "grad_norm": NaN,
  "grad_norm": NaN,
  "grad_norm": NaN,
  "grad_norm": 1.3095765113830566,
  "grad_norm": NaN,
  "grad_norm": 1.3084871768951416,

  "grad_scale": 32784.3828125,
  "grad_scale": 36601.85546875,
  "grad_scale": 61685.76171875,
  "grad_scale": 94142.4609375,
  "grad_scale": 94765.0546875,
  "grad_scale": 110657.5390625,
  "grad_scale": 69697.5390625,
  "grad_scale": 85524.4765625,
  "grad_scale": 70877.1875,
  "grad_scale": 125730.8125,
  "grad_scale": 131858.4375,
  "grad_scale": 108855.296875,
  "grad_scale": 75399.171875,
  "grad_scale": 82542.59375,
  "grad_scale": 69664.765625,
  "grad_scale": 51380.22265625,
cvillela commented 1 year ago

Hello @Liujingxiu23! Why did you choose to use AdamW instead of DAdam, which is the default? Any particular reason?

Liujingxiu23 commented 1 year ago

@cvillela When the training code had not yet been shared, I used my own training code, which uses AdamW; since my own code worked, I kept AdamW. Have you trained the MusicGen model? Is everything going well, especially grad_norm and grad_scale?

cvillela commented 1 year ago

@Liujingxiu23 thanks for the reply. I am running the fine-tuning over ~55 hours of audio; my valid_ce decreased from 3.1 -> 3.04 and train_ce from 3.25 -> 3.15 (not what I expected; maybe you had a larger dataset?). The generated wavs sound OK, but with lots of audio artifacts. I am using the small model, a batch size of 16, the DAdam optimizer, ema=True, and default parameters otherwise.

These are my current grad_scale values (first 30): 65568.768, 131137.536, 145162.24, 260898.816, 284164.096, 262406.144, 282460.16, 277217.28, 238354.432, 155254.784, 287965.184, 268435.456, 261685.248, 134676.48, 244580.352, 148635.648, 262537.216, 263847.936, 273874.944, 134021.12, 267255.808, 169607.168, 133824.512, 220856.32, 264896.512, 267780.096, 263192.576, 264372.224, 272367.616, 151977.984

and grad_norm (first 30): 4.763715848565101, 4.717671585798263, inf, 2.393838109135628, inf, inf, inf, inf, nan, 2.404133801817894, inf, inf, nan, 2.427801131606102, nan, 2.428246664404869, inf, inf, nan, 2.452121812582016, inf, nan, nan, 2.476549625754356, inf, ..., inf, inf, inf, nan

The grad_norm values especially look very weird, exploding and vanishing. Do you have any suspicions about what it could be?

Liujingxiu23 commented 1 year ago

@cvillela My dataset is about 400 hours. It seems we have the same gradient problem even though the detailed training parameters differ. I cannot share the images shown in TensorBoard, but the grad_norm seems to shrink at first and then grow (apart from the nan and inf values), just like yours. I have no idea why the gradients are so weird and don't know how to fix it... Maybe max_norm should be adjusted?

cvillela commented 1 year ago

@Liujingxiu23 Thank you for sharing. My grad_norm is behaving exactly like yours; I don't know if that is expected. I refrained from modifying max_norm because of the nature of DAdam and EMA, but it may be a solution!

tanggang1997 commented 1 year ago

Hello, may I ask why the generated audio sounds loud after my training, and how your data is labeled?

Liujingxiu23 commented 1 year ago

@tanggang1997 My data is not labeled, and not processed at all.

tanggang1997 commented 1 year ago

Are you fine-tuning or re-training? What's the amount of data?

Liujingxiu23 commented 1 year ago

Are you fine-tuning or re-training? What's the amount of data?

Fine-tuning, with 400 hours of data.

Liujingxiu23 commented 1 year ago

@cvillela Using float32 (autocast: false) can solve the gradient problem.
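For reference, here is my understanding of why autocast causes the pattern above, as a minimal generic PyTorch AMP sketch (not the audiocraft trainer; the model, shapes, and hyperparameters are made up). When the scaled fp16 gradients overflow, the unscaled grad_norm comes out inf/nan, GradScaler silently skips that optimizer step and halves the scale; after enough clean steps it doubles the scale again, which is why grad_scale keeps oscillating while the loss can still decrease:

    import torch

    model = torch.nn.Linear(16, 16).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    scaler = torch.cuda.amp.GradScaler()  # defaults: halve on overflow, double after 2000 clean steps

    for step in range(100):
        x = torch.randn(8, 16, device="cuda")
        with torch.cuda.amp.autocast():
            loss = model(x).pow(2).mean()
        opt.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()
        scaler.unscale_(opt)  # gradients back in fp32; inf/nan here on overflow
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(opt)      # skipped automatically if inf/nan gradients were found
        scaler.update()       # shrinks or grows the scale accordingly
        print(f"grad_norm={grad_norm.item():.4f} grad_scale={scaler.get_scale():.1f}")

With autocast: false everything stays in fp32, so there is no scaler, no overflow, and no skipped steps.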

tanggang1997 commented 1 year ago

Using float32 (autocast: false) can solve the gradient problem.

May I ask how you exported the audio compression model, or how you used the original pretrained model at inference time? I see that get_xp_from_sig can export the language model, but not the audio (compression) model.

Liujingxiu23 commented 1 year ago

https://github.com/facebookresearch/audiocraft/blob/main/docs/ENCODEC.md has code for exporting the compression model

    from audiocraft.utils import export
    from audiocraft import train

    xp = train.main.get_xp_from_sig('SIG')
    export.export_encodec(
        xp.folder / 'checkpoint.th',
        '/checkpoints/my_audio_lm/compression_state_dict.bin')
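If you just want the original pretrained compression model rather than a retrained one, I believe the MusicGen docs (docs/MUSICGEN.md) export it directly from the pretrained checkpoint with something like the following; worth double-checking the function name against your audiocraft version:

    from audiocraft.utils import export

    # Export the stock pretrained 32 kHz EnCodec instead of a retrained one.
    export.export_pretrained_compression_model(
        'facebook/encodec_32khz',
        '/checkpoints/my_audio_lm/compression_state_dict.bin')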

carlthome commented 1 year ago

Also seeing grad_norm=INF in training (though not finetuning, but from random weights) and curious because both training cross-entropy and perplexity go down despite that, which seems strange.

Liujingxiu23 commented 1 year ago

Also seeing grad_norm=INF in training (though not finetuning, but from random weights) and curious because both training cross-entropy and perplexity go down despite that, which seems strange.

Does your training from random weights work well? How many hours of music data do you use?

cvillela commented 1 year ago

@cvillela When the training code had not yet been shared, I used my own training code, which uses AdamW; since my own code worked, I kept AdamW. Have you trained the MusicGen model? Is everything going well, especially grad_norm and grad_scale?

@cvillela Using float32 (autocast: false) can solve the gradient problem.

Awesome! Thank you very much.

On another note, I read in the paper that EMA only worked for fine-tuning the small model. When using the larger models I am using AdamW, but the model does not converge as fast. Have you experimented with EMA when fine-tuning the medium/large models?

Liujingxiu23 commented 1 year ago

@cvillela I am also wondering whether to use EMA. Does the paper really claim that "EMA only worked for fine-tuning the small model"? I may have missed that part.

I didn't understand the working principle of EMA before; I set ema=true and updates=10, and I think this value may be too small. Under this setting my training was not very fast either: the valid loss first decreased and then increased, while the train loss kept declining.

I have only fine-tuned the small and medium models.
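For what it's worth, my current understanding of the scheme, as a minimal sketch (not audiocraft's actual EMA implementation; the decay value is illustrative): a shadow copy of the weights is blended toward the live weights once every `updates` optimizer steps, so a larger `updates` means a coarser, less frequently refreshed average:

    import copy
    import torch

    class EMA:
        def __init__(self, model: torch.nn.Module, decay: float = 0.99, updates: int = 10):
            self.model = model
            self.shadow = copy.deepcopy(model).eval()  # averaged weights, used for eval
            for p in self.shadow.parameters():
                p.requires_grad_(False)
            self.decay, self.updates, self.steps = decay, updates, 0

        def update(self):
            # Call once per optimizer step; only blends every `updates` steps.
            self.steps += 1
            if self.steps % self.updates != 0:
                return
            with torch.no_grad():
                for s, p in zip(self.shadow.parameters(), self.model.parameters()):
                    s.mul_(self.decay).add_(p.detach(), alpha=1 - self.decay)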

carlthome commented 1 year ago

Does your training from random weights work well? How many hours of music data do you use?

Training loss went down at least, but still curious about the gradient norm. Only 500 hours so far but intending to increase that as I get more familiar with AudioCraft.

Liujingxiu23 commented 1 year ago

@carlthome You can use float32, but training is slower and the batch size has to be smaller. When I use float32 the grad_norm looks normal, but the trend of the values is the same as with float16 training: lower at first and then increasing.

carlthome commented 1 year ago

Thanks for the tip @Liujingxiu23!

cvillela commented 1 year ago

@cvillela I am also wondering whether to use EMA. Does the paper really claim that "EMA only worked for fine-tuning the small model"? I may have missed that part.

I didn't understand the working principle of EMA before; I set ema=true and updates=10, and I think this value may be too small. Under this setting my training was not very fast either: the valid loss first decreased and then increased, while the train loss kept declining.

I have only fine-tuned the small and medium models.

I am very sorry, I actually mixed things up. I meant to say that the D-Adaptation optimizer (DAdam) only showed significant improvement on the small model, not EMA. From the paper:

We further rely on D-Adaptation based automatic step-sizes [Defazio and Mishchenko, 2023] for the 300M model as it improves model convergence but showed no gain for the bigger models.
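In practice, switching to it looks roughly like the sketch below, using the standalone dadaptation package (which implements this optimizer; the betas and weight_decay values here are illustrative, not audiocraft's exact config):

    import torch
    from dadaptation import DAdaptAdam

    model = torch.nn.Linear(16, 16)
    # lr is left at 1.0: D-Adaptation estimates the actual step size itself.
    opt = DAdaptAdam(model.parameters(), lr=1.0, betas=(0.9, 0.95), weight_decay=0.1)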

cvillela commented 1 year ago

Also @Liujingxiu23, are you normalizing samples before feeding them into training? I reckon the AudioDataset object does normalize samples, but from experimenting with inference, and also regarding Issue #236, maybe samples should also be normalized before feeding them into training?

Liujingxiu23 commented 1 year ago

@cvillela I am also wondering whether to use EMA. Does the paper really claim that "EMA only worked for fine-tuning the small model"? I may have missed that part. I didn't understand the working principle of EMA before; I set ema=true and updates=10, and I think this value may be too small. Under this setting my training was not very fast either: the valid loss first decreased and then increased, while the train loss kept declining. I have only fine-tuned the small and medium models.

I am very sorry, I actually mixed things up. I meant to say that the D-Adaptation optimizer (DAdam) only showed significant improvement on the small model, not EMA. From the paper:

We further rely on D-Adaptation based automatic step-sizes [Defazio and Mishchenko, 2023] for the 300M model as it improves model convergence but showed no gain for the bigger models.

I used D-Adaptation to fine-tune the medium model, but it failed: the loss did not decline, so I switched to AdamW. Does your loss decline using D-Adaptation?

Liujingxiu23 commented 1 year ago

Also @Liujingxiu23, are you normalizing samples before feeding them into training? I reckon the AudioDataset object does normalize samples, but from experimenting with inference, and also regarding Issue #236, maybe samples should also be normalized before feeding them into training?

I did not normalize samples. Most of the audio I use is in MP3 format, and the files seem to have the same volume. But normalization may be helpful.
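If anyone wants to try it, a hypothetical pre-processing sketch that peak-normalizes each file before building the training manifest (the file paths are placeholders):

    import torchaudio

    wav, sr = torchaudio.load("input.mp3")
    wav = wav / wav.abs().max().clamp(min=1e-8)  # peak-normalize to [-1, 1]
    torchaudio.save("normalized.wav", wav, sr)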

cvillela commented 1 year ago

@Liujingxiu23

The loss does go down (and the model performs better) using D-Adaptation on the small model, as the paper states. For the medium/large models I did not use it. By the way, my large model is overfitting with 350 hours of audio, using AdamW with lr=1e-4 and FSDP.