NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Regularizing during fine-tuning? #2286

Closed · jsilbergDS closed this issue 3 years ago

jsilbergDS commented 3 years ago

Hello! Thank you again for the amazing work; I really appreciate it. I am fine-tuning a pre-trained QuartzNet model and wanted to ask what you would recommend for regularization. I have updated the dropout from the pre-trained QuartzNet default of 0.0 to 0.2 using:

```python
import copy
from omegaconf import OmegaConf

# Raise dropout on every encoder block from the pre-trained default of 0.0 to 0.2.
cfg = copy.deepcopy(quartznet.cfg)
print(len(cfg['encoder']['jasper']))
for i in range(0, 18):
    cfg['encoder']['jasper'][i]['dropout'] = 0.2
print(OmegaConf.to_yaml(cfg))
quartznet2 = quartznet.from_config_dict(cfg)
```

But this seems to just cause the loss to explode?

Thanks!

titu1994 commented 3 years ago

You should not apply dropout to the first or final encoder layer. Also, 0.2 is quite high; it is only needed when training for an elongated period. For short fine-tuning runs, a small value of 0.1 is sufficient. Weight decay should also be 0.001 or lower when using dropout, to prevent over-regularization.
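A minimal sketch of that adjustment, reusing the cfg loop from the question above (the dictionary layout and the from_config_dict rebuild step mirror the original snippet; whether that is the best way to apply the new config is an assumption here, not something confirmed in this thread):

```python
import copy
from omegaconf import OmegaConf

# Leave the first and last encoder blocks at dropout 0.0 and use the milder 0.1
# elsewhere, as suggested for short fine-tuning runs.
cfg = copy.deepcopy(quartznet.cfg)
num_blocks = len(cfg['encoder']['jasper'])
for i in range(1, num_blocks - 1):
    cfg['encoder']['jasper'][i]['dropout'] = 0.1

print(OmegaConf.to_yaml(cfg))
quartznet2 = quartznet.from_config_dict(cfg)
```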

jsilbergDS commented 3 years ago

Thanks! In that case, it probably makes sense for me to leave dropout at 0 and just increase the weight decay from the 0.0001 default?

titu1994 commented 3 years ago

You could refer to the QuartzNet paper; generally we recommend weight decay in the 0.0001 to 0.001 range, tending toward the higher side.
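For reference, a hedged sketch of setting the weight decay through ModelPT.setup_optimization. The novograd optimizer, learning rate, and betas below are illustrative values in the spirit of the published QuartzNet recipe, not settings given in this thread:

```python
from omegaconf import OmegaConf

# Illustrative optimizer config for fine-tuning with dropout left at 0.0:
# weight decay at the high end of the recommended 1e-4 to 1e-3 range.
optim_cfg = OmegaConf.create({
    'name': 'novograd',     # optimizer family used in the QuartzNet recipe
    'lr': 1e-3,             # assumed fine-tuning learning rate, not from the thread
    'betas': [0.8, 0.5],    # assumed, following the published recipe
    'weight_decay': 1e-3,   # high end of the range recommended above
})

quartznet.setup_optimization(optim_config=optim_cfg)
```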

jsilbergDS commented 3 years ago

Thank you!