DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Recommended number of fine-tuning steps to avoid overfitting #90

Closed Ca-ressemble-a-du-fake closed 4 months ago

Ca-ressemble-a-du-fake commented 1 year ago

Hi,

Given a target speaker dataset, roughly how many fine-tuning steps should be run?

NeMo "recommends 1000 steps per minute of audio for fastpitch and 500 steps per minute of audio for HiFi-GAN."

Can the same general recommendation also apply to Toucan TTS when fine-tuning the Meta pretrained model on a given dataset? The goal is to find the sweet spot before overfitting appears.

Any advice appreciated,

Thanks in advance

Flux9665 commented 1 year ago

Yes, those numbers should work, provided the learning rate is dropped to 1/10 of its usual value. I would, however, never fine-tune the vocoder: it should already work on unseen speakers just as well as on seen speakers, and fine-tuning a GAN is very tricky.

Also, for the FastPitch fine-tuning steps, I would use no more than 20k steps, regardless of how many minutes of audio are available. Beyond that point, it would probably be better to train from scratch than to fine-tune.
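Combining the two suggestions above (NeMo's steps-per-minute heuristic and the 20k-step cap), a minimal sketch might look like this. The function name and signature are hypothetical illustrations, not part of the Toucan codebase:

```python
def recommended_finetuning_steps(minutes_of_audio: float,
                                 steps_per_minute: int = 1000,
                                 max_steps: int = 20_000) -> int:
    """Hypothetical helper: NeMo-style 1000 steps per minute of audio,
    capped at 20k steps as suggested in this thread."""
    return min(int(minutes_of_audio * steps_per_minute), max_steps)

print(recommended_finetuning_steps(5))    # small dataset -> 5000
print(recommended_finetuning_steps(60))   # large dataset, capped -> 20000
```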

Ca-ressemble-a-du-fake commented 1 year ago

Ok, thank you. Should I change the learning rate in the fine-tuning script, or does it already take the 1/10 factor into account?

Flux9665 commented 1 year ago

The current version does not change the learning rate in the fine-tuning script. This is because the ideal learning rate for your fine-tuning data depends heavily on the number of datapoints used for fine-tuning: with few datapoints, a lower fine-tuning learning rate is needed, but with many datapoints, the original learning rate can be used without problems.
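The data-dependent scaling described above could be sketched as follows. This is only an illustration of the heuristic, not code from the repository; the dataset-size threshold is an assumption you would tune yourself:

```python
def finetuning_learning_rate(base_lr: float,
                             num_datapoints: int,
                             full_dataset_size: int = 10_000,
                             min_factor: float = 0.1) -> float:
    """Hypothetical heuristic: scale the learning rate between 1/10 of
    the base value (few datapoints) and the full base value (many
    datapoints), as described in the comment above."""
    factor = min(1.0, num_datapoints / full_dataset_size)
    return base_lr * max(min_factor, factor)

print(finetuning_learning_rate(1e-3, 100))     # few datapoints -> 1/10 of base
print(finetuning_learning_rate(1e-3, 10_000))  # many datapoints -> full base lr
```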