NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

[FastPitch1.1/PyTorch] How to avoid vanishing gradients in a compound loss function? #1176

Closed JohnHerry closed 2 years ago

JohnHerry commented 2 years ago

Related to FastPitch1.1/PyTorch

Describe the bug
Training breaks with "loss is NaN". The FastPitch loss is composed of a mel loss, duration loss, pitch loss, energy loss, and attention loss; I printed the individual components, and it is the mel_loss that is NaN while the gradient is zero.

To Reproduce
This happens occasionally, somewhere between epochs 100 and 400, so it is not caused by bad input.

Expected behavior

My question is more general: how should training be designed to avoid vanishing gradients when using a compound loss function like FastPitch's? A problem in any one of the sub-losses can break the whole training run. And when that happens, how do I check which part is the problem and find the reason?
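
To make the question concrete, the kind of check I have in mind is a per-component guard like the sketch below (the dictionary keys and the helper name are placeholders, not the actual FastPitch code):

```python
import torch

def check_and_sum_losses(loss_dict, global_step):
    """Sum sub-losses, reporting the first one that goes non-finite.

    loss_dict maps placeholder names like "mel", "duration", "pitch",
    "energy", "attn" to scalar loss tensors.
    """
    total = 0.0
    for name, value in loss_dict.items():
        if not torch.isfinite(value):
            # Knowing which component broke, and at which step, usually
            # separates a data problem from an optimization blow-up.
            raise FloatingPointError(
                f"step {global_step}: {name} loss is {value.item()}")
        total = total + value
    return total

# During debugging, anomaly detection reports the backward op that produced
# the first NaN/Inf (it slows training, so turn it off afterwards):
# torch.autograd.set_detect_anomaly(True)
```

Skipping the offending batch instead of raising would keep the run alive, but it also hides the root cause.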

Environment
Please provide at least:

alancucki commented 2 years ago

Hi @JohnHerry ,

The problem is not with LJSpeech but with custom data, right? In that case, try lowering the learning rate. A higher LR leads to better models, but training becomes unstable.

JohnHerry commented 2 years ago

> Hi @JohnHerry ,
>
> The problem is not with LJSpeech but with custom data, right? In that case, try lowering the learning rate. A higher LR leads to better models, but training becomes unstable.

Thank you for the help. I tried changing the optimizer from FusedLAMB to FusedNovoGrad, and the training is running OK now. I am not sure whether it will work all the time, though.
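
For reference, the swap itself is just a change of optimizer class, roughly as in this sketch (it assumes NVIDIA Apex is installed; the linear layer stands in for the real model and the hyperparameters are illustrative, not the repository defaults):

```python
import torch
from apex.optimizers import FusedLAMB, FusedNovoGrad

model = torch.nn.Linear(80, 80)  # stand-in for the FastPitch model

# Previous setup (illustrative hyperparameters):
# optimizer = FusedLAMB(model.parameters(), lr=0.1, weight_decay=1e-6)

# Current setup:
optimizer = FusedNovoGrad(model.parameters(), lr=0.01,
                          betas=(0.95, 0.98), weight_decay=1e-6)
```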

alancucki commented 2 years ago

About working all the time: with a higher LR you can get a bit better model, but some runs will fail. If you can afford it, I'd test a couple of different LRs, each with a few random seeds to get a feeling of what is safe.
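
Something like the sweep below is what I have in mind (a rough sketch: the --learning-rate, --seed and --output flag names are assumptions about train.py's interface, and the values are examples rather than recommendations):

```python
import itertools
import subprocess

learning_rates = [0.1, 0.05, 0.02]   # example values, not tuned recommendations
seeds = [1234, 2345, 3456]

for lr, seed in itertools.product(learning_rates, seeds):
    out_dir = f"runs/lr{lr}_seed{seed}"
    cmd = [
        "python", "train.py",
        "--learning-rate", str(lr),
        "--seed", str(seed),
        "--output", out_dir,
        # ...plus the rest of your usual training arguments
    ]
    print("launching:", " ".join(cmd))
    # check=False: a run that dies with NaN should not stop the whole sweep
    subprocess.run(cmd, check=False)
```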

Also, maybe some samples in your data are broken and cause large gradients, triggering the crash.
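
One way to catch that is to log the pre-clipping gradient norm for every batch, roughly like this sketch (the threshold is illustrative, and batch_ids stands for whatever identifies the utterances in the current batch):

```python
import torch

def clip_and_log_grad_norm(model, batch_ids, step, max_norm=1000.0):
    """Clip gradients and flag batches whose pre-clipping norm looks suspicious.

    Call between loss.backward() and optimizer.step(). clip_grad_norm_ returns
    the total norm computed *before* clipping, which is what we want to log.
    """
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if not torch.isfinite(grad_norm) or grad_norm > 0.5 * max_norm:
        print(f"step {step}: grad norm {float(grad_norm)} for batch {batch_ids}")
    return grad_norm
```

Batches that repeatedly show spiking norms are the ones worth inspecting by hand.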

JohnHerry commented 2 years ago

Thank you. The training is still running; I will check the results once it finishes.

JohnHerry commented 2 years ago

> About working all the time: with a higher LR you can get a bit better model, but some runs will fail. If you can afford it, I'd test a couple of different LRs, each with a few random seeds to get a feeling of what is safe.
>
> Also, maybe some samples in your data are broken and cause large gradients, triggering the crash.

I think broken data should crash the training in the first epoch, whereas my training broke only after hundreds of epochs. We used pyworld.dio instead of librosa.pyin as the pitch estimator; that is the difference in our data preprocessing.
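
To rule out broken samples, I can add a quick sanity check over the extracted pitch, roughly like the sketch below (the hop length is illustrative and has to match the mel frame shift; note that librosa.pyin marks unvoiced frames with NaN while dio returns 0, so the two need different handling downstream):

```python
import numpy as np
import librosa
import pyworld

def extract_and_check_pitch(wav_path, hop_length=256):
    """Extract F0 with pyworld.dio (+ stonemask refinement) and flag oddities."""
    audio, sr = librosa.load(wav_path, sr=None)
    audio = audio.astype(np.float64)               # pyworld expects float64
    frame_period_ms = 1000.0 * hop_length / sr     # align frames with the mels
    f0, t = pyworld.dio(audio, sr, frame_period=frame_period_ms)
    f0 = pyworld.stonemask(audio, f0, t, sr)       # refine the raw dio estimate

    if not np.all(np.isfinite(f0)):
        print(f"{wav_path}: non-finite pitch values")
    if np.all(f0 == 0.0):
        print(f"{wav_path}: no voiced frames detected")
    return f0
```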

My attempt with the FusedNovoGrad optimizer turned out to be a failure: training converges too slowly. With FusedLAMB I can get reasonable results by around epoch 300, but with FusedNovoGrad the model still synthesizes muffled audio after 4800 epochs.

I will try reducing the learning rate and retrain.