While fine-tuning works as expected, regular training on a dataset other than LJSpeech eventually produces a NaN loss.
The culprit appears to be the following line, which divides by zero if `wav` happens to contain perfect silence:
https://github.com/bshall/hifigan/blob/374a4569eae5437e2c80d27790ff6fede9fc1c46/hifigan/dataset.py#L106
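To illustrate the failure mode, here is a minimal sketch. It assumes the linked line peak-normalizes `wav` by dividing by its maximum absolute value; I'm using NumPy as a stand-in for the repository's PyTorch tensors, and `peak_normalize` is a hypothetical helper, not the repository's code.

```python
import numpy as np


def peak_normalize(wav: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the normalization step: divide by the peak magnitude."""
    peak = np.abs(wav).max()
    # For an all-zero clip, peak is 0 and every sample becomes 0 / 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        return wav / peak


# One second of perfect silence at an assumed 16 kHz sample rate.
silence = np.zeros(16000, dtype=np.float32)
normalized_silence = peak_normalize(silence)

# A clip with any non-zero sample normalizes without problems.
tone = np.array([0.5, -0.25, 0.0])
normalized_tone = peak_normalize(tone)

print(np.isnan(normalized_silence).all())
print(normalized_tone)
```

Under these assumptions every sample of the silent clip comes out as NaN, and since NaN propagates through later arithmetic, one such clip in a batch should be enough to poison the loss.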
I'm not sure what the best solution would be. As a quick fix, I simply clipped the divisor so it can't reach zero: