NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Gradient overflow with Mixed Precision Training #63

Open MinHyung-Kang opened 4 years ago

MinHyung-Kang commented 4 years ago

Hello! I am trying to train on the train-clean-100 subset of LibriTTS. I have resampled all of the audio to 22 kHz (example), and I am reusing the filelists provided in the original repository. I was able to run training successfully for about 15,000 iterations on a T4 GPU (g4dn.xlarge instance on AWS).
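
Roughly, the resampling step looks like this (a minimal sketch, not my exact script; the paths, the 22050 Hz target rate, and the librosa/soundfile dependencies are assumptions):

```python
# Resample LibriTTS wavs (originally 24 kHz) down to 22050 Hz.
# Paths are placeholders; librosa and soundfile are assumed to be installed.
import glob
import os

import librosa
import soundfile as sf

SRC_DIR = "LibriTTS/train-clean-100"      # original wavs (placeholder path)
DST_DIR = "LibriTTS_22k/train-clean-100"  # resampled output (placeholder path)
TARGET_SR = 22050                         # sampling rate expected by the hparams

for src_path in glob.glob(os.path.join(SRC_DIR, "**", "*.wav"), recursive=True):
    # librosa resamples on load when an explicit sr is given
    audio, _ = librosa.load(src_path, sr=TARGET_SR)
    dst_path = os.path.join(DST_DIR, os.path.relpath(src_path, SRC_DIR))
    os.makedirs(os.path.dirname(dst_path), exist_ok=True)
    sf.write(dst_path, audio, TARGET_SR)
```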

When I turned on mixed precision training by setting fp16_run=True on the same instance, it runs for a few iterations and then hits gradient overflows. It keeps halving the loss scale until it reaches roughly 1e-100 (at which point I stopped it). The loss is NaN rather than inf, which, according to an Apex GitHub issue, I should not be observing.
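
For context on why halving the scale never helps: if the unscaled loss is already NaN, Apex's dynamic loss scaler cannot recover by shrinking the scale, since the overflow is not caused by the scale factor. Below is an illustrative sketch of the usual Apex training-step pattern with an explicit finiteness check (not the repo's actual train.py; it assumes the model and optimizer were already wrapped with amp.initialize, and the parse_batch/criterion names follow a Tacotron 2-style training loop):

```python
# Debugging sketch (illustrative, not the repo's code): check whether the loss
# is already NaN/inf before loss scaling ever gets involved.
import math

from apex import amp


def training_step(model, criterion, optimizer, batch):
    x, y = model.parse_batch(batch)
    y_pred = model(x)
    loss = criterion(y_pred, y)

    # If this fires, the problem is upstream of loss scaling
    # (e.g. a bad sample, or an fp16 overflow/underflow inside the model),
    # so repeatedly halving the loss scale will never fix it.
    if not math.isfinite(loss.item()):
        raise RuntimeError("Loss is NaN/inf before scaling; skipping the step won't help")

    optimizer.zero_grad()
    # Standard Apex pattern: backward on the scaled loss; on overflow the
    # scaler skips the step and halves the scale automatically.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss.item()
```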

Wondering if anyone has an idea why this might be happening. [screenshot of the training log showing the overflow messages]

I am also wondering how many iterations the uploaded LibriTTS and LJS models were trained for; is that information the team could share?

Thank you in advance!

richardburleigh commented 4 years ago

Try this pull request: #15