Hello!
I am trying to train on train-clean-100 subset of LibriTTS.
I have resampled all of them to 22 kHz (example), and I am reusing the filelists as provided in the original repository. I was able to run training successfully for about 15,000 iterations on a T4 GPU (g4dn.xlarge instance on AWS).
When I turned on mixed-precision training by setting fp16_run=True on the same instance, it runs for a few iterations and then hits gradient overflows. It keeps halving the loss scale until ~1e-100 (at which point I stopped). The loss is NaN rather than Inf, which, according to an Apex GitHub issue, I should not be observing.
Wondering if anyone has an idea why this might be happening.
![image](https://user-images.githubusercontent.com/14953749/82131589-8d97b700-978b-11ea-8cb4-d8cb121a60bf.png)
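For context on why the NaN (rather than Inf) seems suspicious: dynamic loss scaling assumes the overflow comes from gradients exceeding fp16 range, so halving the scale should eventually recover. A rough sketch of the mechanism (a simplified stand-in, not Apex's actual implementation; `dynamic_loss_scale_step` is a hypothetical helper):

```python
import math

def dynamic_loss_scale_step(grads, scale, min_scale=1e-100):
    """Mimic dynamic loss scaling: if any gradient is Inf/NaN,
    skip the optimizer step and halve the scale; otherwise keep it."""
    overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
    if overflow:
        scale = max(scale / 2.0, min_scale)
        return scale, True   # step skipped, scale reduced
    return scale, False      # step taken, scale unchanged

# An Inf gradient is the "healthy" overflow case: a smaller scale
# eventually brings gradients back into fp16 range.
scale, skipped = dynamic_loss_scale_step([float("inf"), 0.5], 2.0**15)

# A NaN, by contrast, stays NaN no matter how small the scale gets,
# so the scaler keeps halving toward 1e-100 without ever recovering --
# consistent with the NaN originating in the forward pass / loss itself.
scale, skipped = dynamic_loss_scale_step([float("nan")], 2.0**15)
```

If this reading is right, the NaN would be produced upstream of the gradient computation (e.g. a log/exp or division in the forward pass under fp16), which loss scaling cannot fix.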
I am also wondering how many iterations the uploaded LibriTTS and LJS models were trained for - is that information the team could share?
Thank you in advance!