Please check whether the GTA mel frames align with the wav length; you can compute the lengths of both and compare them.
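A minimal sketch of such a check, assuming the GTA mels are saved as `.npy` arrays of shape `(n_mels, frames)` and a `hop_length` of 256 (both assumptions; adjust to your config):

```python
import numpy as np
import librosa

HOP_LENGTH = 256  # assumption: must match the preprocessing config

def check_alignment(wav_path, gta_mel_path):
    """Verify that the GTA mel frame count matches the wav length."""
    wav, _ = librosa.load(wav_path, sr=None)
    mel = np.load(gta_mel_path)  # assumed shape: (n_mels, frames)
    expected = len(wav) // HOP_LENGTH + 1
    assert abs(mel.shape[-1] - expected) <= 1, (
        f"{gta_mel_path}: {mel.shape[-1]} frames vs {expected} expected")
```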
@begeekmyfriend
I have checked with https://github.com/begeekmyfriend/tacotron2/issues/4#issuecomment-564927028
All asserts were successful, and the min and max mel values are -4.0 and -0.21728516.
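A quick way to scan them, assuming the GTA mels are `.npy` files under a `gta/` directory (path and layout are assumptions):

```python
import glob
import numpy as np

mels = [np.load(p) for p in glob.glob("gta/*.npy")]
print(min(m.min() for m in mels), max(m.max() for m in mels))
# prints, e.g.: -4.0 -0.21728516
```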
OK, it seems the Tacotron training wasn't good enough, even though the Griffin-Lim output sounded good. Using GTA mels generated from a slightly longer-trained Tacotron (130k steps), the audio generated by WaveRNN is now getting a little better.
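For context, GTA mels come from running the trained Tacotron in teacher-forcing mode over the training set, so WaveRNN trains on the same slightly imperfect mels it will see at synthesis time. A rough sketch of that step; the model call signature and batch layout are placeholder assumptions, not this repo's actual API:

```python
import numpy as np
import torch

def generate_gta(model, train_loader, out_dir="gta"):
    """Run the trained Tacotron with teacher forcing so each output mel
    stays frame-aligned with its ground-truth audio."""
    model.eval()
    with torch.no_grad():
        for texts, mels, utt_ids in train_loader:
            # Conditioning the decoder on the ground-truth mels is what
            # keeps the predicted frames aligned with the target wavs.
            mel_out = model(texts, mels)
            for mel, utt_id in zip(mel_out, utt_ids):
                np.save(f"{out_dir}/{utt_id}.npy", mel.cpu().numpy())
```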
I have some questions regarding your 4-speaker Tacotron:
Thanks
200 epochs (not steps) for T2.
@begeekmyfriend I also have around 40k files (10k per speaker) and trained for 200 epochs. My final training loss is around 0.35, which seems too high compared to NVIDIA's Tacotron2. Is that normal?
You might try training for 48 hours and see.
After training Tacotron2 for 400 epochs, the loss improved to 0.31. After applying more aggressive silence trimming, the loss is now around 0.27. @begeekmyfriend How did you trim your dataset? Could you share the `trim_top_db` you used for your dataset?
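For reference, the trimming step is essentially librosa's energy-based trim; the `top_db` value below is illustrative, not the repo's actual setting (lower values trim more aggressively):

```python
import librosa

wav, sr = librosa.load("sample.wav", sr=22050)
trimmed, _ = librosa.effects.trim(wav, top_db=23)  # illustrative threshold
```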
It does not matter with `trim_top_db`; in fact, preprocessing still clips the mel values either way.
I made further experiments: single-speaker T2 training loss is around 0.11 to 0.15, 2 speakers around 0.18, 3 speakers around 0.23.
@begeekmyfriend I think I have found the cause of the strange artifacts WaveRNN produces. I trained T2 without `--load-mel-from-disk`. In that case the min mel values are around -12, while GTA/WaveRNN uses -4 as the clip/pad value, and this mismatch seems to cause the artifacts. How did you choose -4 as the clip/pad value?
I have also made a pull request https://github.com/begeekmyfriend/tacotron2/pull/27 to fix the error mentioned in https://github.com/begeekmyfriend/tacotron2/issues/24
With the mels clipped at -4, the 4-speaker T2 loss is now around 0.13, and I am now training WaveRNN. Hopefully this solves the artifacts :)
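The clipping itself is a one-liner; a minimal sketch, where the function name is mine and -4 matches WaveRNN's clip/pad value:

```python
import numpy as np

MEL_FLOOR = -4.0  # must match the value WaveRNN clips/pads with

def clip_mel(mel):
    """Clip log-mel values to the floor WaveRNN pads with, so training
    mels, GTA mels, and padding share one dynamic range."""
    return np.clip(mel, MEL_FLOOR, None)
```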
You'd better trim the edges of the corpus audio, since they contain noise.
@begeekmyfriend you mean the trimming?
It just works with these hyperparameters; all we need to do is follow them.
After 200k steps, WaveRNN now sounds OK: wavernn.zip
Hello @begeekmyfriend,
I am trying to train a 4-speaker (2 male and 2 female) WaveRNN model. I have successfully trained Tacotron, and the wav files generated with Griffin-Lim sound good. After that, I generated the GTA files, and I am now training WaveRNN, currently at 250k steps. But the WaveRNN samples sound really strange. I have 2 problems:
I have attached a sample target and generated wav files.
What could be the reason for that? If 250k steps are not enough to generate intelligible audio, why are the target wavs silent? Is that normal?
wavernn.zip