begeekmyfriend / tacotron2

Forked from NVIDIA/tacotron2 and merged with Rayhane-mamah/Tacotron-2
BSD 3-Clause "New" or "Revised" License

WaveRNN generated samples sound strange #26

Closed tugstugi closed 4 years ago

tugstugi commented 4 years ago

Hello @begeekmyfriend,

I am trying to train a 4-speaker (2 male and 2 female) WaveRNN model. I have successfully trained Tacotron, and the wav files generated with Griffin-Lim sound good. After that, I generated GTA files and am now training WaveRNN, currently at 250k steps. But the WaveRNN samples sound really strange. I have 2 problems:

I have attached a sample target and generated wav files.

What could be the reason for that? If 250k steps are not enough to generate intelligible audio, why are the target wavs silent? Is that normal?

wavernn.zip

begeekmyfriend commented 4 years ago

Please check whether the GTA mel frames align with the wav length. You can calculate both lengths and compare them.
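
A minimal sketch of such a length check (assuming a hop length of 256, 22.05 kHz audio, and GTA mels stored as `.npy` with shape `(n_mels, n_frames)`; adjust these to your preprocessing config):

```python
import numpy as np
import librosa

# Assumed values -- adjust to match the hparams used in preprocessing.
SAMPLE_RATE = 22050
HOP_LENGTH = 256

def check_alignment(wav_path, gta_mel_path, tolerance_frames=1):
    """Compare the GTA mel frame count with the frame count implied by the wav length."""
    wav, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    mel = np.load(gta_mel_path)          # assumed shape: (n_mels, n_frames)
    expected = len(wav) // HOP_LENGTH
    actual = mel.shape[-1]
    ok = abs(expected - actual) <= tolerance_frames
    print(f"{wav_path}: wav implies ~{expected} frames, GTA mel has {actual} -> {'OK' if ok else 'MISMATCH'}")
    return ok
```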

tugstugi commented 4 years ago

@begeekmyfriend

I have checked with https://github.com/begeekmyfriend/tacotron2/issues/4#issuecomment-564927028

All asserts are successful, and the min and max values are -4.0 and -0.21728516.

tugstugi commented 4 years ago

OK, it seems the Tacotron training wasn't good enough, even though the Griffin-Lim output sounded good. Using GTA mels generated from a Tacotron trained a little longer (130k steps), the audio generated by WaveRNN is now getting a little better.

I have some questions regarding your 4-speaker Tacotron:

Thanks

begeekmyfriend commented 4 years ago

200 epochs (not steps) for T2.

tugstugi commented 4 years ago

@begeekmyfriend I also have around 40k files (10k per speaker) and trained for 200 epochs. My final training loss is around 0.35, which is too high compared to NVIDIA's Tacotron2. Is that normal?

begeekmyfriend commented 4 years ago

You might try training for 48 hours and see.

tugstugi commented 4 years ago

After training Tacotron2 for 400 epochs, the loss improved to 0.31. After applying more aggressive silence trimming, the loss is now around 0.27. @begeekmyfriend How did you trim your dataset? Could you share the trim_top_db you used for your dataset?
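
For reference, the trimming referred to here would typically be done with `librosa.effects.trim`; a minimal sketch, where `TRIM_TOP_DB` is only an illustrative value, not the setting used in this repo:

```python
import librosa
import soundfile as sf

# TRIM_TOP_DB is an illustrative value, not the repo author's setting;
# lower values trim more aggressively.
TRIM_TOP_DB = 23
SAMPLE_RATE = 22050

def trim_silence(in_path, out_path):
    """Trim leading/trailing audio that is more than TRIM_TOP_DB below the peak."""
    wav, sr = librosa.load(in_path, sr=SAMPLE_RATE)
    trimmed, _ = librosa.effects.trim(wav, top_db=TRIM_TOP_DB)
    sf.write(out_path, trimmed, sr)
```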

begeekmyfriend commented 4 years ago

trim_top_db does not really matter; in fact, preprocessing still clips the mel values anyway.

tugstugi commented 4 years ago

I made further experiments: single-speaker T2 training loss is around 0.11 to 0.15, 2 speakers around 0.18, 3 speakers around 0.23.

tugstugi commented 4 years ago

@begeekmyfriend I think I have found the cause of the strange artifacts WaveRNN produces. I trained T2 without --load-mel-from-disk. In this case, the min mel values are around -12, while GTA/WaveRNN uses -4 as the clip/pad value, and this mismatch seems to cause the artifacts. How did you choose -4 as the clip/pad value?
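
A rough sketch of the mismatch as I understand it (this is my reading of the thread, not necessarily the repo's exact preprocessing): the T2 targets bottom out around -12, while the GTA/WaveRNN side clips and pads at -4, so bringing the mels to the same floor would look something like:

```python
import numpy as np

# Assumed convention (my reading of this thread): GTA/WaveRNN expects
# log-mels with a floor of -4 and pads shorter mels with that same value.
MEL_FLOOR = -4.0

def clip_and_pad_mel(mel, target_frames):
    """Clip log-mel values to MEL_FLOOR and right-pad along the time axis."""
    mel = np.clip(mel, MEL_FLOOR, None)                 # floor at -4
    pad = target_frames - mel.shape[-1]
    if pad > 0:
        mel = np.pad(mel, ((0, 0), (0, pad)),
                     mode="constant", constant_values=MEL_FLOOR)
    return mel
```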

I have also made a pull request https://github.com/begeekmyfriend/tacotron2/pull/27 to fix the error mentioned in https://github.com/begeekmyfriend/tacotron2/issues/24

begeekmyfriend commented 4 years ago

See https://github.com/begeekmyfriend/tacotron2/issues/17#issuecomment-667647540 and https://github.com/begeekmyfriend/WaveRNN/commit/7e1d4032ae89244945b8eb1216852a48305b4e99

tugstugi commented 4 years ago

With the -4-clipped mels, the 4-speaker T2 loss is now around 0.13, and I am now training WaveRNN. Hopefully this solves the artifacts :)

begeekmyfriend commented 4 years ago

You'd better cut off the edges of the corpus audio, since they contain noise.

tugstugi commented 4 years ago

@begeekmyfriend you mean the trimming?

begeekmyfriend commented 4 years ago

It just works with these hyperparameters, and all we need to do is follow them.

tugstugi commented 4 years ago

After 200k steps, WaveRNN now sounds OK: wavernn.zip