fatchord / WaveRNN

WaveRNN Vocoder + TTS
https://fatchord.github.io/model_outputs/
MIT License

bad cases of waves generated by WaveRNN conditioned on mel #73

Open MorganCZY opened 5 years ago

MorganCZY commented 5 years ago

I've trained WaveRNN on the LJSpeech dataset with mels as the conditioning input. When generating waves, some bad cases occasionally appear, shown in the pictures below (they are the same sentence generated at different training steps). [image] [image] Besides, WaveRNN can sometimes produce perfect waves, as shown in the bottom picture. [image] I wonder what the reason is. Are there any inappropriate operations in the preprocessing stage, or are there other causes?
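For reference, this is roughly the kind of preprocessing check I mean (a minimal sketch using librosa; the hop/win/fft/mel values below are just common LJSpeech-style assumptions and must match whatever hparams.py actually uses):

```python
# Minimal mel-extraction sanity check (assumes librosa is installed).
# The sample_rate / n_fft / hop_length / win_length / n_mels values are assumptions,
# not the repo's authoritative settings -- compare them against hparams.py.
import librosa
import numpy as np

def check_mel(wav_path, sample_rate=22050, n_fft=2048, hop_length=275,
              win_length=1100, n_mels=80, fmin=40):
    y, _ = librosa.load(wav_path, sr=sample_rate)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sample_rate, n_fft=n_fft, hop_length=hop_length,
        win_length=win_length, n_mels=n_mels, fmin=fmin)
    mel_db = librosa.power_to_db(mel)
    print('mel shape:', mel.shape, 'min/max dB:', mel_db.min(), mel_db.max())
    return mel

# check_mel('some_ljspeech_clip.wav')  # placeholder filename
```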

oytunturk commented 5 years ago

Mode collapse and slowness are two common issues with WaveNet and related neural vocoders, and what you observe looks like a typical mode collapse situation. Flow-based models and FFT- or LPC-domain neural vocoders might avoid these issues, but they still lack the quality that WaveNet-like vocoders provide.

fatchord commented 5 years ago

@MorganCZY Can you post a zip with the samples generated please? How many steps did you train for?

MorganCZY commented 5 years ago

@fatchord here are the three waves demonstrated above, generated at steps 952k, 956k and 959k respectively: samples.zip. I trained this model for the full 1000k steps, and the phenomenon above still shows up occasionally.

fatchord commented 5 years ago

@MorganCZY I've never heard artifacts that loud in my own experiments. It could just be a non-ideal local minimum that you've found yourself in. Have you tried converging the model by reducing the learning rate down to 5e-5 for an hour or so, then 2e-5 and so on?
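Something like this, roughly (a sketch of the idea, not the repo's actual training code; the model/optimizer objects and step counts are placeholders):

```python
# Rough sketch of "converge with a lower learning rate" in PyTorch.
# `model`, `optimizer` and `train_for_n_steps` stand in for whatever
# train_wavernn.py builds; the schedule below is illustrative only.
import torch

def set_learning_rate(optimizer, lr):
    # PyTorch optimizers expose the learning rate per parameter group
    for group in optimizer.param_groups:
        group['lr'] = lr

# Resume from the existing checkpoint, then fine-tune in stages:
# set_learning_rate(optimizer, 5e-5); train_for_n_steps(model, optimizer, steps=20_000)
# set_learning_rate(optimizer, 2e-5); train_for_n_steps(model, optimizer, steps=20_000)
# set_learning_rate(optimizer, 1e-5); train_for_n_steps(model, optimizer, steps=20_000)
```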

MorganCZY commented 5 years ago

@fatchord I didn't change the learning rate manually; I just used your scripts to run the full 1000k training steps. The final loss is around 4.0 on the LJSpeech dataset. Is that comparable to a well-trained model's loss? Also, do you mean the phenomenon above may result from a local minimum?

hyomuk-kim commented 5 years ago

I've also seen the same phenomenon in samples from a WaveRNN model I trained (the phenomenon could be described as... intermittent energy excess?). The model was trained up to 800K steps only with GTA features from the pretrained Tacotron model. It showed up at some steps regardless of whether 'batched' mode was on or not.

MorganCZY commented 5 years ago

@gyanr0425 I just used the ground-truth mels as input to train WaveRNN, so I guess the "intermittent energy excess" problem comes from the model architecture or the corpus. Looking forward to an explicit reason and a workable solution.
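In case it helps anyone reproduce this, the kind of "intermittent energy excess" I mean can be spotted by looking at the short-time RMS of a generated wav (a quick sketch assuming numpy + librosa; the threshold factor is arbitrary and only meant for eyeballing bad samples):

```python
# Flag frames whose short-time RMS is far above the median level of the clip.
# Assumes numpy and librosa; `factor` is an arbitrary spike threshold.
import librosa
import numpy as np

def find_energy_spikes(wav_path, sr=22050, frame_length=2048, hop_length=512, factor=6.0):
    y, _ = librosa.load(wav_path, sr=sr)
    rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    threshold = factor * np.median(rms)
    spike_frames = np.where(rms > threshold)[0]
    spike_times = spike_frames * hop_length / sr  # spike locations in seconds
    return spike_times

# print(find_energy_spikes('generated_sample.wav'))  # placeholder filename
```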

fatchord commented 5 years ago

@MorganCZY Have you tried converging the model? You can override the hparam learning rate - use train_wavernn.py -h to see the options for the script.

hyomuk-kim commented 5 years ago

I am also training the model with true mels (400K) and GTA (400K), the same as the pre-trained model. After checking the result, I will follow @fatchord's advice and converge the model with a lower learning rate. The true-mel 400K result showed the same phenomenon, too. But first, I'm curious about @MorganCZY's result with the lower learning rate. For my model, the mel 400K + GTA 200K results are fine so far, even if the sound quality is a little worse than that of the pre-trained model.

fatchord commented 5 years ago

@gyanr0425 One small thing (I'll be updating the readme soon with tips and tricks like this one) - MOL needs a lot of training steps to start sounding good, around 800k - 1M in my experiments so far. The RAW bits mode is much faster - you can get OK-sounding 8-bit samples in less than 300k steps (if my memory serves me correctly).
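For anyone who wants to try RAW mode, the switch lives in hparams.py; roughly like this (variable names are from my copy of the file and may differ slightly in your checkout):

```python
# Illustrative hparams.py settings for the vocoder output mode
# (names based on one copy of hparams.py -- double-check yours).
voc_mode = 'RAW'   # 'RAW' = softmax over raw bits (converges faster),
                   # 'MOL' = mixture of logistics (needs ~800k-1M steps)
bits = 9           # bit depth used in RAW mode
mu_law = True      # mu-law companding helps suppress noise at low bit depths
```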

chen849157649 commented 5 years ago

@fatchord hello, about "training the model with true mel (400K) and gta (400K)": does that mean running train_tacotron with true mel (400k) and then train_wavernn with gta (400k)? I don't quite understand - could you explain? Thank you, brother.