NVIDIA / waveglow

A Flow-based Generative Network for Speech Synthesis
BSD 3-Clause "New" or "Revised" License

[Quality] Not quite there yet with training. How to improve? #170

Closed cduguet closed 4 years ago

cduguet commented 4 years ago

Hello, I have been training my WaveGlow network from scratch for over 1000 epochs on a German dataset (LJSpeech format, 39 hours of audio at 16 kHz).

The quality still has some issues, though: the voice sounds gargly and coarse. I tried denoising and tuning sigma, but neither helped much. Here are inference samples from a mel spectrogram generated from the original audio.

You don't need to listen to all the audio clips, just ORIGINAL, BEST, and TACOTRON. The others are auxiliary, in case you are wondering how tuning affects the results.

[ORIGINAL]

[BEST] denoiser_strength=0.1, sigma=0.666

denoiser_strength=0.00, sigma=0.666

denoiser_strength=0.01, sigma=0.666

denoiser_strength=0.01, sigma=0.8

denoiser_strength=0.01, sigma=0.4

Surprisingly enough, even though I trained on ground-truth audio waveforms, inference with Tacotron-generated mel spectrograms sounds better:

[TACOTRON] Generated Mel Spectrogram with denoiser_strength=0.01, sigma=0.666
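For context, here is a minimal NumPy sketch of what these two knobs do at inference time, based on how the repo's inference script and Denoiser behave. The function names and array-based denoiser are hypothetical simplifications, not the repo's actual API:

```python
import numpy as np

def sample_latent(shape, sigma=0.666, seed=0):
    # WaveGlow inference draws the latent z from N(0, sigma^2) and
    # pushes it through the inverted flows; lowering sigma trades
    # variability/noise for a smoother, duller-sounding voice.
    rng = np.random.default_rng(seed)
    return sigma * rng.standard_normal(shape)

def denoise(audio_mag, bias_mag, strength=0.1):
    # The Denoiser subtracts a "bias" magnitude spectrum (what the
    # model outputs for an all-zero mel) scaled by `strength`,
    # clamping at zero, then reconstructs the waveform. Simplified
    # array version for illustration only.
    return np.maximum(audio_mag - strength * bias_mag, 0.0)
```

So sigma=0.4 vs 0.8 changes how aggressive the sampling is, while denoiser_strength controls how much of the model's constant background hiss gets spectrally subtracted afterwards.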

The question is: does anyone have experience getting rid of this gargling in the voice? Does training further help? For me the training curve has been flat since around epoch 900, with a loss of around -6.0.
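For reference on what that -6.0 means: WaveGlow's objective is a per-element negative log-likelihood, so the loss goes negative as the flow's log-determinant terms grow. A hedged sketch of the repo's WaveGlowLoss, simplified to scalar log-det sums:

```python
import numpy as np

def waveglow_loss(z, log_s_sum, log_det_w_sum, sigma=1.0):
    # Negative log-likelihood of the latent z under N(0, sigma^2),
    # minus the affine-coupling log-scales and invertible-1x1-conv
    # log-determinants, averaged over all elements. The constant
    # log-normalizer term is dropped, as in the repo's loss.
    nll = np.sum(z ** 2) / (2 * sigma ** 2)
    return (nll - log_s_sum - log_det_w_sum) / z.size
```

A flat loss near -6.0 means the likelihood has plateaued; more epochs mostly stop helping at that point unless something else (data, mel settings, learning rate) changes.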

Thank you for your suggestions!

rafaelvalle commented 4 years ago

@cduguet Cristian, download the WaveGlow weights we used with Mellotron and let us know if it sounds better. WaveGlow weights

cduguet commented 4 years ago

Thank you! I will try them out. Did you train it on Mellotron outputs or on mel spectrograms created from the audio files?

rafaelvalle commented 4 years ago

It was trained on studio quality audio files from a female speaker.

rafaelvalle commented 4 years ago

Closing due to inactivity.