PaddlePaddle / PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.
https://paddlespeech.readthedocs.io
Apache License 2.0

tacotron2-ge2e synthesizes poor-quality audio with the WaveGAN vocoder #1258

Closed PingAnPH closed 2 years ago

PingAnPH commented 2 years ago

I want to implement multi-speaker TTS, so I am comparing the two projects in this repo. tacotron2-ge2e and fastspeech2-ge2e use different vocoders (waveflow for t2-ge2e and wavegan for f2-ge2e), so to find the better synthesizer it is essential to use the same vocoder for both. I therefore changed the vocoder of t2-ge2e from waveflow to wavegan, the same as f2-ge2e. However, the synthesized audio is poor: it has very low energy and is unintelligible. The same vocoder performs well with f2-ge2e. It seems that the synthesizers are coupled with the vocoders. Is this expected?

Does anyone know why? Thanks.

yt605155624 commented 2 years ago

tacotron2-ge2e is an old version of voice cloning, and I think its quality is poor. I suggest you use fastspeech2-ge2e. Yes, the synthesizers are coupled with the vocoders: an acoustic model and a vocoder should be trained with the same sample rate and hop_size (and other feature settings such as n_fft and win_length). When training tacotron2 and waveflow, the sample rate is 22050 (the sample rate of the ljspeech dataset) and the hop size is 256:

https://github.com/PaddlePaddle/PaddleSpeech/blob/f27d9d50e60e1bf7762fa7b08d21504d63f75358/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/config.py#L21

https://github.com/PaddlePaddle/PaddleSpeech/blob/f27d9d50e60e1bf7762fa7b08d21504d63f75358/paddlespeech/t2s/exps/waveflow/config.py#L21

But when training fastspeech2 and pwgan, the sample rate is 24000 and the hop size (n_shift) is 300:

https://github.com/PaddlePaddle/PaddleSpeech/blob/f27d9d50e60e1bf7762fa7b08d21504d63f75358/examples/aishell3/vc1/conf/default.yaml#L5

https://github.com/PaddlePaddle/PaddleSpeech/blob/f27d9d50e60e1bf7762fa7b08d21504d63f75358/examples/aishell3/voc1/conf/default.yaml#L9
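
To make the mismatch concrete, here is a minimal sketch that compares two feature configurations and computes how long one mel frame lasts under each. The dictionary keys and the helper names (`feature_mismatches`, `frame_ms`) are hypothetical and are not PaddleSpeech APIs; the numbers are the sample rates and hop sizes mentioned above, with 22050 Hz being the LJSpeech rate.

```python
# Illustrative feature settings. The key names are assumptions for this
# sketch; the real configs use their own names (e.g. n_shift for hop size).
TACOTRON2_WAVEFLOW = {"sample_rate": 22050, "hop_size": 256}
FASTSPEECH2_PWGAN = {"sample_rate": 24000, "hop_size": 300}


def feature_mismatches(acoustic_conf, vocoder_conf):
    """Return {key: (acoustic_value, vocoder_value)} for every differing setting."""
    keys = sorted(set(acoustic_conf) | set(vocoder_conf))
    return {
        key: (acoustic_conf.get(key), vocoder_conf.get(key))
        for key in keys
        if acoustic_conf.get(key) != vocoder_conf.get(key)
    }


def frame_ms(conf):
    """Duration of one mel frame in milliseconds: hop_size / sample_rate."""
    return 1000.0 * conf["hop_size"] / conf["sample_rate"]


mismatches = feature_mismatches(TACOTRON2_WAVEFLOW, FASTSPEECH2_PWGAN)
print(mismatches)
print(round(frame_ms(TACOTRON2_WAVEFLOW), 2))  # about 11.61 ms per frame
print(frame_ms(FASTSPEECH2_PWGAN))             # 12.5 ms per frame
```

A vocoder trained on 12.5 ms frames interprets every frame it receives as 12.5 ms of audio, so a mel spectrogram produced with 11.61 ms frames is read on the wrong time and frequency grid even before any other difference is considered.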

In addition, for fastspeech2 and pwgan (and speedyspeech and the other GAN vocoders) we normalize and denormalize the mel spectrogram. Tacotron2 and waveflow are the only models not included in the standard PaddleSpeech TTS pipeline (training and synthesizing).
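
As a rough illustration of the normalize/denormalize step, the sketch below applies a per-bin z-score to a tiny mel spectrogram and then inverts it. This assumes mean/standard-deviation normalization with one statistic per mel bin; the function names and the toy values are made up for this example and are not taken from the PaddleSpeech code.

```python
def normalize(mel, mean, std):
    """Z-score every frame bin by bin: (x - mean) / std."""
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)] for frame in mel]


def denormalize(mel, mean, std):
    """Invert normalize(): x * std + mean."""
    return [[x * s + m for x, m, s in zip(frame, mean, std)] for frame in mel]


# Toy statistics and a 2-frame, 2-bin log-mel spectrogram (made-up values).
mean = [-4.0, -6.0]
std = [2.0, 1.5]
raw_mel = [[-2.0, -6.0], [-8.0, -3.0]]

norm_mel = normalize(raw_mel, mean, std)
restored = denormalize(norm_mel, mean, std)

print(norm_mel)  # what a vocoder trained on normalized features expects
print(restored)  # back on the raw scale
```

If a vocoder was trained on normalized mels but receives raw ones (or the reverse), its input is on a different scale from anything it saw in training, which would be consistent with the very quiet, unintelligible output described in the question.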

I will try to improve the quality of Tacotron2 in the future. (I think both Tacotron2 and waveflow are not good enough now)

PingAnPH commented 2 years ago

Got it! Thank you!

PingAnPH commented 2 years ago

Another question: if I have trained different synthesizers and vocoders with the same sample_rate and hop_size, can I use them in different combinations?

yt605155624 commented 2 years ago

Of course you can. You can try demos/tts, where you can choose speedyspeech/fastspeech2 + pwgan / mb melgan / hifigan / style melgan; the default is fastspeech2 + pwgan. But if you use a vocoder trained on a female voice to synthesize a male voice, you will get bad quality, and vice versa. A multi-speaker vocoder may be better here.
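
One way to think about which combinations are safe is to pair each acoustic model only with vocoders that share its feature settings. The sketch below does that for the four models whose sample rate and hop size were given earlier in this thread; the dictionaries and the `compatible_pairs` helper are hypothetical and only illustrate the idea.

```python
from itertools import product

# (sample_rate, hop_size) for each model, using the values from this thread.
ACOUSTIC_MODELS = {"tacotron2": (22050, 256), "fastspeech2": (24000, 300)}
VOCODERS = {"waveflow": (22050, 256), "pwgan": (24000, 300)}


def compatible_pairs(acoustic_models, vocoders):
    """List (acoustic_model, vocoder) pairs whose feature settings match."""
    return sorted(
        (am_name, voc_name)
        for (am_name, am_sig), (voc_name, voc_sig) in product(
            acoustic_models.items(), vocoders.items()
        )
        if am_sig == voc_sig
    )


pairs = compatible_pairs(ACOUSTIC_MODELS, VOCODERS)
print(pairs)
```

Matching feature settings are necessary but not sufficient: this check does not capture the speaker mismatch mentioned above, nor whether both models use the same mel normalization.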

PingAnPH commented 2 years ago

ok, thank you.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 2 years ago

This issue is closed. Please re-open if needed.