coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

GST Prosody transfer on Tacotron 2 is not working #1692

Closed saibharani closed 2 years ago

saibharani commented 2 years ago

Describe the bug

I trained a Tacotron2 GST model on the LJSpeech dataset and my own emotional dataset for 100k steps, using the use_gst=True, gst=GSTConfig() options in training.
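
The relevant part of the training config looked roughly like the sketch below (only the GST options reflect my exact setup; the other values and paths are placeholders):

```python
# Tacotron2 + GST training config sketch (TTS 0.7.x field names).
# Only use_gst / gst are the exact options from my run; the rest are placeholders.
from TTS.tts.configs.shared_configs import GSTConfig
from TTS.tts.configs.tacotron2_config import Tacotron2Config

config = Tacotron2Config(
    use_gst=True,              # enable the Global Style Token layer
    gst=GSTConfig(),           # default GST settings (number of tokens, heads, etc.)
    run_name="tacotron2-gst",  # placeholder
    batch_size=32,             # placeholder
    use_phonemes=True,         # placeholder
    phoneme_language="en-us",  # placeholder
    output_path="output/",     # placeholder
)
```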

To Reproduce

During inference the audio sounds the same with or without a style_wav. I used the Synthesizer class from TTS.utils.synthesizer to synthesize the text. Can you suggest any changes to the config, or any other corrections, to improve the prosody transfer quality?
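
For reference, the inference code looked roughly like this (checkpoint and config paths are placeholders, not my actual paths):

```python
# Inference sketch with and without a style (reference) wav; the paths
# below stand in for the trained GST model files.
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="output/tacotron2-gst/best_model.pth",  # placeholder
    tts_config_path="output/tacotron2-gst/config.json",    # placeholder
    use_cuda=True,
)

# With a reference wav whose prosody should be transferred onto the text.
wav_styled = synthesizer.tts(
    "This is a test sentence.",
    style_wav="reference_expressive.wav",  # placeholder reference clip
)
synthesizer.save_wav(wav_styled, "out_with_style.wav")

# Without a style wav, for comparison -- both outputs sound the same.
wav_plain = synthesizer.tts("This is a test sentence.")
synthesizer.save_wav(wav_plain, "out_without_style.wav")
```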

Expected behavior

No response

Logs

No response

Environment

- TTS version 0.7.0
- Pytorch version 1.9.0
- CUDA version 10.2
- OS Ubuntu 18.04
- Python 3.8.13

Additional context

Any help is appreciated, thanks.

WeberJulian commented 2 years ago

Hey, LJSpeech might not be the best dataset to train GST on because of its lack of prosodic variation. In the paper they use data from the 2013 Blizzard Challenge, but you could also try audiobook data, since it tends to be more expressive.

WeberJulian commented 2 years ago

You can also try inference while manually tweaking the style token weights to see if you can get variations.

saibharani commented 2 years ago

That makes sense. I will try to train a new model using the Blizzard dataset or some audiobook data and will give an update here.

Can you please elaborate on tweaking the style tokens manually? Do you mean inputting a dict to the gst_style_input_weights parameter? Is it something like the sketch below?
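
This is only my guess at the format (the token indices and weight values are arbitrary, and I'm not sure whether passing the dict as style_wav is equivalent to setting gst_style_input_weights in the config):

```python
# Guess at manual style token weights (indices and values are arbitrary).
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="output/tacotron2-gst/best_model.pth",  # placeholder
    tts_config_path="output/tacotron2-gst/config.json",    # placeholder
    use_cuda=True,
)

# A dict of {token_index: weight}, passed instead of a reference wav, so the
# GST layer would combine its learned tokens with these weights directly.
style_weights = {"0": 0.3, "1": -0.2, "5": 0.5}

wav = synthesizer.tts("This is a test sentence.", style_wav=style_weights)
synthesizer.save_wav(wav, "out_manual_tokens.wav")
```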

Thanks

erogol commented 2 years ago

I'm moving it, as it is not a functional issue.