Hey, LJSpeech might not be the best dataset to train GST on because of its lack of prosody variation. In the paper, they use data from the 2013 Blizzard Challenge, but you might also try audiobook data, since it tends to be more expressive.
You can also try inference while manually tweaking the style tokens to see whether you get variations.
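For example, something along these lines is a minimal sketch (the paths are placeholders, and it assumes your TTS version accepts a dict of {token_index: weight} through the style_wav argument, which the GST code path in the Tacotron models treats as raw token weights):

```python
from TTS.utils.synthesizer import Synthesizer

# Placeholder paths to your trained GST model; adjust to your run.
synthesizer = Synthesizer(
    tts_checkpoint="path/to/checkpoint.pth",
    tts_config_path="path/to/config.json",
    use_cuda=False,
)

# Instead of a reference wav, weight the learned style tokens directly.
# Keys are token indices, values are amplifiers; try a few combinations
# and listen for prosody changes.
style_token_weights = {"0": 0.3, "2": -0.1, "5": 0.6}

wav = synthesizer.tts(
    "This is a test sentence for manual style token weights.",
    style_wav=style_token_weights,
)
synthesizer.save_wav(wav, "gst_token_test.wav")
```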
That makes sense. I will try to train a new model using the Blizzard dataset or some audiobook data and will post an update here.
Can you please elaborate on tweaking the style tokens manually? Do you mean passing a dict to the gst_style_input_weights parameter? Can you give an example of manual style tokens?
Thanks
I'm moving it, as it is not a functional issue.
Describe the bug
I trained a Tacotron2 GST model on the LJSpeech dataset and my own emotional dataset for 100k steps, using the
use_gst=True, gst=GSTConfig(),
options in the training config.
To Reproduce
But during inference the audio sounds the same with or without a style_wav. I used the Synthesizer class
(from TTS.utils.synthesizer import Synthesizer)
to synthesize the text. Can you suggest any changes to the config, or any other corrections, to improve the prosody transfer quality?
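Roughly, the GST-related part of the training config looked like the sketch below (a minimal sketch; the exact import paths and GSTConfig field names may differ slightly between TTS versions):

```python
from TTS.tts.configs.shared_configs import GSTConfig
from TTS.tts.configs.tacotron2_config import Tacotron2Config

config = Tacotron2Config(
    use_gst=True,
    gst=GSTConfig(
        # Defaults were used in my run; these are the knobs that could be tuned.
        gst_num_style_tokens=10,
        gst_num_heads=4,
        gst_embedding_dim=256,
    ),
    # ... dataset, audio, and trainer settings omitted ...
)
```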
Expected behavior
No response
Logs
No response
Environment
Additional context
Any help is appreciated, thanks.