Closed: Henryplay closed this issue 7 months ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check out our discussion channels.
Describe the bug
I've checked the existing issues about SpeedySpeech and could not find the same question. Over the past few days I trained the SpeedySpeech model on LJSpeech for 1000 epochs, using the pretrained HiFi-GAN model as the vocoder. The synthesized wav did not sound good. When I replaced my acoustic SpeedySpeech model with the pretrained one, the wav sounded good. The synthesis command is
To Reproduce
1. I've used the recipes/ljspeech/speedy_speech/train_speedy_speech.py script with the command CUDA_VISIBLE_DEVICES=0,1,2,3 python -m trainer.distribute --script train_speedy_speech.py. Here is the code.
2. Here is the output config file code.
Expected behavior
In the SpeedySpeech paper, a teacher model must first be trained to extract durations, and the extracted durations are then used to train the SpeedySpeech student model. However, in this code there is no step to train a teacher model, and I see

"If the alignment network is used, the model learns the text-to-speech alignment from the data instead of using pre-computed durations."

in TTS/tts/models/forward_tts.py, line 177. I am a bit confused about this. Is this the reason why the model I trained doesn't perform well?

Logs
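For context, the quoted behaviour is typically controlled by an aligner flag in the model arguments. A minimal sketch, assuming the Coqui TTS `SpeedySpeechConfig` / `ForwardTTSArgs` API (exact field names may differ between versions):

```python
# Hedged sketch: when use_aligner is True, the model learns alignments
# from the data itself, so no separate teacher model is trained to
# extract durations beforehand. This assumes Coqui TTS's config classes.
from TTS.tts.configs.speedy_speech_config import SpeedySpeechConfig
from TTS.tts.models.forward_tts import ForwardTTSArgs

config = SpeedySpeechConfig(
    model_args=ForwardTTSArgs(
        use_aligner=True,  # learn text-to-speech alignment internally
    ),
)
```

If this flag is enabled in the recipe, the missing teacher-model step would be expected rather than a cause of poor quality.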
Additional context
No response