Closed BillyBobQuebec closed 3 years ago
You can check whether it is about `r`. You can init the model with the default `r` value, then change it to 2 and run inference.
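For context, here is a toy illustration (not Coqui TTS code) of what the reduction factor does: a Tacotron-style decoder emits `r` mel frames per autoregressive step, so changing `r` at inference changes how many decoder steps are taken for the same utterance length.

```python
# Toy illustration of the reduction factor r (not Coqui TTS code):
# a Tacotron-style decoder emits r mel frames per autoregressive step,
# so the number of decoder steps for a fixed-length utterance scales with 1/r.

def decoder_steps(n_frames: int, r: int) -> int:
    """Autoregressive steps needed to emit n_frames mel frames at reduction r."""
    return -(-n_frames // r)  # ceiling division

# A ~7 s LJSpeech-like utterance is roughly 600 mel frames:
print(decoder_steps(600, 6))  # 100 steps at r=6
print(decoder_steps(600, 2))  # 300 steps at r=2
```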
Ok, I have done that now. It appears that the r-value is not the issue, as it sounds the same. Is it normal for inference to sound like this at this point in training, or do you think there is possibly another issue?
Try disabling `mixed_precision` in the config file. It causes issues on some systems.
Also, you can check the working released models and try copying their config for your run. Maybe I missed something as I was updating TTS with the new Trainer API.
The recipe that was used already has mixed-precision set to false. I tried using the pre-trained model's config for this model instead of the recipe config and attempted inference in a variety of ways to make it work, but was only able to get this:
size mismatch for embedding.weight: copying a param with shape torch.Size([182, 512]) from checkpoint, the shape in current model is torch.Size([64, 512]).
size mismatch for decoder.linear_projection.linear_layer.weight: copying a param with shape torch.Size([480, 1536]) from checkpoint, the shape in current model is torch.Size([160, 1536]).
size mismatch for decoder.linear_projection.linear_layer.bias: copying a param with shape torch.Size([480]) from checkpoint, the shape in current model is torch.Size([160]).
size mismatch for decoder.stopnet.1.linear_layer.weight: copying a param with shape torch.Size([1, 1504]) from checkpoint, the shape in current model is torch.Size([1, 1184]).
Update: I was able to fix all the size-mismatch errors by making these changes to the pretrained config (enable "double_decoder_consistency", remove "characters", set "r": 6, and set "ddc_r": 6), but the resulting audio still sounded identical to what I attached at the top of this thread.
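For reference, the changes listed above correspond to a config fragment along these lines (key names as used in Coqui TTS Tacotron2 configs; this is not a complete config, and the "characters" block is deleted rather than set):

```json
{
  "r": 6,
  "ddc_r": 6,
  "double_decoder_consistency": true
}
```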
You need to retrain with the new config, especially if there are different audio parameters. It is not enough to change it only for inference.
Okay, but the pretrained config has double decoder consistency disabled for some reason... so do I enable that and keep everything else in the config the same?
You can enable it and keep the rest the same. It is disabled because the second decoder is removed to reduce the model size.
Okay, attempting that now. At what point would you recommend attempting inference and checking audio quality? The total is 1,000 epochs.
In general, after 20k steps, it should start producing understandable speech.
It's now at about 100K steps (training with the pretrained model config; the only change made is setting DDC to true). Here is how it sounds at 20K, 60K, and 80K steps. Is your statement about producing understandable speech perhaps in reference to gradual training specifically? The pretrained config does not use gradual training and seems to be training slower than expected. Here is the tensorboard if that helps: https://tensorboard.dev/experiment/Ku5pmxY2QVWrasO3SQ7mBQ/#scalars
Why is there no image on the Tensorboard? There should be alignment images.
Looking at the audio samples, something seems to be broken in the model or the configs you use.
Also, please share the alignment images, so we can see whether the bug is in the inference or the training code.
I never knew the tensorboard was supposed to show alignment images; is it supposed to show spectrogram images as well? I'm not sure where I would find those images at all. This is the tensorboard command used; does it seem correct to you?: tensorboard dev upload --logdir .
For two weeks I have had the same type of problems with Tacotron2-DDC inference. My models trained with version 0.1.2 look fine in Tensorboard and the audio in Tensorboard is intelligible, but the inference audio is broken. Until now I searched for errors in my settings, but the present issue description by Billy Bob makes me think that there is really a problem with inference. My understanding is that models released in the past should work with the latest Coqui-TTS versions. Therefore I did some inference tests with the Tacotron2-DDC LJSpeech model, released in April 2021.
I used the following script
tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
--model_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/model_file.pth.tar \
--config_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/config.json \
--out_path /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_version.wav
and started with version 0.0.12 (git checkout a53958a). It works as expected.
Here are the logs, the signal-figure and the sound:
(coqui-tts) mbarnig@mbarnig-MS-7B22:~/coqui-tts/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.0.12.wav
> Downloading model to /home/mbarnig/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC
> Downloading model to /home/mbarnig/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2
> Using model: Tacotron2
> Generator Model: hifigan_generator
Removing weight norm...
> Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
> Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
> Processing time: 2.3137736320495605
> Real-time factor: 0.3193459475882124
> Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.0.12.wav
https://user-images.githubusercontent.com/1360633/126529582-dfb4b79b-413a-4d72-a950-50b54d41544c.mp4
Version 0.0.13 (git checkout f02f033) also works fine.
In version 0.0.14 (git checkout 5482a0f) the following error is reported:
File "/home/mbarnig/coqui-tts/lib/python3.8/site-packages/coqpit/coqpit.py", line 856, in check_argument
assert os.path.exists(c[name]), f' [!] path for {name} ("{c[name]}") does not exist.'
AssertionError: [!] path for pad ("") does not exist.
I was not able to debug this problem and could not check whether inference was working.
Version 0.0.15 (git checkout b8b79a5) shows no errors in the logs, but the sound is bad.
https://user-images.githubusercontent.com/1360633/126529666-d013c89c-a29d-4643-9fbc-2b116685e1cb.mp4
Same results for version 0.0.15.1 (git checkout d245b5d)
Versions 0.1.0 (git checkout c25a218), 0.1.1 (git checkout 676d22f), 0.1.2 (git checkout 8fbadad) and main show a warning that the decoder stopped with `max_decoder_steps` 500. In all cases the sound is broken, as in version 0.0.15.
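As a side note, the warning refers to a configurable limit: in Coqui TTS Tacotron configs the decoder is cut off unconditionally after `max_decoder_steps` frames. Raising it can let longer sentences finish, although in this thread the early stop looks like a symptom of the underlying bug rather than a too-low limit. A hedged config fragment (the value 10000 is only an example):

```json
{
  "max_decoder_steps": 10000
}
```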
Here are the logs, the signal-figure and the sound for the latest version 0.1.2:
(coqui-tts) mbarnig@mbarnig-MS-7B22:~/coqui-tts/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.1.2.wav
> Using model: Tacotron2
> Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
> Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
> Decoder stopped with `max_decoder_steps` 500
> Processing time: 2.9582419395446777
> Real-time factor: 0.47355409140841087
> Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.1.2.wav
https://user-images.githubusercontent.com/1360633/126530413-160a6962-7ce5-4ddf-be43-62532bb54f89.mp4
I hope my report helps to solve the problem.
Never knew the tensorboard was supposed to show alignment images, is it supposed to show spectrogram images as well? I'm not sure where I would find those images at all. This is the tensorboard command used, does it seem correct to you?:
tensorboard dev upload --logdir .
yes it should show all these.
Why don't you just run tensorboard locally? Maybe uploading breaks things.
@mbarnig very helpful!! Thanks for going under the hood.
So it looks like we have something wrong after 0.15
I'll check and try to find that little 🐛
To complete my report, I did some inference tests in version 0.1.2 with the other released English models. Here are my findings:
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/GlowTTS-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/GlowTTS-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/glowtts-ljspeech_v0.1.2.wav
> Using model: glow_tts
> Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
> Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
> Phonemes: ð|ə| n|ɔːɹ|θ| w|ɪ|n|d| æ|n|d| ð|ə| s|ʌ|n| w|ɜː| d|ɪ|s|p|j|uː|ɾ|ɪ|ŋ| w|ɪ|tʃ| w|ʌ|z| ð|ə| s|t|ɹ|ɔ|ŋ|ɡ|ɚ| ,| w|ɛ|n| ɐ| t|ɹ|æ|v|ə|l|ɚ| k|eɪ|m| ɐ|l|ɔ|ŋ| ɹ|æ|p|t| ɪ|n| ɐ| w|ɔːɹ|m| k|l|oʊ|k| .
> Processing time: 3.61332106590271
> Real-time factor: 0.4497072242343694
> Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/glowtts-ljspeech_v0.1.2.wav
https://user-images.githubusercontent.com/1360633/126630757-181621f0-ecf3-4954-b56c-471993145ea0.mp4
I changed the stats_path in the config file to adapt to my environment.
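For anyone reproducing this, that override amounts to pointing the audio section of the config at a local stats file, along these lines (the path is a placeholder for wherever the released stats file was downloaded; `scale_stats.npy` is the usual Coqui TTS file name):

```json
{
  "audio": {
    "stats_path": "/path/to/LJSpeech-release/scale_stats.npy"
  }
}
```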
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/Tacotron2-DCA-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/Tacotron2-DCA-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/tacotron2-dca-ljspeech_v0.1.2.wav
> Using model: Tacotron2
> Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
> Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
> Phonemes: ð|ə| n|ɔːɹ|θ| w|ɪ|n|d| æ|n|d| ð|ə| s|ʌ|n| w|ɜː| d|ɪ|s|p|j|uː|ɾ|ɪ|ŋ| w|ɪ|tʃ| w|ʌ|z| ð|ə| s|t|ɹ|ɔ|ŋ|ɡ|ɚ| ,| w|ɛ|n| ɐ| t|ɹ|æ|v|ə|l|ɚ| k|eɪ|m| ɐ|l|ɔ|ŋ| ɹ|æ|p|t| ɪ|n| ɐ| w|ɔːɹ|m| k|l|oʊ|k| .
> Processing time: 2.646209239959717
> Real-time factor: 0.384968553659821
> Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/tacotron2-dca-ljspeech_v0.1.2.wav
https://user-images.githubusercontent.com/1360633/126630824-6d180ef0-a616-43b6-8464-d17e952fae5e.mp4
I think this audio also has some problems, but I was not able to compare it with the released version.
I changed the stats_path in the config file to adapt to my environment. The inference script
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/SpeedySpeech-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/SpeedySpeech-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/speedyspeech-ljspeech_v0.1.2.wav
> Using model: speedy_speech
Traceback (most recent call last):
File "/home/mbarnig/recipe/bin/tts", line 33, in <module>
sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
File "/home/mbarnig/recipe/TTS/TTS/bin/synthesize.py", line 226, in main
synthesizer = Synthesizer(
File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 73, in __init__
self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 136, in _load_tts
self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
File "/home/mbarnig/recipe/TTS/TTS/tts/models/speedy_speech.py", line 310, in load_checkpoint
self.load_state_dict(state["model"])
File "/home/mbarnig/recipe/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SpeedySpeech:
Missing key(s) in state_dict: "decoder.decoder.res_conv_block.res_blocks.0.conv_bn_blocks.0.conv1d.weight", "decoder.decoder.res_conv_block.res_blocks.0.conv_bn_blocks.0.conv1d.bias", "decoder.decoder.res_conv_block.res_blocks.0.conv_bn_blocks.0.norm.weight",
..........................
"decoder.decoder.postnet.4.weight", "decoder.decoder.postnet.4.bias", "decoder.decoder.postnet.6.weight", "decoder.decoder.postnet.6.bias", "decoder.decoder.postnet.0.weight", "decoder.decoder.postnet.0.bias".
fails with a `Missing key(s) in state_dict` RuntimeError.
I changed the stats_path in the config file to my environment and added the speaker_idx to the script. The inference
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/SC-GlowTTS-VCTK-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/SC-GlowTTS-VCTK-release/config.json \
> --speaker_idx p225 \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/sc-glowtts-vctk_v0.1.2.wav
> Using model: glow_tts
> Training with 0 speakers:
Traceback (most recent call last):
File "/home/mbarnig/recipe/bin/tts", line 33, in <module>
sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
File "/home/mbarnig/recipe/TTS/TTS/bin/synthesize.py", line 226, in main
synthesizer = Synthesizer(
File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 73, in __init__
self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 136, in _load_tts
self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
File "/home/mbarnig/recipe/TTS/TTS/tts/models/glow_tts.py", line 386, in load_checkpoint
self.load_state_dict(state["model"])
File "/home/mbarnig/recipe/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GlowTTS:
Missing key(s) in state_dict: "emb_g.weight".
Unexpected key(s) in state_dict: "decoder.flows.2.wn.cond_layer.bias", "decoder.flows.2.wn.cond_layer.weight_g", "decoder.flows.2.wn.cond_layer.weight_v", "decoder.flows.5.wn.cond_layer.bias", "decoder.flows.5.wn.cond_layer.weight_g", "decoder.flows.5.wn.cond_layer.weight_v", "decoder.flows.8.wn.cond_layer.bias", "decoder.flows.8.wn.cond_layer.weight_g", "decoder.flows.8.wn.cond_layer.weight_v", "decoder.flows.11.wn.cond_layer.bias", "decoder.flows.11.wn.cond_layer.weight_g", "decoder.flows.11.wn.cond_layer.weight_v", "decoder.flows.14.wn.cond_layer.bias", "decoder.flows.14.wn.cond_layer.weight_g", "decoder.flows.14.wn.cond_layer.weight_v", "decoder.flows.17.wn.cond_layer.bias", "decoder.flows.17.wn.cond_layer.weight_g", "decoder.flows.17.wn.cond_layer.weight_v", "decoder.flows.20.wn.cond_layer.bias", "decoder.flows.20.wn.cond_layer.weight_g", "decoder.flows.20.wn.cond_layer.weight_v", "decoder.flows.23.wn.cond_layer.bias", "decoder.flows.23.wn.cond_layer.weight_g", "decoder.flows.23.wn.cond_layer.weight_v", "decoder.flows.26.wn.cond_layer.bias", "decoder.flows.26.wn.cond_layer.weight_g", "decoder.flows.26.wn.cond_layer.weight_v", "decoder.flows.29.wn.cond_layer.bias", "decoder.flows.29.wn.cond_layer.weight_g", "decoder.flows.29.wn.cond_layer.weight_v", "decoder.flows.32.wn.cond_layer.bias", "decoder.flows.32.wn.cond_layer.weight_g", "decoder.flows.32.wn.cond_layer.weight_v", "decoder.flows.35.wn.cond_layer.bias", "decoder.flows.35.wn.cond_layer.weight_g", "decoder.flows.35.wn.cond_layer.weight_v".
size mismatch for encoder.duration_predictor.conv_1.weight: copying a param with shape torch.Size([256, 448, 3]) from checkpoint, the shape in current model is torch.Size([256, 192, 3]).
fails with an `Error(s) in loading state_dict` RuntimeError.
Running SpeedySpeech and SC-GlowTTS-VCTK in earlier versions also fails, but with other errors.
I found the alignment images, and this is how it looks at 180K:
Also, this is the test audio that I found for 187K. Definitely very different audio, and finally understandable!
The alignment looks good enough. I guess the issue you experience is about a bug in the stopnet (which decides when the model should stop). I am working on it and will release the fix soon. Until then, just keep training the model; after the release, you can continue training with the new version and fix the stopnet.
@erogol Thank you for pushing the new update! I'm currently training with the same config used at the beginning of this issue, except now on your new v0.1.3, to see if the stopnet was the problem and whether I can finally replicate the inference quality of the current pretrained voice.
Hello @erogol !
So I have made a few interesting observations lately regarding this issue:
When I run inference with the model I trained using your pretrained config, along with the stopnet fix, I definitely get significantly better results compared to the one I trained before the stopnet fix. This difference is clear regardless of the vocoder used. That being said...
When inferencing this better DDC model with the pretrained hifigan as the vocoder, the quality is significantly worse than with both multi-band melgan and griffin-lim, whereas in my experience it should be significantly better than griffin-lim and at least similar to multi-band melgan.
Here is audio of my model inferencing with Griffin-Lim, Multi-band Melgan, and finally hifi-gan.
Loudness warning
On the contrary, the pretrained DDC model that comes with Coqui works perfectly well with hifigan, but the audio quality seems significantly worse when inferenced with multi-band melgan; the opposite trend from the DDC model I trained with the supposedly same config. Funnily enough, the high-pitch artifacts sound similar to what happens when I use my own DDC checkpoint with hifigan. Here is audio of that poor inference quality occurring with the Coqui pretrained DDC + multi-band melgan:
So overall, the stopnet fix definitely helped, but unfortunately something more is going on here that is preventing my Tacotron2-DDC model from reaching the quality that Tacotron2-DDC + Hifigan has to offer.
Thanks for the update.
The vocoder model should match the audio parameters of the TTS model. Have you checked?
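One quick way to check this is to diff the `audio` sections of the two config files. A minimal sketch using only the standard library (the temp-file demo at the bottom stands in for real TTS and vocoder config paths):

```python
import json

def audio_mismatches(tts_config_path: str, vocoder_config_path: str) -> dict:
    """Return audio parameters that differ between a TTS and a vocoder config."""
    with open(tts_config_path) as f:
        tts_audio = json.load(f).get("audio", {})
    with open(vocoder_config_path) as f:
        voc_audio = json.load(f).get("audio", {})
    shared = set(tts_audio) & set(voc_audio)  # compare only keys both define
    return {k: (tts_audio[k], voc_audio[k]) for k in shared
            if tts_audio[k] != voc_audio[k]}

if __name__ == "__main__":
    # Demo with two in-memory configs written to temp files:
    import os
    import tempfile
    a = {"audio": {"sample_rate": 22050, "num_mels": 80, "fft_size": 1024}}
    b = {"audio": {"sample_rate": 22050, "num_mels": 80, "fft_size": 2048}}
    paths = []
    for cfg in (a, b):
        fd, path = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as f:
            json.dump(cfg, f)
        paths.append(path)
    print(audio_mismatches(paths[0], paths[1]))  # only fft_size differs
```

An empty result means the shared audio parameters agree; any key it prints is a candidate cause for vocoder artifacts like the ones described above.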
Ah, it seems like that may be the problem. It looks like I was using a slightly different config, and I also noticed some things that might have affected training. I'm going to train from scratch again and make sure everything is correct; I'll post an update on how inference sounds with hifi-gan within the next few days.
Ok, this time I double-checked that all audio parameters were the same across both hifi-gan and Tacotron2-DDC, made sure I was using the new stopnet code, and correctly computed scale_stats.py, which I had some mishaps with before. With all of this I trained a model with the config to about 80K steps and tried inferencing it again with hifi-gan; unfortunately, that screeching high-pitch feature is still strongly present and doesn't seem to have improved.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check our discussion channels.
Discussed in https://github.com/coqui-ai/TTS/discussions/639