coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Inference/recipe not working properly. #640

Closed. BillyBobQuebec closed this issue 3 years ago.

BillyBobQuebec commented 3 years ago

Discussed in https://github.com/coqui-ai/TTS/discussions/639

Originally posted by **BillyBobQuebec** July 10, 2021

I am training Tacotron2-DDC (LJ) from scratch using the provided recipe with no changes. The [tensorboard](https://tensorboard.dev/experiment/x2oSvIvwTEqDA9zUkoXtVw/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D) looks good to my eyes, but the alignment and duration seem to be way off when I actually synthesize audio, so I suspect the model is being inferenced improperly; specifically, the r value it attempts to inference with. Here is the command I used to start training:

```
cd ~/repo/coqui-clean
bash recipes/ljspeech/tacotron2-DDC/run.sh
```

The recipe uses gradual training, which takes "r" as the starting value for the fine decoder (if I understand it correctly) and then changes it over the course of training, so I suspect inference uses the starting r value instead of the latest r value the fine decoder reached during training. When I try to force a different r value for inference (by passing a recipe config with `"r": 2,` instead of `"r": 6,`), it gives me this error:

```
RuntimeError: Error(s) in loading state_dict for Tacotron2:
    size mismatch for decoder.linear_projection.linear_layer.weight: copying a param with shape torch.Size([480, 1536]) from checkpoint, the shape in current model is torch.Size([160, 1536]).
    size mismatch for decoder.linear_projection.linear_layer.bias: copying a param with shape torch.Size([480]) from checkpoint, the shape in current model is torch.Size([160]).
    size mismatch for decoder.stopnet.1.linear_layer.weight: copying a param with shape torch.Size([1, 1504]) from checkpoint, the shape in current model is torch.Size([1, 1184]).
```

Here is the command used for inference, and here is how it sounds at different points:

```
cd ~
cp ~/repo/coqui-clean/recipes/ljspeech/tacotron2-DDC/scale_stats.npy .
cp ~/repo/coqui-clean/recipes/ljspeech/tacotron2-DDC/tacotron2-DDC.json config.json
CUDA_VISIBLE_DEVICES="" tts \
    --text "Hello I bought this T.V. today, and it's cold outside. I should probably grab my sweater and go to your moms house." \
    --model_path ~/repo/coqui-clean/recipes/ljspeech/tacotron2-DDC/ljspeech-ddc-July-06-2021_09+10AM-8fbadad6/checkpoint_280000.pth.tar \
    --config_path config.json \
    --out_path output.wav
```

https://user-images.githubusercontent.com/74849975/125181759-267e6280-e1d6-11eb-8e81-1de96b682245.mp4

https://user-images.githubusercontent.com/74849975/125181760-28e0bc80-e1d6-11eb-90fc-8392e5223fc7.mp4

https://user-images.githubusercontent.com/74849975/125181761-2aaa8000-e1d6-11eb-8477-9a6c1b59843f.mp4

erogol commented 3 years ago

You can check whether it is about r: init the model with the default r value, then change it to 2 and run inference.
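
For reference, a rough Python sketch of that experiment (paths are placeholders, `decoder.set_r()` is an assumption based on the Tacotron decoder code, and the `Synthesizer` keyword names may differ slightly across versions; verify against your installed code):

```python
# Rough sketch, not the exact API of every version: load the checkpoint with the
# unmodified training config (r=6), then override the reduction factor in place.
# Paths are placeholders; `decoder.set_r()` is an assumption taken from the
# Tacotron decoder code -- verify it exists in your installed version.
from TTS.utils.synthesizer import Synthesizer

synth = Synthesizer(
    tts_checkpoint="checkpoint_280000.pth.tar",
    tts_config_path="config.json",   # recipe config left at "r": 6
    use_cuda=False,
)
synth.tts_model.decoder.set_r(2)     # force r=2 only for this inference run
wav = synth.tts("Hello, I bought this T.V. today and it's cold outside.")
synth.save_wav(wav, "output_r2.wav")
```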

BillyBobQuebec commented 3 years ago

Ok, I have done that now. It appears that the r value is not the issue, as it sounds the same. Is it normal for inference to sound like this at these points in training, or do you think there might be another issue occurring?

erogol commented 3 years ago

Try disabling mixed_precision in the config file. It causes issues on some systems.

Also, you can check the working released models and try copying their config for your run. Maybe I missed something as I was updating TTS with the new Trainer API.

BillyBobQuebec commented 3 years ago

The recipe that was used already has mixed precision set to false. I tried using the pretrained model's config instead of the recipe config and attempted inference in a variety of ways with it to make it work, but was only able to get this:

```
size mismatch for embedding.weight: copying a param with shape torch.Size([182, 512]) from checkpoint, the shape in current model is torch.Size([64, 512]).
size mismatch for decoder.linear_projection.linear_layer.weight: copying a param with shape torch.Size([480, 1536]) from checkpoint, the shape in current model is torch.Size([160, 1536]).
size mismatch for decoder.linear_projection.linear_layer.bias: copying a param with shape torch.Size([480]) from checkpoint, the shape in current model is torch.Size([160]).
size mismatch for decoder.stopnet.1.linear_layer.weight: copying a param with shape torch.Size([1, 1504]) from checkpoint, the shape in current model is torch.Size([1, 1184]).
```

Update: I was able to fix all the size mismatch errors by making these changes to the pretrained config: enable `double_decoder_consistency`, remove `characters`, set `"r": 6`, and set `"ddc_r": 6`. The resulting audio still sounded identical to what I attached at the top of this thread.
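
For anyone following along, a minimal sketch of those edits applied to a copy of the pretrained config (standard library only; file names are placeholders):

```python
# Minimal sketch of the config edits described above; file names are placeholders.
import json

with open("pretrained_config.json") as f:
    cfg = json.load(f)

cfg["double_decoder_consistency"] = True  # enable DDC
cfg.pop("characters", None)               # drop the custom character set
cfg["r"] = 6                              # match the checkpoint's reduction factor
cfg["ddc_r"] = 6                          # reduction factor of the coarse decoder

with open("config_fixed.json", "w") as f:
    json.dump(cfg, f, indent=4)
```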

erogol commented 3 years ago

You need to retrain with the new config, especially if the audio parameters are different. It is not enough to change it only for inference.

BillyBobQuebec commented 3 years ago

Okay, but the pretrained config has double decoder consistency disabled for some reason. Do I enable that and keep everything else in the config the same?

erogol commented 3 years ago

You can enable it and keep the rest the same.

It is disabled in the released config because the second decoder is removed to reduce the model size.

BillyBobQuebec commented 3 years ago

Okay, attempting that now. At what point would you recommend trying inference and checking audio quality? The total is 1,000 epochs.

erogol commented 3 years ago

In general, after 20k steps, it should start producing understandable speech.

BillyBobQuebec commented 3 years ago

> In general, after 20k steps, it should start producing understandable speech.

It's now at about 100K steps (training with the pretrained model config; the only change made is setting DDC to true). Here is how it sounds at 20K, 60K, and 80K steps. Is your statement about producing understandable speech perhaps specific to gradual training? The pretrained config does not use gradual training and it seems to be training slower than expected. Here is the tensorboard if that helps: https://tensorboard.dev/experiment/Ku5pmxY2QVWrasO3SQ7mBQ/#scalars

https://user-images.githubusercontent.com/74849975/126017564-6b2db751-fbc1-4305-8de3-6708559ad32c.mp4

https://user-images.githubusercontent.com/74849975/126017574-e913d899-8563-4df6-be4b-c5b903b24b91.mp4

https://user-images.githubusercontent.com/74849975/126017576-c61f9781-7a16-44d7-b0e8-53fdf7dcbca8.mp4

erogol commented 3 years ago

Why is there no image on the Tensorboard? There should be alignment images.

erogol commented 3 years ago

Looking at the audio samples, though, something seems to be broken in the model or the configs you use.

Also, please share the alignment images, so we can see whether the bug is in the inference or the training code.

BillyBobQuebec commented 3 years ago

I never knew the tensorboard was supposed to show alignment images; is it supposed to show spectrogram images as well? I'm not sure where I would find those images. This is the tensorboard command I used; does it seem correct to you? `tensorboard dev upload --logdir .`

mbarnig commented 3 years ago

For the past two weeks I have had the same type of problem with Tacotron2-DDC inference. My models trained with version 0.1.2 look fine in Tensorboard and the audio in Tensorboard is intelligible, but the inference audio is broken. Until now I searched for errors in my own settings, but the present issue description by Billy Bob makes me think there really is a problem with inference. My understanding is that models released in the past should work with the latest Coqui-TTS versions, so I did some inference tests with the Tacotron2-DDC LJSpeech model released in April 2021.

I used the following script

```
tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
--model_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/model_file.pth.tar \
--config_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/config.json \
--out_path /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_version.wav
```

and started with version 0.0.12 (git checkout a53958a). It works as expected.

Here are the logs, the signal-figure and the sound:

```
(coqui-tts) mbarnig@mbarnig-MS-7B22:~/coqui-tts/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.0.12.wav 
 > Downloading model to /home/mbarnig/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC
 > Downloading model to /home/mbarnig/.local/share/tts/vocoder_models--en--ljspeech--hifigan_v2
 > Using model: Tacotron2
 > Generator Model: hifigan_generator
Removing weight norm...
 > Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
 > Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
 > Processing time: 2.3137736320495605
 > Real-time factor: 0.3193459475882124
 > Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.0.12.wav
```

[signal figure: v0.0.12]

https://user-images.githubusercontent.com/1360633/126529582-dfb4b79b-413a-4d72-a950-50b54d41544c.mp4

Version 0.0.13 (git checkout f02f033) also works fine.

In version 0.0.14 (git checkout 5482a0f) the following error is reported:

File "/home/mbarnig/coqui-tts/lib/python3.8/site-packages/coqpit/coqpit.py", line 856, in check_argument
    assert os.path.exists(c[name]), f' [!] path for {name} ("{c[name]}") does not exist.'
AssertionError:  [!] path for pad ("") does not exist.

I was not able to debug this problem and could not check if the inference is working.

Version 0.0.15 (git checkout b8b79a5) shows no errors in the logs, but the sound is bad.

[signal figure: v0.0.15]

https://user-images.githubusercontent.com/1360633/126529666-d013c89c-a29d-4643-9fbc-2b116685e1cb.mp4

Same results for version 0.0.15.1 (git checkout d245b5d)

Versions 0.1.0 (git checkout c25a218), 0.1.1 (git checkout 676d22f), 0.1.2 (git checkout 8fbadad), and main show a warning that the decoder stopped at `max_decoder_steps` 500. In all cases the sound is broken, as in version 0.0.15.

Here are the logs, the signal-figure and the sound for the latest version 0.1.2:

```
(coqui-tts) mbarnig@mbarnig-MS-7B22:~/coqui-tts/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/Tacotron2-Training/LJSpeech-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.1.2.wav 
 > Using model: Tacotron2
 > Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
 > Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 2.9582419395446777
 > Real-time factor: 0.47355409140841087
 > Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/ljspeech_v0.1.2.wav
```

[signal figure: v0.1.2]

https://user-images.githubusercontent.com/1360633/126530413-160a6962-7ce5-4ddf-be43-62532bb54f89.mp4

I hope my report helps to solve the problem.

erogol commented 3 years ago

> I never knew the tensorboard was supposed to show alignment images; is it supposed to show spectrogram images as well? I'm not sure where I would find those images. This is the tensorboard command I used; does it seem correct to you? `tensorboard dev upload --logdir .`

Yes, it should show all of these.

Why don't you just run tensorboard locally? Maybe uploading breaks things.
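
A minimal sketch of serving the run folder locally instead of uploading it, assuming the standard `tensorboard` package (the logdir path is a placeholder):

```python
# Minimal sketch: start a local TensorBoard instance on the training output folder
# instead of `tensorboard dev upload`. The logdir path is a placeholder.
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "ljspeech-ddc-July-06-2021_09+10AM-8fbadad6"])
print("TensorBoard listening on", tb.launch())  # e.g. http://localhost:6006/
input("Press Enter to stop the server...")      # keep the process alive while browsing
```

The same thing can be done from the shell with `tensorboard --logdir <run_folder>`.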

erogol commented 3 years ago

@mbarnig very helpful!! Thanks for going under the hood.

So it looks like something went wrong around 0.0.14/0.0.15.

I'll check and try to find that little 🐛

mbarnig commented 3 years ago

To complete my report, I did some inference tests in version 0.1.2 with the other released English models. Here are my findings:

GlowTTS LJSpeech

```
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/GlowTTS-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/GlowTTS-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/glowtts-ljspeech_v0.1.2.wav
 > Using model: glow_tts
 > Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
 > Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
 > Phonemes: ð|ə| n|ɔːɹ|θ| w|ɪ|n|d| æ|n|d| ð|ə| s|ʌ|n| w|ɜː| d|ɪ|s|p|j|uː|ɾ|ɪ|ŋ| w|ɪ|tʃ| w|ʌ|z| ð|ə| s|t|ɹ|ɔ|ŋ|ɡ|ɚ| ,| w|ɛ|n| ɐ| t|ɹ|æ|v|ə|l|ɚ| k|eɪ|m| ɐ|l|ɔ|ŋ| ɹ|æ|p|t| ɪ|n| ɐ| w|ɔːɹ|m| k|l|oʊ|k| .
 > Processing time: 3.61332106590271
 > Real-time factor: 0.4497072242343694
 > Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/glowtts-ljspeech_v0.1.2.wav
```

[signal figure: GlowTTS v0.1.2]

https://user-images.githubusercontent.com/1360633/126630757-181621f0-ecf3-4954-b56c-471993145ea0.mp4

Tacotron2-DCA LJSpeech

I changed the stats_path in the config file to match my environment.

```
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/Tacotron2-DCA-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/Tacotron2-DCA-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/tacotron2-dca-ljspeech_v0.1.2.wav
 > Using model: Tacotron2
 > Text: The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
 > Text splitted to sentences.
['The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.']
 > Phonemes: ð|ə| n|ɔːɹ|θ| w|ɪ|n|d| æ|n|d| ð|ə| s|ʌ|n| w|ɜː| d|ɪ|s|p|j|uː|ɾ|ɪ|ŋ| w|ɪ|tʃ| w|ʌ|z| ð|ə| s|t|ɹ|ɔ|ŋ|ɡ|ɚ| ,| w|ɛ|n| ɐ| t|ɹ|æ|v|ə|l|ɚ| k|eɪ|m| ɐ|l|ɔ|ŋ| ɹ|æ|p|t| ɪ|n| ɐ| w|ɔːɹ|m| k|l|oʊ|k| .
 > Processing time: 2.646209239959717
 > Real-time factor: 0.384968553659821
 > Saving output to /home/mbarnig/myTTS-Project/audio/multilingual/tacotron2-dca-ljspeech_v0.1.2.wav
```

[signal figure: Tacotron2-DCA v0.1.2]

https://user-images.githubusercontent.com/1360633/126630824-6d180ef0-a616-43b6-8464-d17e952fae5e.mp4

I think this audio also has some problems, but I was not able to compare it with the released version.

SpeedySpeech LJSpeech

I changed the stats_path in the config file to match my environment. The inference script

```
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/SpeedySpeech-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/SpeedySpeech-release/config.json \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/speedyspeech-ljspeech_v0.1.2.wav
 > Using model: speedy_speech
Traceback (most recent call last):
  File "/home/mbarnig/recipe/bin/tts", line 33, in <module>
    sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
  File "/home/mbarnig/recipe/TTS/TTS/bin/synthesize.py", line 226, in main
    synthesizer = Synthesizer(
  File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 73, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 136, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/home/mbarnig/recipe/TTS/TTS/tts/models/speedy_speech.py", line 310, in load_checkpoint
    self.load_state_dict(state["model"])
  File "/home/mbarnig/recipe/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for SpeedySpeech:
    Missing key(s) in state_dict: "decoder.decoder.res_conv_block.res_blocks.0.conv_bn_blocks.0.conv1d.weight", "decoder.decoder.res_conv_block.res_blocks.0.conv_bn_blocks.0.conv1d.bias", "decoder.decoder.res_conv_block.res_blocks.0.conv_bn_blocks.0.norm.weight",

..........................

"decoder.decoder.postnet.4.weight", "decoder.decoder.postnet.4.bias", "decoder.decoder.postnet.6.weight", "decoder.decoder.postnet.6.bias", "decoder.decoder.postnet.0.weight", "decoder.decoder.postnet.0.bias".

fails with a Missing key(s) in state_dict RuntimeError.

SC-GlowTTS VCTK

I changed the stats_path in the config file to match my environment and added the speaker_idx to the script. The inference

```
(recipe) mbarnig@mbarnig-MS-7B22:~/recipe/TTS$ tts --text "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." \
> --model_path /home/mbarnig/myTTS-Project/SC-GlowTTS-VCTK-release/model_file.pth.tar \
> --config_path /home/mbarnig/myTTS-Project/SC-GlowTTS-VCTK-release/config.json \
> --speaker_idx p225 \
> --out_path /home/mbarnig/myTTS-Project/audio/multilingual/sc-glowtts-vctk_v0.1.2.wav
 > Using model: glow_tts
 > Training with 0 speakers: 
Traceback (most recent call last):
  File "/home/mbarnig/recipe/bin/tts", line 33, in <module>
    sys.exit(load_entry_point('TTS', 'console_scripts', 'tts')())
  File "/home/mbarnig/recipe/TTS/TTS/bin/synthesize.py", line 226, in main
    synthesizer = Synthesizer(
  File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 73, in __init__
    self._load_tts(tts_checkpoint, tts_config_path, use_cuda)
  File "/home/mbarnig/recipe/TTS/TTS/utils/synthesizer.py", line 136, in _load_tts
    self.tts_model.load_checkpoint(self.tts_config, tts_checkpoint, eval=True)
  File "/home/mbarnig/recipe/TTS/TTS/tts/models/glow_tts.py", line 386, in load_checkpoint
    self.load_state_dict(state["model"])
  File "/home/mbarnig/recipe/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GlowTTS:
    Missing key(s) in state_dict: "emb_g.weight". 
    Unexpected key(s) in state_dict: "decoder.flows.2.wn.cond_layer.bias", "decoder.flows.2.wn.cond_layer.weight_g", "decoder.flows.2.wn.cond_layer.weight_v", "decoder.flows.5.wn.cond_layer.bias", "decoder.flows.5.wn.cond_layer.weight_g", "decoder.flows.5.wn.cond_layer.weight_v", "decoder.flows.8.wn.cond_layer.bias", "decoder.flows.8.wn.cond_layer.weight_g", "decoder.flows.8.wn.cond_layer.weight_v", "decoder.flows.11.wn.cond_layer.bias", "decoder.flows.11.wn.cond_layer.weight_g", "decoder.flows.11.wn.cond_layer.weight_v", "decoder.flows.14.wn.cond_layer.bias", "decoder.flows.14.wn.cond_layer.weight_g", "decoder.flows.14.wn.cond_layer.weight_v", "decoder.flows.17.wn.cond_layer.bias", "decoder.flows.17.wn.cond_layer.weight_g", "decoder.flows.17.wn.cond_layer.weight_v", "decoder.flows.20.wn.cond_layer.bias", "decoder.flows.20.wn.cond_layer.weight_g", "decoder.flows.20.wn.cond_layer.weight_v", "decoder.flows.23.wn.cond_layer.bias", "decoder.flows.23.wn.cond_layer.weight_g", "decoder.flows.23.wn.cond_layer.weight_v", "decoder.flows.26.wn.cond_layer.bias", "decoder.flows.26.wn.cond_layer.weight_g", "decoder.flows.26.wn.cond_layer.weight_v", "decoder.flows.29.wn.cond_layer.bias", "decoder.flows.29.wn.cond_layer.weight_g", "decoder.flows.29.wn.cond_layer.weight_v", "decoder.flows.32.wn.cond_layer.bias", "decoder.flows.32.wn.cond_layer.weight_g", "decoder.flows.32.wn.cond_layer.weight_v", "decoder.flows.35.wn.cond_layer.bias", "decoder.flows.35.wn.cond_layer.weight_g", "decoder.flows.35.wn.cond_layer.weight_v". 
    size mismatch for encoder.duration_predictor.conv_1.weight: copying a param with shape torch.Size([256, 448, 3]) from checkpoint, the shape in current model is torch.Size([256, 192, 3]).
```

fails with an Error(s) in loading state_dict RuntimeError.

Running SpeedySpeech and SC-GlowTTS-VCTK in earlier versions also fails, but with other errors.

BillyBobQuebec commented 3 years ago

> Yes, it should show all of these.
>
> Why don't you just run tensorboard locally? Maybe uploading breaks things.

I found the alignment images; this is how it looks at 180K steps:

[alignment image at 180K steps]

This is also the test audio I found at 187K steps:

https://user-images.githubusercontent.com/74849975/126702399-b15fe92a-3ec4-4285-b03e-b4c06b7b59da.mp4

Definitely very different audio and finally understandable!

erogol commented 3 years ago

The alignment looks good enough. I guess the issue you are experiencing is a bug in the stopnet (which decides when the model should stop decoding). I am working on it and will release the fix soon. Until then, just keep training the model; after the release, you can continue training with the new version and fix the stopnet.
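
For anyone picking this up later, continuing the run on a newer version is roughly a matter of pointing the training entry point at the existing checkpoint. A sketch, where the script name and paths are placeholders for whatever the recipe's run.sh invokes, and `--restore_path` is assumed to be the trainer argument for warm-starting from a checkpoint (check `--help` on your version):

```python
# Sketch: resume an existing run after upgrading TTS.
# "train_tacotron_ddc.py" and the checkpoint path are placeholders for the
# script your recipe's run.sh actually calls; --restore_path is an assumed
# trainer argument -- verify it with --help on your installed version.
import subprocess

subprocess.run([
    "python", "recipes/ljspeech/tacotron2-DDC/train_tacotron_ddc.py",
    "--restore_path",
    "ljspeech-ddc-July-06-2021_09+10AM-8fbadad6/checkpoint_280000.pth.tar",
], check=True)
```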

BillyBobQuebec commented 3 years ago

@erogol Thank you for pushing the new update! I'm currently training with the same config used at the beginning of this issue, except now on your new v0.1.3, to see if the stopnet was the problem and whether I can finally replicate the inference quality of the current pretrained voice.

BillyBobQuebec commented 3 years ago

Hello @erogol !

So I have made a few interesting observations lately regarding this issue:

When I run inference with the model I trained using your pretrained config plus the stopnet fix, I definitely get significantly better results than with the one I trained before the stopnet fix, and the difference is clear regardless of the vocoder used. That being said...

When running inference on this better DDC model with the pretrained HiFi-GAN as the vocoder, the quality is significantly worse than with both multi-band MelGAN and Griffin-Lim, even though in my experience it should be significantly better than Griffin-Lim and at least similar to multi-band MelGAN.
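
For reference, a rough sketch of how this kind of A/B vocoder comparison can be scripted through the `tts` CLI (all paths are placeholders, and the vocoder flags should be double-checked against `tts --help` for the installed version):

```python
# Rough sketch of an A/B vocoder comparison driven through the `tts` CLI.
# All paths are placeholders; --vocoder_path / --vocoder_config_path are assumed
# to be available in this TTS version -- check `tts --help` if in doubt.
import subprocess

base = [
    "tts",
    "--text", "The North Wind and the Sun were disputing which was the stronger.",
    "--model_path", "checkpoint_80000.pth.tar",
    "--config_path", "config.json",
]

# No vocoder arguments: the synthesizer falls back to Griffin-Lim.
subprocess.run(base + ["--out_path", "out_griffin_lim.wav"], check=True)

# External neural vocoders (multi-band MelGAN, HiFi-GAN) from local release folders.
for name, folder in [("mb_melgan", "mb-melgan-release"), ("hifigan", "hifigan-release")]:
    subprocess.run(base + [
        "--vocoder_path", f"{folder}/model_file.pth.tar",
        "--vocoder_config_path", f"{folder}/config.json",
        "--out_path", f"out_{name}.wav",
    ], check=True)
```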

Here is audio of my model synthesized with Griffin-Lim, multi-band MelGAN, and finally HiFi-GAN.

https://user-images.githubusercontent.com/74849975/128803948-dd7caeb0-8235-4369-a47c-e434c2cbe453.mp4

https://user-images.githubusercontent.com/74849975/128803954-32696f70-20e4-4c07-8565-4eb9e32b84fa.mp4

Loudness warning

https://user-images.githubusercontent.com/74849975/128803957-45e4ad99-204c-4fb5-b620-99a8b1a37eb8.mp4

On the contrary, the pretrained DDC model that ships with Coqui works perfectly well with HiFi-GAN but sounds significantly worse when run with multi-band MelGAN; the opposite trend compared to the DDC model I trained with the supposedly identical config. Funnily enough, the high-pitched artifacts and the character of the audio are similar to what happens when I use my own DDC checkpoint with HiFi-GAN. Here is audio of that poor inference quality occurring with the Coqui pretrained DDC + multi-band MelGAN:

https://user-images.githubusercontent.com/74849975/128804559-9278b90d-4f43-41e1-b0ce-de7235a10538.mp4

So overall, the stopnet fix definitely helped, but unfortunately there is something more going on here that is preventing my Tacotron2-DDC model from reaching the quality that Tacotron2-DDC + HiFi-GAN is expected to offer.

erogol commented 3 years ago

Thanks for the update.

The vocoder model should match the audio parameters of the TTS model. Have you checked?
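
One quick way to check is to diff the `audio` sections of the two configs; a minimal sketch (file names are placeholders):

```python
# Minimal sketch: compare the "audio" sections of the tts and vocoder configs.
# File names are placeholders.
import json

with open("tacotron2-DDC.json") as f:
    tts_audio = json.load(f)["audio"]
with open("hifigan_config.json") as f:
    voc_audio = json.load(f)["audio"]

for key in sorted(set(tts_audio) | set(voc_audio)):
    if tts_audio.get(key) != voc_audio.get(key):
        print(f"MISMATCH {key}: tts={tts_audio.get(key)!r} vocoder={voc_audio.get(key)!r}")
```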

BillyBobQuebec commented 3 years ago

Ah, it seems like that may be the problem. It looks like I was using a slightly different config, and I also noticed some other things that might have affected training. I'm going to train from scratch again and make sure everything is correct; I'll post an update on how inference sounds with HiFi-GAN within the next few days.

BillyBobQuebec commented 3 years ago

Ok, this time I double-checked that all audio parameters were the same across both HiFi-GAN and Tacotron2-DDC, made sure I was using the new stopnet code, and correctly computed scale_stats.npy, which I had some mishaps with before. With all of this I trained a model with the config to about 80K steps and tried inference with HiFi-GAN again. Unfortunately, that screeching high-pitched artifact is still strongly present and does not seem to have improved.
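
For anyone retracing these steps, here is a rough sketch of recomputing the scale stats and pointing the config at the new file (paths are placeholders; confirm the `TTS/bin/compute_statistics.py` flags with `--help` on your version):

```python
# Sketch: recompute mel scaling statistics and point the config at the new file.
# Paths are placeholders; confirm the compute_statistics.py flags with --help
# on your installed version before relying on them.
import json
import subprocess

subprocess.run([
    "python", "TTS/bin/compute_statistics.py",
    "--config_path", "tacotron2-DDC.json",
    "--out_path", "scale_stats.npy",
], check=True)

with open("tacotron2-DDC.json") as f:
    cfg = json.load(f)
cfg["audio"]["stats_path"] = "scale_stats.npy"  # the vocoder config must use the same stats
with open("tacotron2-DDC.json", "w") as f:
    json.dump(cfg, f, indent=4)
```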

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.