coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] - unable to properly synthesize with Tacotron2-DDC with max_decoder_steps #1788

Closed FrischJulien closed 2 years ago

FrischJulien commented 2 years ago

Describe the bug

After training Tacotron2-DDC for about 140k iterations (batch size 30), I am unable to properly synthesize some speech, despite having some very decent audio samples under Eval Audio and Train Audio in TensorBoard. Whatever value I use for max_decoder_steps, I always reach the limit during inference and get synthesized speech with barely a second of properly decoded audio. See the two examples (with max_decoder_steps set to 500 and 10000) below. exemple_max_decoder_steps.zip

To Reproduce

  1. Train Tacotron2-DDC for about 140k steps with the config attached

  2. Run inference through the code below:

```python
model_path = "/home/ec2-user/SageMaker/TTS/run-July-22-2022_06+26PM-c44e39d9/checkpoint_140000.pth"
output_directory = "/home/ec2-user/SageMaker/testouille/TTS_22khz_espeak_al100/"
config_path = "/home/ec2-user/SageMaker/TTS/run-July-22-2022_06+26PM-c44e39d9/config.json"
speaker = "bernard"
output_path = output_directory + "taco_22khz_espeak_al200_80k" + speaker + ".wav"

!cd ./TTS && python3 ./TTS/bin/synthesize.py \
    --text "Les sanglots longs des violons de l'automne, blessent mon coeur d'une langueur monotone." \
    --out_path $output_path \
    --model_path $model_path \
    --config_path $config_path \
    --speaker_idx $speaker \
    --use_cuda true
```

config.txt

Expected behavior

The same level of audio quality that was displayed under Eval Audio and Train Audio in TensorBoard.

Logs

<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
 > Using model: tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:23
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Init speaker_embedding layer.
 > Model's reduction rate `r` is set to: 2
 > Using Griffin-Lim as no vocoder model defined
 > Text: Les sanglots longs des violons de l'automne, blessent mon coeur d'une langueur monotone.
 > Text splitted to sentences.
["Les sanglots longs des violons de l'automne, blessent mon coeur d'une langueur monotone."]
le- sɑ̃ɡlˈo lˈɔ̃ de- vjɔlˈɔ̃ də- lotˈɔn, blˈɛs mɔ̃ kˈœʁ dyn lɑ̃ɡˈœʁ mɔnɔtˈɔn.
 [!] Character '̃' not found in the vocabulary. Discarding it.
   > Decoder stopped with `max_decoder_steps` 500
 > Processing time: 4.731579542160034
 > Real-time factor: 0.39260088244561964
 > Saving output to /home/ec2-user/SageMaker/testouille/TTS_22khz_espeak_al100/taco_22khz_espeak_al100_short_140kbernard.wav
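As a sanity check on the numbers in the log above: with the printed hop_length of 256, the sample rate of 22050, and the reduction rate r = 2, a max_decoder_steps cap translates into an upper bound on synthesized audio length. A minimal sketch of that arithmetic (assuming, as in Tacotron2, that each decoder step emits r mel frames, each advancing hop_length samples):

```python
# Rough upper bound on audio duration implied by max_decoder_steps,
# using values printed in the log above (hop_length, sample_rate, r).
def max_audio_seconds(max_decoder_steps: int, r: int = 2,
                      hop_length: int = 256, sample_rate: int = 22050) -> float:
    """Each decoder step emits r mel frames; each frame advances hop_length samples."""
    frames = max_decoder_steps * r
    return frames * hop_length / sample_rate

print(max_audio_seconds(500))    # ~11.6 s available at 500 steps
print(max_audio_seconds(10000))  # ~232 s available at 10000 steps
```

Even the 500-step run allows roughly 11 seconds of output, far more than this single sentence needs, so hitting the cap at any setting points to the decoder never emitting a stop signal rather than the cap itself being too low.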

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4",
            "Tesla T4",
            "Tesla T4",
            "Tesla T4"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.0+cu102",
        "TTS": "0.6.2",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.12",
        "version": "#1 SMP Wed Mar 2 19:14:12 UTC 2022"
    }
}

Additional context

I am using distributed training.

erogol commented 2 years ago

Where do you change max_decoder_steps?

FrischJulien commented 2 years ago

In the config.json file, after training and before inference (see the attached config.txt).
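For reference, changing the value after training means editing the exported config.json that synthesize.py reads. A minimal stdlib sketch (the key name max_decoder_steps is the one discussed in this thread; the helper and file layout are otherwise illustrative):

```python
import json

def set_max_decoder_steps(config_path: str, steps: int) -> None:
    """Load an exported config.json, raise max_decoder_steps, and save it back."""
    with open(config_path) as f:
        config = json.load(f)
    config["max_decoder_steps"] = steps  # read by the decoder at inference time
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
```

Editing the file in place like this is what "in the config.json file after training" amounts to; nothing needs to change in the checkpoint itself.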

FrischJulien commented 2 years ago

@erogol did you get a chance to look into it, or has anyone else faced the same issue?

xettrisomeman commented 2 years ago

Same error here, but with Tacotron2-DCA. Even if I change max_decoder_steps to 20k, it still shows the error. tacotron_train.py

FrischJulien commented 2 years ago

@xettrisomeman How long did you train your model?

xettrisomeman commented 2 years ago

I get the error while training; I have not run inference yet.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check our discussion channels.

Marcophono2 commented 1 year ago

Has this issue been fixed in the meantime? I cannot run my training for longer than one or two hours; then this problem comes up when the test sentences are (unsuccessfully) generated.