coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] [v0.6.1] "Kernel size" error when using model "tts_models/zh-CN/baker/tacotron2-DDC-GST" #1398

Closed: fijipants closed this issue 2 years ago

fijipants commented 2 years ago

πŸ› Description

Full log:

```
$ tts --model_name "tts_models/zh-CN/baker/tacotron2-DDC-GST" --text "hello" --out_path "test.wav"
 > Downloading model to /home/fijipants/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST
 > Using model: tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/fijipants/.local/share/tts/tts_models--zh-CN--baker--tacotron2-DDC-GST/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 2
 > Using Griffin-Lim as no vocoder model defined
 > Text: hello
 > Text splitted to sentences.
['hello']
Traceback (most recent call last):
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 287, in main
    wav = synthesizer.tts(args.text, args.speaker_idx, args.language_idx, args.speaker_wav)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 260, in tts
    d_vector=speaker_embedding,
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 184, in synthesis
    outputs = run_model_torch(model, text_inputs, speaker_id, style_mel, d_vector=d_vector, language_id=language_id)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 56, in run_model_torch
    "language_ids": language_id,
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 214, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
    o = layer(o)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
    o = self.convolution1d(x)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 302, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/fijipants/miniconda3/envs/coqui-pip-0.6.1/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 299, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
```

To Reproduce

tts --model_name "tts_models/zh-CN/baker/tacotron2-DDC-GST" --text "hello" --out_path "test.wav"

Expected behavior

The command synthesizes speech and writes test.wav without raising an error.

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu102",
        "TTS": "0.6.1",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.11",
        "version": "#202202230823 SMP PREEMPT Wed Feb 23 14:53:24 UTC 2022"
    }
}

Additional context

Might be related to #1381

WeberJulian commented 2 years ago

Thanks for the report, I'm working on a fix. In the meantime you can set `"phonemizer": "zh_cn_phonemizer"` in the model's config file, and it should work.
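For reference, the workaround amounts to a one-line setting in the downloaded model's config.json (at the path shown in the log above); a minimal sketch of the relevant fragment, with the surrounding keys omitted:

```json
{
    "phonemizer": "zh_cn_phonemizer"
}
```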

tts --model_name "tts_models/zh-CN/baker/tacotron2-DDC-GST" --text "你好世界。" --out_path "test.wav" works with that setting on my machine.

https://drive.google.com/file/d/1K1cVww-zzQDHTT1vunz_Mj9t02FNipTB/view?usp=sharing

WeberJulian commented 2 years ago

Here is the PR: https://github.com/coqui-ai/TTS/pull/1399

erogol commented 2 years ago

I'm closing this; reopen if the issue still exists.