coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0
34.85k stars 4.24k forks source link

[Bug] speedy_speech doesn't work #2113

Closed ikm565 closed 1 year ago

ikm565 commented 1 year ago

Describe the bug

Running `tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/speedy-speech" --out_path output.wav` fails with: `RuntimeError: Calculated padded input size per channel: (11). Kernel size: (13). Kernel size can't be greater than actual input size`

To Reproduce

Run `tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/speedy-speech" --out_path output.wav` and observe the error above.

Expected behavior

No response

Logs

tts --text "Text for TTS" --model_name "tts_models/en/ljspeech/speedy-speech" --out_path output.wav

 > tts_models/en/ljspeech/speedy-speech is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.

 > Using model: speedy_speech
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Vocoder Model: hifigan
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: Text for TTS
 > Text splitted to sentences.
['Text for TTS']
Traceback (most recent call last):
  File "/public/liuchang/software/anaconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/bin/synthesize.py", line 350, in main
    wav = synthesizer.tts(
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/utils/synthesizer.py", line 270, in tts
    outputs = synthesis(
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/utils/synthesis.py", line 207, in synthesis
    outputs = run_model_torch(
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
    outputs = _func(
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/models/forward_tts.py", line 596, in inference
    o_en, x_mask, g, _ = self._forward_encoder(x, x_mask, g)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/models/forward_tts.py", line 363, in _forward_encoder
    o_en = self.encoder(torch.transpose(x_emb, 1, -1), x_mask)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/layers/feed_forward/encoder.py", line 161, in forward
    o = self.encoder(x, x_mask)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/layers/feed_forward/encoder.py", line 71, in forward
    o = self.res_conv_block(o, x_mask)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/layers/generic/res_conv_bn.py", line 124, in forward
    o = block(o)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/layers/generic/res_conv_bn.py", line 79, in forward
    return self.conv_bn_blocks(x)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/experiment/voice-clone/TTS-dev/TTS/tts/layers/generic/res_conv_bn.py", line 42, in forward
    o = self.conv1d(x)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/public/liuchang/software/anaconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (11). Kernel size: (13). Kernel size can't be greater than actual input size
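The failing call at the bottom of the traceback can be reproduced standalone: `F.conv1d` with no padding requires the time dimension to be at least the kernel size, and the encoder's first residual conv block uses a kernel of 13. A minimal sketch (the channel count of 80 is taken from the `num_mels` setting above; the exact layer shapes in the model may differ):

```python
import torch
import torch.nn as nn

# A 1-D conv like the one in the encoder's res_conv_bn block: kernel 13, no padding.
conv = nn.Conv1d(in_channels=80, out_channels=80, kernel_size=13)

# 11 time steps, as reported in the error: shorter than the kernel.
x = torch.randn(1, 80, 11)

try:
    conv(x)
except RuntimeError as e:
    print(e)  # "Calculated padded input size per channel: (11). Kernel size: (13). ..."
```

With 13 or more time steps the same call succeeds, which is why longer input texts synthesize fine.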

Environment

All recent versions.

Additional context

No response

erogol commented 1 year ago

Your input is too short: shorter than the first layer's kernel size.

We can try to pad samples that are shorter than 13 chars after phoneme conversion.

@loganhart420 can you give it a look?
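The suggested fix, right-padding token sequences shorter than the first kernel size, could be sketched like this (a hypothetical helper, not the actual TTS codebase; `pad_to_min_length`, the pad ID of 0, and the minimum length of 13 are assumptions based on the error above):

```python
import torch
import torch.nn.functional as F

MIN_LEN = 13  # kernel size of the encoder's first conv layer

def pad_to_min_length(token_ids: torch.Tensor, pad_id: int = 0,
                      min_len: int = MIN_LEN) -> torch.Tensor:
    """Right-pad a (batch, time) tensor of token IDs so time >= min_len."""
    deficit = min_len - token_ids.size(-1)
    if deficit > 0:
        # F.pad pads the last dimension with (left, right) amounts.
        token_ids = F.pad(token_ids, (0, deficit), value=pad_id)
    return token_ids

short = torch.tensor([[5, 9, 2, 7]])   # 4 phoneme IDs, too short for kernel 13
padded = pad_to_min_length(short)
print(padded.shape)  # torch.Size([1, 13])
```

Padding would need to happen after phoneme conversion, since grapheme length and phoneme length differ.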

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

loganhart02 commented 1 year ago

> your input is too short. Shorter than the first layer's kernel size.
>
> We can try to pad samples that are shorter than 13 chars after phoneme conversion.
>
> @loganhart420 can you give it a look?

this got lost in my notifications. I'm checking on this now

erogol commented 1 year ago

@loganhart420 any updates?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.