coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] [v0.9.0] "Kernel size" error when using model "tts_models/fr/mai/tacotron2-DDC" #2167

Closed iwater closed 1 year ago

iwater commented 2 years ago

Describe the bug

tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

Running this command fails with `RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size` (full output and traceback in the Logs section below).

To Reproduce

tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

or

tts --text "chat" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

Expected behavior

No error; the command should synthesize speech and write it to the output wav file.

Logs

tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
 > tts_models/fr/mai/tacotron2-DDC is already downloaded.
 > vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: fullband_melgan
 > Setting up Audio Processor...
 | > sample_rate:24000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Generator Model: fullband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: autobus
 > Text splitted to sentences.
['autobus']
Traceback (most recent call last):
  File "/home/iwater/miniconda3/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/bin/synthesize.py", line 357, in main
    wav = synthesizer.tts(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 279, in tts
    outputs = synthesis(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 207, in synthesis
    outputs = run_model_torch(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
    outputs = _func(
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 307, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 303, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 1080 Ti"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.0+cu102",
        "TTS": "0.9.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.12",
        "version": "#58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022"
    }
}

Additional context

No response

p0p4k commented 2 years ago

Hello, can you try a longer input text and report back?

iwater commented 2 years ago

input: "Il n'y aura jamais trop de dindes ou de cornes d'abondance Γ  Thanksgiving." (with the trailing period). No error in stdout, but there is no speech in the output wav file:

sox --i speech.wav 

Input File     : 'speech.wav'
Channels       : 1
Sample Rate    : 24000
Precision      : 16-bit
Duration       : 00:00:00.49 = 11792 samples ~ 36.85 CDDA sectors
File Size      : 23.6k
Bit Rate       : 385k
Sample Encoding: 16-bit Signed Integer PCM

input: "chat." (with the trailing period). No error output, but no speech in the output wav file.

input: "chat" (no trailing punctuation). Same "Kernel size" error as before.

input: "Il n'y aura jamais trop de dindes ou de cornes d'abondance Γ  Thanksgiving" (no trailing punctuation). Same error as before.

Summary: without the ending punctuation, a "Kernel size" error occurs; with the ending punctuation, an empty wav file is produced.

p0p4k commented 2 years ago

@iwater I tried the input "Il n'y aura jamais trop de dindes ou de cornes d'abondance Γ  Thanksgiving" and got the following output: http://sndup.net/s8td Even the shorter word without the trailing punctuation works for me.

iwater commented 2 years ago

I retried with a clean env and still got the "Kernel size" error:

conda create -n tts python=3.9
conda activate tts
pip install TTS
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 1080 Ti"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.0+cu117",
        "TTS": "0.9.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.9.15",
        "version": "#58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022"
    }
}
p0p4k commented 2 years ago

Ok, I'll give it another try with a fresh install later today and get back to you.

p0p4k commented 1 year ago

I just tried on my system and it works fine. [screenshots of the successful run attached]

iwater commented 1 year ago

Python 3.7 gives the same result as Python 3.9:

conda create -n tts python=3.7
conda activate tts
pip install TTS
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav

[model loading and Audio Processor setup output identical to the log above, ending with "> Text: autobus" and "> Text splitted to sentences."]

['autobus']
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size

p0p4k commented 1 year ago

Hi, can you go to the downloaded models folder (probably in /home/iwater/.local/share/tts/), delete the models, and retry the command? Thanks.

iwater commented 1 year ago

I deleted the models from the cache and downloaded them again; same error:

$ tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
 > Downloading model to /home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 575M/575M [01:21<00:00, 7.04MiB/s]
 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Downloading model to /home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 109M/109M [00:13<00:00, 8.12MiB/s]
 > Model's license - MPL
 > Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: fullband_melgan
 > Setting up Audio Processor...
 | > sample_rate:24000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:/home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan/scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Generator Model: fullband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: autobus
 > Text splitted to sentences.
['autobus']
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
p0p4k commented 1 year ago

Hi, open /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py and add these debug prints at line 39. Make sure you are in the right forward method, and tell me the output.

    def forward(self, x):
        print(f' debug pad {self.convolution1d.padding}')
        print(f' debug x  {x.shape}' )
        print(f' debug weight {self.convolution1d.weight.shape}')
        print(f' debug kernel  {self.convolution1d.kernel_size}')
        o = self.convolution1d(x)
        o = self.batch_normalization(o)
        o = self.activation(o)
        o = self.dropout(o)
        return o
iwater commented 1 year ago
['autobus']
 debug pad (2,)
 debug x  torch.Size([1, 512, 0])
 debug weight torch.Size([512, 512, 5])
 debug kernel  (5,)
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
p0p4k commented 1 year ago

Hi, now add this in your /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py file, in the same inference method (around L239), and report the output. ('.....' means leave everything below as it is.)

    @torch.no_grad()
    def inference(self, text, aux_input=None):
        """Forward pass for inference with no Teacher-Forcing.

        Shapes:
           text: :math:`[B, T_in]`
           text_lengths: :math:`[B]`
        """
        print(f' debug text{text}')
        aux_input = self._format_aux_input(aux_input)
        print(f' debug aux_input {aux_input}')
        embedded_inputs = self.embedding(text).transpose(1, 2)
        print(f' debug embedded_inputs {embedded_inputs}')
        encoder_outputs = self.encoder.inference(embedded_inputs)

        if self.gst and self.use_gst:
        .....
iwater commented 1 year ago
['autobus']
 debug kernel  tensor([], size=(1, 0), dtype=torch.int64)
 debug kernel  {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
 debug kernel  tensor([], size=(1, 512, 0))
 debug pad (2,)
 debug x  torch.Size([1, 512, 0])
 debug weight torch.Size([512, 512, 5])
 debug kernel  (5,)
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
p0p4k commented 1 year ago

@iwater, so what we can see is that, for some reason, your input text is getting converted to a tensor of size 0. We need to check what is going on with your text input.
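
That zero-length tensor also explains the error message itself. A minimal sketch in plain PyTorch (not TTS code), using the same shape as the encoder layer from the debug output above (512 channels, kernel 5, padding 2):

import torch
import torch.nn as nn

# Conv1d configured like the Tacotron2 encoder layer seen in the debug prints
conv = nn.Conv1d(512, 512, kernel_size=5, padding=2)

# an empty token sequence: batch=1, channels=512, sequence length=0
x = torch.zeros(1, 512, 0)

# padded length is 0 + 2*2 = 4, which is smaller than the kernel size 5
conv(x)
# RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5).
# Kernel size can't be greater than actual input size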

p0p4k commented 1 year ago

In /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py, add the following print line and report back:

    # kick it
    print(f'debug text {args.text}')
    wav = synthesizer.tts(
        args.text,
        args.speaker_idx,
        args.language_idx,
        args.speaker_wav,
        reference_wav=args.reference_wav,
        style_wav=args.capacitron_style_wav,
        style_text=args.capacitron_style_text,
        reference_speaker_name=args.reference_speaker_idx,
iwater commented 1 year ago
 > Generator Model: fullband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: autobus
debug text autobus
 > Text splitted to sentences.
['autobus']
 debug kernel  tensor([], size=(1, 0), dtype=torch.int64)
 debug kernel  {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
 debug kernel  tensor([], size=(1, 512, 0))
 debug pad (2,)
 debug x  torch.Size([1, 512, 0])
 debug weight torch.Size([512, 512, 5])
 debug kernel  (5,)
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 366, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
p0p4k commented 1 year ago

In /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py, around L202, add this (before the # synthesize voice line):

    print(f'debug text input 1 {text_inputs}')
    text_inputs = numpy_to_torch(text_inputs, torch.long, cuda=use_cuda)
    print(f'debug text input 2 {text_inputs}')
    text_inputs = text_inputs.unsqueeze(0)
    print(f'debug text input 3 {text_inputs}')

    # synthesize voice
    outputs = run_model_torch(
iwater commented 1 year ago
> Text splitted to sentences.
['autobus']
debug text input 1 []
debug text input 2 tensor([], dtype=torch.int64)
debug text input 3 tensor([], size=(1, 0), dtype=torch.int64)
 debug kernel  tensor([], size=(1, 0), dtype=torch.int64)
 debug kernel  {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
 debug kernel  tensor([], size=(1, 512, 0))
 debug pad (2,)
 debug x  torch.Size([1, 512, 0])
 debug weight torch.Size([512, 512, 5])
 debug kernel  (5,)
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 366, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 217, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
p0p4k commented 1 year ago

What about the following lines in /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py? We are close to finding the source of the problem.

    # convert text to sequence of token IDs
    print(f'debug tokenizer {model.tokenizer.__dict__}')
    print(f'debug text into tokenizer {text}')
    text_inputs = np.asarray(
        model.tokenizer.text_to_ids(text, language=language_id),
        dtype=np.int32,
    )
    print(f'debug text out from tokenizer {text_inputs}')
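
If editing site-packages gets tedious, the same check can also be done interactively. A rough sketch, assuming the 0.9 Python API (ModelManager/Synthesizer; exact attribute names may differ between versions):

from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

manager = ModelManager()
model_path, config_path, _ = manager.download_model("tts_models/fr/mai/tacotron2-DDC")
synth = Synthesizer(tts_checkpoint=model_path, tts_config_path=config_path)

# the same call synthesis.py makes right before building the input tensor
ids = synth.tts_model.tokenizer.text_to_ids("autobus")
print(ids)  # an empty list means the text pipeline produced no tokens at all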
iwater commented 1 year ago
 > Generator Model: fullband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: autobus
debug text autobus
 > Text splitted to sentences.
['autobus']
debug tokenizer {'text_cleaner': <function phoneme_cleaners at 0x7fd7f34cfc20>, 'use_phonemes': True, 'add_blank': False, 'use_eos_bos': False, '_characters': <TTS.tts.utils.text.characters.IPAPhonemes object at 0x7fd7e6ae8110>, 'pad_id': 0, 'blank_id': None, 'not_found_characters': [], 'phonemizer': <TTS.tts.utils.text.phonemizers.gruut_wrapper.Gruut object at 0x7fd7fff0ce10>}
debug text into tokenizer autobus
debug text input 1 []
debug text input 2 tensor([], dtype=torch.int64)
debug text input 3 tensor([], size=(1, 0), dtype=torch.int64)
 debug kernel  tensor([], size=(1, 0), dtype=torch.int64)
 debug kernel  {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
 debug kernel  tensor([], size=(1, 512, 0))
 debug pad (2,)
 debug x  torch.Size([1, 512, 0])
 debug weight torch.Size([512, 512, 5])
 debug kernel  (5,)
Traceback (most recent call last):
  File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 366, in main
    reference_speaker_name=args.reference_speaker_idx,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 219, in synthesis
    language_id=language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
    "language_ids": language_id,
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
    o = layer(o)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
    o = self.convolution1d(x)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
p0p4k commented 1 year ago

What about the text coming out of the tokenizer? (the last debug print in the previous comment)

p0p4k commented 1 year ago

So, I will assume that the text_ids (text -> token IDs) coming out of the tokenizer are corrupted (size 0). If you look at this file (/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/text/tokenizer.py), you can see the steps the text goes through to get converted into tokens. We can modify the lines as follows to debug again:

        if self.text_cleaner is not None:
            text = self.text_cleaner(text)
            print(f"debug text-inside-tokenizer 1 {text}")
        if self.use_phonemes:
            text = self.phonemizer.phonemize(text, separator="")
            print(f"debug text-inside-tokenizer 2 {text}")
        if self.add_blank:
            text = self.intersperse_blank_char(text, True)
            print(f"debug text-inside-tokenizer 3 {text}")
        if self.use_eos_bos:
            text = self.pad_with_bos_eos(text)
            print(f"debug text-inside-tokenizer 4 {text}")
        return self.encode(text)

At the same time, we can also check if there is an issue in the encode function:

    def encode(self, text: str) -> List[int]:
        """Encodes a string of text as a sequence of IDs."""
        token_ids = []
        for char in text:
            try:
                idx = self.characters.char_to_id(char)
                print(f'debug token_idx_encode {idx}')
                token_ids.append(idx)
            except KeyError:
                # discard but store not found characters
                if char not in self.not_found_characters:
                    self.not_found_characters.append(char)
                    print(text)
                    print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
        return token_ids
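
Note that nothing in this chain raises an error on empty input. A tiny self-contained illustration (a simplified stand-in for encode, with a made-up char_to_id mapping) of how an empty or unrecognized phonemizer result silently turns into the size-0 tensor seen earlier:

def encode(text, char_to_id):
    # simplified: unknown characters are discarded, empty text yields []
    token_ids = []
    for char in text:
        if char in char_to_id:
            token_ids.append(char_to_id[char])
    return token_ids

print(encode("", {"a": 1, "b": 2}))     # -> []  (phonemizer returned nothing)
print(encode("xyz", {"a": 1, "b": 2}))  # -> []  (all characters unknown)

Either way the caller gets an empty list, and the first visible failure is the encoder convolution.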
p0p4k commented 1 year ago

On further inspection, it is quite possible that the phonemizer is broken in your install. If that's the case, can you try uninstalling gruut and the related packages with pip and re-installing them? The attached screenshot shows the gruut packages I am using; find the corresponding packages (probably on PyPI) and install the right versions. [screenshot of installed gruut packages]
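
One way to test the phonemizer in isolation is to call gruut directly. A sketch assuming gruut 2.x's sentences() API; if the French language data is missing, this either raises an error or yields no phonemes (the exact behavior depends on the gruut version):

import gruut

for sent in gruut.sentences("autobus", lang="fr"):
    for word in sent:
        # with a working French install this prints IPA phonemes for the word
        print(word.text, word.phonemes)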

iwater commented 1 year ago

Yes, after installing gruut-lang-fr, everything works fine. Thanks!

p0p4k commented 1 year ago

HAHAHHAHAHHAHHAHAA

p0p4k commented 1 year ago

@erogol can close.

iwater commented 1 year ago

BTW, pip install TTS only installs these gruut packages:

gruut                    2.2.3
gruut-ipa                0.13.0
gruut-lang-de            2.0.0
gruut-lang-en            2.0.0
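
So the French data is not pulled in by pip install TTS and has to be installed separately (gruut-lang-fr, as above). A quick sanity check after installing it, assuming gruut exposes is_language_supported (the check TTS's gruut wrapper relies on):

import gruut

# should print True once gruut-lang-fr is installed
print(gruut.is_language_supported("fr"))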