Closed (iwater closed this issue 1 year ago)
Hello, can you try a longer input text and report back?
input: Il n’y aura jamais trop de dindes ou de cornes d’abondance à Thanksgiving. No error in stdout, but no words in the output wav file.
sox --i speech.wav
Input File : 'speech.wav'
Channels : 1
Sample Rate : 24000
Precision : 16-bit
Duration : 00:00:00.49 = 11792 samples ~ 36.85 CDDA sectors
File Size : 23.6k
Bit Rate : 385k
Sample Encoding: 16-bit Signed Integer PCM
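For anyone without sox, the duration reported above can be cross-checked with Python's standard `wave` module. This is only a sketch: `wav_duration_seconds` is a hypothetical helper name, and it assumes the output is a plain PCM WAV file, as the sox report suggests.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Same arithmetic with the numbers from the sox report above.
num_samples = 11792
sample_rate = 24000  # Hz
print(f"{num_samples / sample_rate:.2f} s")  # roughly half a second
```

Calling `wav_duration_seconds("speech.wav")` on the generated file should give about the same value. Half a second is far too short for the full sentence, which is consistent with "no words" in the output.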
input: "chat." No error output, but no words in the output wav file.
input: "chat" Error as before.
input: "Il n’y aura jamais trop de dindes ou de cornes d’abondance à Thanksgiving" Error as before.
Summary: without ending punctuation, a "Kernel size" error occurs; with ending punctuation, a wav file containing no words is produced.
@iwater I tried the "Il n’y aura jamais trop de dindes ou de cornes d’abondance à Thanksgiving" input and got the following output.
http://sndup.net/s8td
Even the smaller word with no "
I retried with a clean env and also got the "Kernel size" error.
conda create -n tts python=3.9
conda activate tts
pip install TTS
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
{
"CUDA": {
"GPU": [
"NVIDIA GeForce GTX 1080 Ti"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.13.0+cu117",
"TTS": "0.9.0",
"numpy": "1.21.6"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.9.15",
"version": "#58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022"
}
}
Ok, I'll give it another try with a fresh install later today and get back to you.
I just tried on my system and it works fine.
Python 3.7 gives the same result as Python 3.9.
conda create -n tts python=3.7
conda activate tts
pip install TTS
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
> tts_models/fr/mai/tacotron2-DDC is already downloaded.
> vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
> Using model: Tacotron2
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Model's reduction rate `r` is set to: 1
> Vocoder Model: fullband_melgan
> Setting up Audio Processor...
| > sample_rate:24000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:0
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Generator Model: fullband_melgan_generator
> Discriminator Model: melgan_multiscale_discriminator
> Text: autobus
> Text splitted to sentences.
['autobus']
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
Hi, can you go to the downloaded models folder (probably in /home/iwater/.local/share/tts/), delete the models, and retry the command? Thanks.
Deleted the models from the cache and downloaded them again; same error.
$ tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
> Downloading model to /home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC
100%|██████████| 575M/575M [01:21<00:00, 7.04MiB/s]
> Model's license - MPL
> Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
> Downloading model to /home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan
100%|██████████| 109M/109M [00:13<00:00, 8.12MiB/s]
> Model's license - MPL
> Check https://www.mozilla.org/en-US/MPL/2.0/ for more info.
> Using model: Tacotron2
> Setting up Audio Processor...
| > sample_rate:16000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/iwater/.local/share/tts/tts_models--fr--mai--tacotron2-DDC/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Model's reduction rate `r` is set to: 1
> Vocoder Model: fullband_melgan
> Setting up Audio Processor...
| > sample_rate:24000
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:0
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > pitch_fmin:0.0
| > pitch_fmax:640.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > do_rms_norm:False
| > db_level:None
| > stats_path:/home/iwater/.local/share/tts/vocoder_models--universal--libri-tts--fullband-melgan/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
> Generator Model: fullband_melgan_generator
> Discriminator Model: melgan_multiscale_discriminator
> Text: autobus
> Text splitted to sentences.
['autobus']
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 108, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 40, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
Hi, open /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py and add the print lines below at line 39. Make sure you are in the right forward method, and tell me the output.
def forward(self, x):
    print(f' debug pad {self.convolution1d.padding}')
    print(f' debug x {x.shape}')
    print(f' debug weight {self.convolution1d.weight.shape}')
    print(f' debug kernel {self.convolution1d.kernel_size}')
    o = self.convolution1d(x)
    o = self.batch_normalization(o)
    o = self.activation(o)
    o = self.dropout(o)
    return o
['autobus']
debug pad (2,)
debug x torch.Size([1, 512, 0])
debug weight torch.Size([512, 512, 5])
debug kernel (5,)
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 249, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
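The numbers in the error message follow directly from the debug output: the convolution pads the input on both sides, so a zero-length input becomes length 4, which is still shorter than the kernel of size 5. A minimal sketch of that arithmetic (the helper name `padded_length` is made up for illustration and is not part of PyTorch or TTS):

```python
def padded_length(input_length: int, padding: int) -> int:
    """Length a 1-D convolution sees after zero-padding both ends."""
    return input_length + 2 * padding


# Values taken from the debug output above.
input_length = 0  # last dimension of x: torch.Size([1, 512, 0])
padding = 2       # debug pad (2,)
kernel_size = 5   # debug kernel (5,)

seen = padded_length(input_length, padding)
print(seen, kernel_size, seen < kernel_size)  # 4 5 True
```

So the convolution layer itself is behaving as configured; the real problem is that it receives an input with zero time steps.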
Hi, now in your /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py file, add the print lines below. Make sure you are in the same function (around L239), and report the output. ('.....' means leave everything below as it is.)
@torch.no_grad()
def inference(self, text, aux_input=None):
    """Forward pass for inference with no Teacher-Forcing.
    Shapes:
        text: :math:`[B, T_in]`
        text_lengths: :math:`[B]`
    """
    print(f' debug text{text}')
    aux_input = self._format_aux_input(aux_input)
    print(f' debug aux_input {aux_input}')
    embedded_inputs = self.embedding(text).transpose(1, 2)
    print(f' debug embedded_inputs {embedded_inputs}')
    encoder_outputs = self.encoder.inference(embedded_inputs)
    if self.gst and self.use_gst:
    .....
['autobus']
debug kernel tensor([], size=(1, 0), dtype=torch.int64)
debug kernel {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
debug kernel tensor([], size=(1, 512, 0))
debug pad (2,)
debug x torch.Size([1, 512, 0])
debug weight torch.Size([512, 512, 5])
debug kernel (5,)
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 365, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
@iwater, what we can see is that, for some reason, your input text is being converted to a tensor of size 0. We need to check what is going on with your text input.
In /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py, add the following print line and report back:
# kick it
print(f'debug text {args.text}')
wav = synthesizer.tts(
    args.text,
    args.speaker_idx,
    args.language_idx,
    args.speaker_wav,
    reference_wav=args.reference_wav,
    style_wav=args.capacitron_style_wav,
    style_text=args.capacitron_style_text,
    reference_speaker_name=args.reference_speaker_idx,
)
> Generator Model: fullband_melgan_generator
> Discriminator Model: melgan_multiscale_discriminator
> Text: autobus
debug text autobus
> Text splitted to sentences.
['autobus']
debug kernel tensor([], size=(1, 0), dtype=torch.int64)
debug kernel {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
debug kernel tensor([], size=(1, 512, 0))
debug pad (2,)
debug x torch.Size([1, 512, 0])
debug weight torch.Size([512, 512, 5])
debug kernel (5,)
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 366, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 214, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
In /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py, around L202, add this (before the "# synthesize voice" line):
print(f'debug text input 1 {text_inputs}')
text_inputs = numpy_to_torch(text_inputs, torch.long, cuda=use_cuda)
print(f'debug text input 2 {text_inputs}')
text_inputs = text_inputs.unsqueeze(0)
print(f'debug text input 3 {text_inputs}')
# synthesize voice
outputs = run_model_torch(
> Text splitted to sentences.
['autobus']
debug text input 1 []
debug text input 2 tensor([], dtype=torch.int64)
debug text input 3 tensor([], size=(1, 0), dtype=torch.int64)
debug kernel tensor([], size=(1, 0), dtype=torch.int64)
debug kernel {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
debug kernel tensor([], size=(1, 512, 0))
debug pad (2,)
debug x torch.Size([1, 512, 0])
debug weight torch.Size([512, 512, 5])
debug kernel (5,)
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 366, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 217, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
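The output above shows that the token ID list is already empty before the model is called, so the "Kernel size" error is only a downstream symptom. A guard like the following would surface the real problem earlier. This is a sketch only: `require_token_ids` is a hypothetical helper, not something that exists in TTS.

```python
from typing import Sequence


def require_token_ids(token_ids: Sequence[int], text: str) -> Sequence[int]:
    """Fail early with a descriptive message when text produced no token IDs.

    Hypothetical helper for illustration; not part of the TTS package.
    """
    if len(token_ids) == 0:
        raise ValueError(
            f"No token IDs were produced for {text!r}; "
            "check the text cleaner and phonemizer for this language."
        )
    return token_ids
```

With such a check placed right after tokenization, the "autobus" run would stop with a message pointing at the tokenizer instead of a convolution error deep inside the encoder.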
What about the following lines in /home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py? We are close to finding the source of the problem.
# convert text to sequence of token IDs
print(f'debug tokenizer {model.tokenizer.__dict__}')
print(f'debug text into tokenizer {text}')
text_inputs = np.asarray(
    model.tokenizer.text_to_ids(text, language=language_id),
    dtype=np.int32,
)
print(f'debug text out from tokenizer {text_inputs}')
> Generator Model: fullband_melgan_generator
> Discriminator Model: melgan_multiscale_discriminator
> Text: autobus
debug text autobus
> Text splitted to sentences.
['autobus']
debug tokenizer {'text_cleaner': <function phoneme_cleaners at 0x7fd7f34cfc20>, 'use_phonemes': True, 'add_blank': False, 'use_eos_bos': False, '_characters': <TTS.tts.utils.text.characters.IPAPhonemes object at 0x7fd7e6ae8110>, 'pad_id': 0, 'blank_id': None, 'not_found_characters': [], 'phonemizer': <TTS.tts.utils.text.phonemizers.gruut_wrapper.Gruut object at 0x7fd7fff0ce10>}
debug text into tokenizer autobus
debug text input 1 []
debug text input 2 tensor([], dtype=torch.int64)
debug text input 3 tensor([], size=(1, 0), dtype=torch.int64)
debug kernel tensor([], size=(1, 0), dtype=torch.int64)
debug kernel {'x_lengths': tensor([0]), 'speaker_ids': None, 'd_vectors': None, 'style_mel': None, 'style_text': None, 'language_ids': None}
debug kernel tensor([], size=(1, 512, 0))
debug pad (2,)
debug x torch.Size([1, 512, 0])
debug weight torch.Size([512, 512, 5])
debug kernel (5,)
Traceback (most recent call last):
File "/home/iwater/miniconda3/envs/tts/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 366, in main
reference_speaker_name=args.reference_speaker_idx,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 289, in tts
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 219, in synthesis
language_id=language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/synthesis.py", line 58, in run_model_torch
"language_ids": language_id,
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/models/tacotron2.py", line 252, in inference
encoder_outputs = self.encoder.inference(embedded_inputs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 112, in inference
o = layer(o)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/layers/tacotron/tacotron2.py", line 44, in forward
o = self.convolution1d(x)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 310, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size
What about the "text out from tokenizer" output (the last print line in my previous comment)?
So, I will assume that the text IDs (text -> token IDs) coming out of the tokenizer are corrupted (size 0). In this file (/home/iwater/miniconda3/envs/tts/lib/python3.7/site-packages/TTS/tts/utils/text/tokenizer.py), the text goes through the following steps to be converted into tokens. We can modify the lines as follows to debug again:
if self.text_cleaner is not None:
    text = self.text_cleaner(text)
print(f"debug text-inside-tokenizer 1 {text}")
if self.use_phonemes:
    text = self.phonemizer.phonemize(text, separator="")
print(f"debug text-inside-tokenizer 2 {text}")
if self.add_blank:
    text = self.intersperse_blank_char(text, True)
print(f"debug text-inside-tokenizer 3 {text}")
if self.use_eos_bos:
    text = self.pad_with_bos_eos(text)
print(f"debug text-inside-tokenizer 4 {text}")
return self.encode(text)
At the same time, we can also check whether there is an issue in the encode function.
def encode(self, text: str) -> List[int]:
    """Encodes a string of text as a sequence of IDs."""
    token_ids = []
    for char in text:
        try:
            idx = self.characters.char_to_id(char)
            print(f'debug token_idx_encode {idx}')
            token_ids.append(idx)
        except KeyError:
            # discard but store not found characters
            if char not in self.not_found_characters:
                self.not_found_characters.append(char)
                print(text)
                print(f" [!] Character {repr(char)} not found in the vocabulary. Discarding it.")
    return token_ids
On further inspection, it is quite possible that the phonemizer is broken in your install. If that's the case, can you try uninstalling gruut and its related packages with pip and reinstalling them? Below is a screenshot of the gruut packages I am using. Find the corresponding packages (probably on PyPI) and install the right versions.
Yes, after installing gruut-lang-fr, everything works fine. Thanks!
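This fix is consistent with the empty tensors seen earlier in the thread: if the phonemizer has no data for the language and returns an empty string, there is nothing left to encode. The toy pipeline below illustrates the effect. Everything in it (the vocabulary, both phonemizer functions, and `text_to_ids` itself) is a made-up stand-in, not the actual TTS or gruut code.

```python
from typing import Callable, Dict, List


def text_to_ids(text: str, phonemize: Callable[[str], str], vocab: Dict[str, int]) -> List[int]:
    """Toy pipeline: phonemize, then map each symbol to an ID, skipping unknown symbols."""
    phonemes = phonemize(text)
    return [vocab[symbol] for symbol in phonemes if symbol in vocab]


# Hypothetical stand-ins for illustration only.
toy_vocab = {"o": 1, "t": 2, "b": 3, "y": 4, "s": 5}


def working_phonemizer(text: str) -> str:
    return "otobys"  # pretend phoneme string for "autobus"


def broken_phonemizer(text: str) -> str:
    return ""  # mimics a phonemizer that has no French data installed


print(text_to_ids("autobus", working_phonemizer, toy_vocab))  # [1, 2, 1, 3, 4, 5]
print(text_to_ids("autobus", broken_phonemizer, toy_vocab))   # []
```

The second call returns an empty list without raising anything, which matches the silent failure observed here: the error only appears later, inside the encoder's convolution.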
HAHAHHAHAHHAHHAHAA
@erogol can close.
BTW, pip install tts only installs these packages:
gruut 2.2.3
gruut-ipa 0.13.0
gruut-lang-de 2.0.0
gruut-lang-en 2.0.0
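A quick way to see which of these distributions are present in the active environment, without reading through pip output, is the standard library's `importlib.metadata` (Python 3.8+, so it applies to the 3.9 environment above but not the 3.7 one). This is a sketch; `installed_versions` is a hypothetical helper, and the list of names passed to it is just the packages mentioned in this thread.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(distributions: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as they appear in this thread.
for name, version in installed_versions(["gruut", "gruut-ipa", "gruut-lang-fr"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If gruut-lang-fr shows up as not installed, that would match the situation described above, where the French model could not phonemize its input.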
Describe the bug
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
To Reproduce
tts --text "autobus" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
or
tts --text "chat" --model_name tts_models/fr/mai/tacotron2-DDC --out_path speech.wav
Expected behavior
Doesn't error
Logs
Environment
Additional context
No response