angelhd1999 commented 1 year ago

### Describe the bug

When using the model tts_models/es/mai/tacotron2-DDC with the example phrase "Los flamencos son aves gregarias altamente especializadas, que habitan sistemas salinos de donde obtienen su alimento." (though it happens with any phrase), the output is a broken audio file less than 1 second long:

https://user-images.githubusercontent.com/51427052/211700222-49bf2b87-711e-4bad-afd5-832cfceae30c.mp4

### To Reproduce

I tried the three options I found in the documentation.

1. Using the **API**: after installing everything necessary on Windows (following the documentation's recommendations), I ran the following code:
```
from TTS.api import TTS

# Running a single speaker model
# Spanish Model 21: tts_models--es--mai--tacotron2-DDC ! Not working
# Spanish Model 22: tts_models--es--css10--vits
model_name = TTS.list_models()[22]

# Init TTS with the target model name
tts = TTS(model_name=model_name, progress_bar=False, gpu=False)

# Run TTS
tts.tts_to_file(text="Los flamencos son aves gregarias altamente especializadas, que habitan sistemas salinos de donde obtienen su alimento.", file_path="test.wav")
```

2. Using the **command line**:

tts --text "Los flamencos son aves gregarias altamente especializadas, que habitan sistemas salinos de donde obtienen su alimento." --model_name "tts_models/es/mai/tacotron2-DDC" --out_path testing.wav

3. Using the **Docker server**:

docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu python TTS/server/server.py --model_name tts_models/es/mai/tacotron2-DDC

It also fails at https://huggingface.co/spaces/coqui/CoquiTTS when selecting **tts_models/es/mai/tacotron2-DDC**.
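To check for the symptom without listening to each file, the duration of a generated WAV can be read programmatically. This is a minimal sketch using only the Python standard library. It assumes the output is an uncompressed PCM WAV that the `wave` module can read, and `wav_duration_seconds` is a hypothetical helper name, not part of TTS:

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()   # number of audio frames in the file
        rate = wav_file.getframerate()   # sampling rate in Hz
    return frames / float(rate)
```

Calling `wav_duration_seconds("test.wav")` on the file produced by any of the three methods above should return well under 1 second when the problem occurs.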

### Expected behavior

When using **tts_models/es/css10/vits** you can get, for example:

https://user-images.githubusercontent.com/51427052/211701564-5781a35d-3200-4cf7-bcbb-58dc4ed4c05c.mp4

The problem is that this is a male voice and I think the **audio_es_mai_tacotron2_ddc** was a **female** voice, which is what I need.

### Logs

```shell
Just in case they are useful:

 > Using model: Tacotron2
C:\Users\angel\.conda\envs\aivt\lib\site-packages\torchaudio\extension\extension.py:13: UserWarning: torchaudio C++ extension is not available.
  warnings.warn('torchaudio C++ extension is not available.')
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:C:\Users\angel\AppData\Local\tts\tts_models--es--mai--tacotron2-DDC\scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: fullband_melgan
 > Setting up Audio Processor...
 | > sample_rate:24000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:0
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:50.0
 | > mel_fmax:7600.0
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > db_level:None
 | > stats_path:C:\Users\angel\AppData\Local\tts\vocoder_models--universal--libri-tts--fullband-melgan\scale_stats.npy
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Generator Model: fullband_melgan_generator
 > Discriminator Model: melgan_multiscale_discriminator
 > Text: Los flamencos son aves gregarias altamente especializadas, que habitan sistemas salinos de donde obtienen su alimento.
 > Text splitted to sentences.
['Los flamencos son aves gregarias altamente especializadas, que habitan sistemas salinos de donde obtienen su alimento.']
 > interpolating tts model output.
 > before interpolation : (80, 3)
 > after interpolation : torch.Size([1, 80, 4])
 > Processing time: 0.20200014114379883
 > Real-time factor: 0.26826047960663857
 > Saving output to testing.wav
```
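The `before interpolation : (80, 3)` line is worth a second look: it suggests the Tacotron2 model produced only 3 mel frames for the whole sentence. As a rough back-of-the-envelope check, assuming each mel frame advances by `hop_length` samples (values are copied from the log; `frames_to_seconds` is a hypothetical helper, not a TTS function):

```python
def frames_to_seconds(num_frames, hop_length, sample_rate):
    """Approximate audio duration covered by `num_frames` mel frames."""
    return num_frames * hop_length / sample_rate


# Values copied from the log: hop_length 256, TTS model at 16 kHz, vocoder at 24 kHz.
tts_seconds = frames_to_seconds(3, hop_length=256, sample_rate=16000)
vocoder_seconds = frames_to_seconds(4, hop_length=256, sample_rate=24000)

print(f"before interpolation: {tts_seconds * 1000:.1f} ms")      # 48.0 ms
print(f"after interpolation:  {vocoder_seconds * 1000:.1f} ms")  # 42.7 ms
```

Either way this amounts to a few hundredths of a second, which is consistent with the near-empty audio and suggests the model received almost no input.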

### Environment

Everything has been installed following the documentation.
{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 1080"
        ],
        "available": true,
        "version": "10.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.8.0+cu101",
        "TTS": "0.10.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Windows",
        "architecture": [
            "64bit",
            "WindowsPE"
        ],
        "processor": "Intel64 Family 6 Model 158 Stepping 12, GenuineIntel",
        "python": "3.9.15",
        "version": "10.0.19045"
    }
}

### Additional context

If there is a way to use the male voice of tts_models/es/css10/vits and transform it into a female voice, that could also be an interesting solution.

Edresson commented 1 year ago

Hi, this model works for me using the dev branch.

Please install :frog: TTS from the dev branch using the command: pip install git+https://github.com/coqui-ai/TTS.git and try again.

For me, the command tts --model_name tts_models/es/mai/tacotron2-DDC --text "Los flamencos son aves gregarias altamente especializadas" generates the following audio:

https://user-images.githubusercontent.com/28763586/212159037-46b8a080-0250-45cd-9d1a-09dcce8ff6f6.mp4

angelhd1999 commented 1 year ago

Hello, thank you for your feedback. I'm now getting the following error. Log:

> Text splitted to sentences.
['Los flamencos son aves gregarias altamente especializadas']
Traceback (most recent call last):
  File "C:\...\.conda\envs\aivt\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\...\.conda\envs\aivt\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\...\.conda\envs\aivt\Scripts\tts.exe\__main__.py", line 7, in <module>
  File "C:\...\TTS\bin\synthesize.py", line 357, in main
    wav = synthesizer.tts(
  File "C:\...\TTS\utils\synthesizer.py", line 278, in tts
    outputs = synthesis(
  File "C:\...\TTS\tts\utils\synthesis.py", line 213, in synthesis
    outputs = run_model_torch(
  File "C:\...\TTS\tts\utils\synthesis.py", line 50, in run_model_torch
    outputs = _func(
  File "C:\...\.conda\envs\aivt\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\...\TTS\tts\models\tacotron2.py", line 249, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "C:\...\TTS\tts\layers\tacotron\tacotron2.py", line 108, in inference
    o = layer(o)
  File "C:\...\.conda\envs\aivt\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\...\TTS\tts\layers\tacotron\tacotron2.py", line 40, in forward
    o = self.convolution1d(x)
  File "C:\...\.conda\envs\aivt\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\...\.conda\envs\aivt\lib\site-packages\torch\nn\modules\conv.py", line 263, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "C:\...\.conda\envs\aivt\lib\site-packages\torch\nn\modules\conv.py", line 259, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size

\...\ stands for what I think are irrelevant parts of the path.
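The error message itself gives a hint about the input length. Below is a small sketch of the arithmetic, assuming the encoder convolution uses kernel size 5 with padding 2 on each side. The kernel size comes from the message; the padding is an assumption based on the common `(kernel_size - 1) // 2` choice and is not confirmed in this thread. Both function names are hypothetical:

```python
def padded_length(input_length, padding):
    """Length of a 1-D input after padding both sides."""
    return input_length + 2 * padding


def conv1d_output_length(input_length, kernel_size, padding=0, stride=1, dilation=1):
    """Standard 1-D convolution output length; raises if the kernel does not fit."""
    padded = padded_length(input_length, padding)
    effective_kernel = dilation * (kernel_size - 1) + 1
    if effective_kernel > padded:
        raise ValueError(
            f"kernel size {effective_kernel} is greater than padded input size {padded}"
        )
    return (padded - effective_kernel) // stride + 1


# A padded size of 4 with padding 2 means the unpadded input had length 0.
print(padded_length(0, padding=2))  # 4

# Any non-empty input fits the kernel and keeps its length.
print(conv1d_output_length(1, kernel_size=5, padding=2))  # 1
```

Under that assumption, the encoder received an empty token sequence, which would be consistent with the text front end producing no symbols for the sentence.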

erogol commented 1 year ago

Can't replicate it so I am closing it.

Try using punctuation at the end for the problem above.

mudomau commented 1 year ago

I could reproduce this on Arch Linux. I first tried just using pip install tts, but kept running into the issue of empty speech files when trying to generate Spanish.

So I tried the proposed fix: pip install git+https://github.com/coqui-ai/TTS.git

I tried both Spanish models available.

Using

tts --model_name tts_models/es/mai/tacotron2-DDC --text "Los flamencos son aves gregarias altamente especializadas"

resulted in:

"RuntimeError: Calculated padded input size per channel: (4). Kernel size: (5). Kernel size can't be greater than actual input size"

Adding punctuation (one dot at the end) allows the program to go through this and finish running, but the generated .wav is still empty.

Using

tts --model_name tts_models/es/css10/vits --text "Los flamencos son aves gregarias altamente especializadas"

A correct output is generated.

I'd really like to be able to use tacotron as I like the voice better.

I tried uninstalling and reinstalling again, addressing the model directly, and deleting and re-downloading the model, but the result is still the same. Without punctuation the model fails, and with punctuation it produces an empty file as output.

@angelhd1999 Did you get it to work?

Edit: I read elsewhere here that the Tacotron2 model doesn't work well with very short inputs, so I tried increasing the length of the text. The output this time was not an empty 1-second .WAV, but rather a 2-second .WAV containing a loud shrill noise and nothing else. Nothing like the result @Edresson got.

aalvarado commented 1 year ago

using tts_models/es/mai/tacotron2-DDC did not work for me either

deadprogram commented 1 year ago

> using tts_models/es/mai/tacotron2-DDC did not work for me either

same.

jordicor commented 1 year ago

same problem here

YA2JA commented 1 year ago

Hello, same issue for the tts_models/fr/mai/tacotron2-DDC

Edresson commented 1 year ago

Hey guys @angelhd1999 @mudomau @aalvarado @deadprogram @jordicor @YA2JA I was finally able to reproduce the error.

It is a requirements issue. You need to install the gruut language packages for the target languages (es or fr in your case). You can do it for FR and ES with the following command: pip install gruut-lang-es gruut-lang-fr

I have used a conda env with python 3.9.12. I have created the environment to test with the following commands:

conda create --name tts python=3.9 
conda activate tts
pip install git+https://github.com/coqui-ai/TTS.git
# install gruut non english languages
pip install gruut-lang-cs gruut-lang-de gruut-lang-en gruut-lang-es gruut-lang-fr gruut-lang-it gruut-lang-nl gruut-lang-pt gruut-lang-ru gruut-lang-sv gruut-lang-ar gruut-lang-fa  gruut-lang-sw

Then I ran the command:

tts --model_name  tts_models/es/mai/tacotron2-DDC --text "Los flamencos son aves gregarias altamente especializadas" --out_path tts-output-py39.wav

Then the output was:

https://user-images.githubusercontent.com/28763586/235362023-d7fa774c-644e-43f9-b532-285e9caca30d.mp4

Please let me know if this does not fix the issue.

@erogol Should we add these packages to the requirements to avoid this issue in the future?
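Until these packages are pulled in automatically, a script can check for them before synthesizing. This is a minimal sketch, assuming each `gruut-lang-xx` distribution installs an importable module named `gruut_lang_xx` (this naming is an assumption, and `missing_gruut_languages` is a hypothetical helper, not part of TTS or gruut):

```python
import importlib.util


def missing_gruut_languages(languages):
    """Return the language codes whose gruut language module cannot be found."""
    missing = []
    for lang in languages:
        module_name = f"gruut_lang_{lang}"  # assumed module naming convention
        if importlib.util.find_spec(module_name) is None:
            missing.append(lang)
    return missing


missing = missing_gruut_languages(["es", "fr"])
if missing:
    packages = " ".join(f"gruut-lang-{lang}" for lang in missing)
    print(f"Missing gruut language packages, try: pip install {packages}")
else:
    print("gruut language packages for es and fr are installed")
```

Running this in the same environment as `tts` gives an explicit message instead of a silent, near-empty audio file.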