coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] VITS Model Seemingly Incompatible with Hifigan_v2 Vocoder #1326

Closed Ijwu closed 2 years ago

Ijwu commented 2 years ago

πŸ› Description

Attempting to synthesize speech using a combination of the VITS model and hifigan_v2 vocoder leads to an exception.

To Reproduce

Run the command

tts --model_name tts_models/en/ljspeech/vits --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --text "Testing vits with hifigan vocoder just to see if it works at all."

Here is the full command output, including the exception:

 > tts_models/en/ljspeech/vits is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: vits
 > Vocoder Model: hifigan
 > Generator Model: hifigan_generator
 > Discriminator Model: hifigan_discriminator
Removing weight norm...
 > Text: Testing vits with hifigan vocoder just to see if it works at all.
 > Text splitted to sentences.
['Testing vits with hifigan vocoder just to see if it works at all.']
Traceback (most recent call last):
  File "/home/ijwu/.local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/bin/synthesize.py", line 287, in main
    wav = synthesizer.tts(args.text, args.speaker_idx, args.language_idx, args.speaker_wav)
  File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 363, in tts
    waveform = self.vocoder_model.inference(vocoder_input.to(device_type))
  File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/vocoder/models/gan.py", line 64, in inference
    return self.model_g.inference(x)
  File "/home/ijwu/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/vocoder/models/hifigan_generator.py", line 281, in inference
    return self.forward(c)
  File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/vocoder/models/hifigan_generator.py", line 248, in forward
    o = self.conv_pre(x)
  File "/home/ijwu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ijwu/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 301, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ijwu/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 297, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [128, 80, 7], expected input[1, 82432, 11] to have 80 channels, but got 82432 channels instead
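The shape in the error hints at the cause: HiFi-GAN's first convolution (`conv_pre`) is a `Conv1d` with `in_channels=80`, so it expects a mel spectrogram shaped `(batch, 80, frames)`. VITS is end-to-end and outputs a raw waveform, so the tensor handed to the vocoder carries the waveform length (82432 samples) where the 80 mel channels should be. The sketch below mimics the channel check that `F.conv1d` performs; `check_conv1d_input` is a hypothetical helper written for illustration, not actual TTS or PyTorch code.

```python
def check_conv1d_input(in_channels, input_shape):
    """Mimic the channel check F.conv1d performs on an (N, C, L) input."""
    batch, channels, length = input_shape
    if channels != in_channels:
        raise RuntimeError(
            f"expected input{list(input_shape)} to have {in_channels} "
            f"channels, but got {channels} channels instead"
        )
    return (batch, length)  # shape info the convolution would go on to use

# A proper 80-band mel input passes:
check_conv1d_input(80, (1, 80, 250))

# A VITS waveform reinterpreted as (1, 82432, 11) fails just like the traceback:
try:
    check_conv1d_input(80, (1, 82432, 11))
except RuntimeError as e:
    print(e)
```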

Expected behavior

An output WAV is generated containing the synthesized speech.

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla P40",
            "Tesla P40"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.2+cu102",
        "TTS": "0.5.0",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.8.10",
        "version": "#31~20.04.2-Ubuntu SMP Tue Jan 18 08:46:15 UTC 2022"
    }
}

Additional context

This is running in a VM in Azure, for context. I doubt that affects anything, but I figured I'd be open about my environment in case it leads to something.

At first I tried using a fine-tuned hifigan model and assumed I had done something wrong in training. So I went on to test the hifigan+VITS combination and found it fails even with the Coqui-provided models.

Is this a known issue that I'm stumbling into out of ignorance? Is it solvable?

Extra Note

For what it's worth, thank you: I have managed to fine-tune a model and synthesize something close to my voice. It's a fun project and I'm really enjoying the results! I hoped to get something a little more polished by combining my fine-tuned VITS model with a fine-tuned vocoder, but that led to this issue. If you have advice on how to approach tuning a vocoder for my model, that would be deeply appreciated as well.

erogol commented 2 years ago

VITS doesn't need a vocoder model. It has a vocoder component built in.
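In other words, VITS generates the waveform end-to-end, so the fix is simply to drop the `--vocoder_name` flag. A minimal sketch of the corrected invocation, using the same text as in the report (the output filename is an arbitrary choice):

```shell
# VITS synthesizes the waveform itself; no external vocoder is needed.
tts --model_name tts_models/en/ljspeech/vits \
    --text "Testing vits with hifigan vocoder just to see if it works at all." \
    --out_path output.wav
```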