Attempting to synthesize speech using a combination of the VITS model and hifigan_v2 vocoder leads to an exception.
To Reproduce
Run the following command:
tts --model_name tts_models/en/ljspeech/vits --vocoder_name vocoder_models/en/ljspeech/hifigan_v2 --text "Testing vits with hifigan vocoder just to see if it works at all."
Here is the command output, including the exception:
> tts_models/en/ljspeech/vits is already downloaded.
> vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
> Using model: vits
> Vocoder Model: hifigan
> Generator Model: hifigan_generator
> Discriminator Model: hifigan_discriminator
Removing weight norm...
> Text: Testing vits with hifigan vocoder just to see if it works at all.
> Text splitted to sentences.
['Testing vits with hifigan vocoder just to see if it works at all.']
Traceback (most recent call last):
File "/home/ijwu/.local/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/bin/synthesize.py", line 287, in main
wav = synthesizer.tts(args.text, args.speaker_idx, args.language_idx, args.speaker_wav)
File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 363, in tts
waveform = self.vocoder_model.inference(vocoder_input.to(device_type))
File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/vocoder/models/gan.py", line 64, in inference
return self.model_g.inference(x)
File "/home/ijwu/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/vocoder/models/hifigan_generator.py", line 281, in inference
return self.forward(c)
File "/home/ijwu/.local/lib/python3.8/site-packages/TTS/vocoder/models/hifigan_generator.py", line 248, in forward
o = self.conv_pre(x)
File "/home/ijwu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ijwu/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 301, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/ijwu/.local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 297, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [128, 80, 7], expected input[1, 82432, 11] to have 80 channels, but got 82432 channels instead
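If I'm reading the error right, conv_pre is a Conv1d that expects a mel spectrogram shaped [batch, 80, n_frames], but it receives [1, 82432, 11]. 82432 looks more like a count of raw audio samples than a channel dimension, which makes me wonder whether the VITS output (already a waveform, since VITS is end-to-end) is being handed to the vocoder as if it were a spectrogram. A minimal standalone sketch of the mismatch; the shapes come from the error above, while the padding value is an assumption:

import torch
import torch.nn as nn

# conv_pre expects 80 input channels, i.e. a mel spectrogram shaped
# [batch, 80, n_frames]. Kernel size 7 matches the weight shape
# [128, 80, 7] reported in the error; padding=3 is an assumption.
conv_pre = nn.Conv1d(in_channels=80, out_channels=128, kernel_size=7, padding=3)

good = torch.randn(1, 80, 11)    # [batch, mel_channels, frames]
print(conv_pre(good).shape)      # torch.Size([1, 128, 11]) -- fine

bad = torch.randn(1, 82432, 11)  # the input shape from the traceback
conv_pre(bad)                    # RuntimeError: ... expected input[1, 82432, 11]
                                 # to have 80 channels, but got 82432 channels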
Expected behavior
An output WAV is generated containing the synthesized speech.
For context, this is running in an Azure VM. I doubt that affects anything, but I figured I'd be upfront about my environment in case it leads to something.
At first I tried a HiFi-GAN model I had fine-tuned myself and assumed I had done something wrong in training. So I went on to test the hifigan+VITS combination and found that it fails even with the Coqui-provided models.
Is this a known issue that I'm stumbling into out of ignorance? Is it solvable?
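In case it helps with triage, the same failure can presumably be reproduced through the Python API as well. This is an untested sketch; the Synthesizer argument names are my guess based on the traceback and TTS/bin/synthesize.py:

from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# Resolve the same pretrained models the CLI uses.
manager = ModelManager()
tts_ckpt, tts_config, _ = manager.download_model("tts_models/en/ljspeech/vits")
voc_ckpt, voc_config, _ = manager.download_model("vocoder_models/en/ljspeech/hifigan_v2")

synthesizer = Synthesizer(
    tts_checkpoint=tts_ckpt,
    tts_config_path=tts_config,
    vocoder_checkpoint=voc_ckpt,
    vocoder_config=voc_config,
)

# This is the call that raises in the traceback above.
wav = synthesizer.tts("Testing vits with hifigan vocoder just to see if it works at all.")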
Extra Note
For what it's worth, thank you: I have managed to fine-tune a model and synthesize something close to my own voice. It's a fun project and I'm really enjoying the results! I hoped to polish the output further by pairing my fine-tuned VITS model with a fine-tuned vocoder, but that led to this issue. If you have advice on how to approach fine-tuning a vocoder for my model, that would be deeply appreciated as well; my current rough guess at how such a run would look is sketched below.
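For concreteness, here is my current best guess at the shape of a vocoder fine-tuning run, loosely modeled on the HiFi-GAN recipe in the repo. Everything here is an assumption about my setup: the config values mirror the recipe defaults, the data path is a placeholder, and I'm assuming fine-tuning (rather than training from scratch) is done by launching with --restore_path pointed at a pretrained checkpoint:

import os

from trainer import Trainer, TrainerArgs

from TTS.utils.audio import AudioProcessor
from TTS.vocoder.configs import HifiganConfig
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.models.gan import GAN

output_path = os.path.dirname(os.path.abspath(__file__))

# Placeholder config: values mirror the LJSpeech HiFi-GAN recipe and
# would need adjusting for my own dataset.
config = HifiganConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    run_eval=True,
    epochs=1000,
    seq_len=8192,
    eval_split_size=10,
    print_step=25,
    data_path="/path/to/my/wavs/",  # placeholder for my recordings
    output_path=output_path,
)

ap = AudioProcessor(**config.audio.to_dict())
eval_samples, train_samples = load_wav_data(config.data_path, config.eval_split_size)

model = GAN(config, ap)
trainer = Trainer(
    TrainerArgs(),  # pass --restore_path <pretrained hifigan checkpoint> to fine-tune
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()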