Closed Vicopem01 closed 3 months ago
There should be no dependence between the language and the speaker; you can mix them freely. I made a new release this morning with new models. The new version is better at this, but the old version should also have been able to do that.
Can the pretrained model do this? Or do I need to fine-tune or train a model instead?
Also, I cannot get the utterance cloner to run after the update:
```
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    302             weight, bias, self.stride,
    303             _single(0), self.dilation, self.groups)
--> 304         return F.conv1d(input, weight, bias, self.stride,
    305                         self.padding, self.dilation, self.groups)
    306

RuntimeError: Given groups=1, weight of size [384, 1, 1], expected input[1, 34, 1] to have 1 channels, but got 34 channels instead
```
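For context, this error is what `conv1d` raises when a tensor arrives in `(batch, time, channels)` order instead of the `(batch, channels, time)` order PyTorch expects: the 34 time steps are misread as 34 channels. A minimal sketch of the shape problem, using NumPy in place of a real prosody-feature tensor (the shapes are taken from the traceback above; the variable names are illustrative, not from the library):

```python
import numpy as np

# Hypothetical prosody-feature tensor in the order that triggers the error:
# (batch, time, channels) = (1, 34, 1)
features = np.zeros((1, 34, 1))

# torch.nn.functional.conv1d expects (batch, channels, time), so a layer
# with in_channels=1 sees 34 "channels" here and raises the RuntimeError.
# Swapping the last two axes restores the expected layout:
fixed = np.swapaxes(features, 1, 2)

print(features.shape)  # (1, 34, 1)
print(fixed.shape)     # (1, 1, 34)
```

In PyTorch the equivalent fix would be `tensor.transpose(1, 2)` before the convolution, which matches the maintainer's later description of the bug as "the dimensions of the tensor were not in the right order."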
Yes, the pretrained model is trained specifically to be good at switching between languages. Finetuning can however help a lot for some languages.
I will take a look at the prosody cloning when I get back from vacation. It's possible the last release broke something; I didn't have time to test the prosody cloning properly.
There was indeed a bug in the prosody cloning: the dimensions of the tensor were not in the right order. I fixed it; it should work now.
The voice cloning and prosody cloning are amazing, but I want to clone the prosody while synthesizing speech in another language. I'm not having any luck so far. Any help?
I noticed the models only accept the reference audio and the text in the same language. Is it possible to use English audio as the reference and Spanish as the text, while specifying "spa" as the language, just to clone the English prosody and transfer the style?