coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

YourTTS: Not able to use reference_wav in Synthesizer #2401

Closed fungus75 closed 1 year ago

fungus75 commented 1 year ago

Describe the bug

I trained a YourTTS model on the Thorsten dataset (downsampled to 16000 Hz). Apart from the problem described in https://github.com/coqui-ai/TTS/issues/2391, training and voice generation worked perfectly.

But now I want to use reference_wav for voice conversion, and it throws an error.

To Reproduce

I ran the following script in exactly the same environment in which I created the model:

---cut--
import os

from TTS.utils.synthesizer import Synthesizer

MODEL_PATH = "best_model.pth"
CONFIG_PATH = "config.json"
OUT_PATH = "."

s = Synthesizer(MODEL_PATH, CONFIG_PATH, use_cuda=True)
wav = s.tts("Hallo ich bin Eric und wie geht es euch?", reference_wav="reference.wav")
s.save_wav(wav, os.path.join(OUT_PATH, "test.wav"))
---cut--

Expected behavior

The file test.wav should be saved in the given folder, but instead the call crashes with the error below.

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 wav=s.tts("Hallo ich bin Eric und wie geht es euch?",reference_wav="reference.wav")

File ~/MachineLearning/Voice/TTS/TTS/utils/synthesizer.py:344, in Synthesizer.tts(self, text, speaker_name, language_name, speaker_wav, style_wav, style_text, reference_wav, reference_speaker_name)
    340     else:
    341         reference_speaker_embedding = self.tts_model.speaker_manager.compute_embedding_from_clip(
    342             reference_wav
    343         )
--> 344 outputs = transfer_voice(
    345     model=self.tts_model,
    346     CONFIG=self.tts_config,
    347     use_cuda=self.use_cuda,
    348     reference_wav=reference_wav,
    349     speaker_id=speaker_id,
    350     d_vector=speaker_embedding,
    351     use_griffin_lim=use_gl,
    352     reference_speaker_id=reference_speaker_id,
    353     reference_d_vector=reference_speaker_embedding,
    354 )
    355 waveform = outputs
    356 if not use_gl:

File ~/MachineLearning/Voice/TTS/TTS/tts/utils/synthesis.py:315, in transfer_voice(model, CONFIG, use_cuda, reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector, do_trim_silence, use_griffin_lim)
    313 else:
    314     _func = model.inference_voice_conversion
--> 315 model_outputs = _func(reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector)
    317 # convert outputs to numpy
    318 # plot results
    319 wav = None

File ~/anaconda3/envs/voice/lib/python3.10/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/MachineLearning/Voice/TTS/TTS/tts/models/vits.py:1197, in Vits.inference_voice_conversion(self, reference_wav, speaker_id, d_vector, reference_speaker_id, reference_d_vector)
   1195 speaker_cond_src = reference_speaker_id if reference_speaker_id is not None else reference_d_vector
   1196 speaker_cond_tgt = speaker_id if speaker_id is not None else d_vector
-> 1197 wav, _, _ = self.voice_conversion(y, y_lengths, speaker_cond_src, speaker_cond_tgt)
   1198 return wav

File ~/MachineLearning/Voice/TTS/TTS/tts/models/vits.py:1218, in Vits.voice_conversion(self, y, y_lengths, speaker_cond_src, speaker_cond_tgt)
   1216 elif not self.args.use_speaker_embedding and self.args.use_d_vector_file:
   1217     g_src = F.normalize(speaker_cond_src).unsqueeze(-1)
-> 1218     g_tgt = F.normalize(speaker_cond_tgt).unsqueeze(-1)
   1219 else:
   1220     raise RuntimeError(" [!] Voice conversion is only supported on multi-speaker models.")

File ~/anaconda3/envs/voice/lib/python3.10/site-packages/torch/nn/functional.py:4632, in normalize(input, p, dim, eps, out)
   4630     return handle_torch_function(normalize, (input, out), input, p=p, dim=dim, eps=eps, out=out)
   4631 if out is None:
-> 4632     denom = input.norm(p, dim, keepdim=True).clamp_min(eps).expand_as(input)
   4633     return input / denom
   4634 else:

File ~/anaconda3/envs/voice/lib/python3.10/site-packages/torch/_tensor.py:638, in Tensor.norm(self, p, dim, keepdim, dtype)
    634 if has_torch_function_unary(self):
    635     return handle_torch_function(
    636         Tensor.norm, (self,), self, p=p, dim=dim, keepdim=keepdim, dtype=dtype
    637     )
--> 638 return torch.norm(self, p, dim, keepdim, dtype=dtype)

File ~/anaconda3/envs/voice/lib/python3.10/site-packages/torch/functional.py:1529, in norm(input, p, dim, keepdim, out, dtype)
   1527 if out is None:
   1528     if dtype is None:
-> 1529         return _VF.norm(input, p, _dim, keepdim=keepdim)  # type: ignore[attr-defined]
   1530     else:
   1531         return _VF.norm(input, p, _dim, keepdim=keepdim, dtype=dtype)  # type: ignore[attr-defined]

RuntimeError: norm(): input dtype should be either floating point or complex. Got Long instead.
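
For context, a minimal sketch (not part of the original report) of the dtype mismatch behind this error: F.normalize expects a floating-point tensor such as a d-vector, but in this call the target speaker conditioning is the integer speaker id, which reaches torch.norm as a Long tensor.

import torch
import torch.nn.functional as F

# A float d-vector normalizes fine (the 512-dim shape is only for illustration).
d_vector = torch.rand(1, 512)
print(F.normalize(d_vector).shape)   # torch.Size([1, 512])

# An integer speaker id (Long tensor) does not: torch.norm rejects it,
# which is the RuntimeError raised in Vits.voice_conversion above.
speaker_id = torch.tensor([[0]])     # dtype=torch.int64
try:
    F.normalize(speaker_id)
except RuntimeError as e:
    print(e)                         # "norm(): input dtype should be either floating point or complex ..."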

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "TTS": "0.10.2",
        "numpy": "1.22.4"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.10.9",
        "version": "#1 SMP Debian 5.10.162-1 (2023-01-21)"
    }
}

Additional context

No response

fungus75 commented 1 year ago

If necessary, the file reference.wav can be downloaded here: https://drive.google.com/drive/folders/1SbMCLyD3YDie6lVl9lPgEdxZA9U0ZnJm?usp=sharing

thorstenMueller commented 1 year ago

I'm curious about the final result, as it's based on my voice dataset 🙂.

fungus75 commented 1 year ago

If necessary, you can download the model checkpoint and config.json from here: https://drive.google.com/drive/folders/1bU9ObB1Z30VoT5miTXEW2bDW1EODw-gr?usp=sharing

Edresson commented 1 year ago

Hi @fungus75,

It looks like you trained YourTTS only on the Thorsten dataset, which is a single-speaker dataset. Because of that, you will not get good voice conversion performance.

What breaks your inference is that for voice conversion you need to provide both the speaker_wav and the reference_wav. You can find some instructions here: https://github.com/Edresson/YourTTS#voice-conversion
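
For illustration, a minimal sketch of such a call, based on the Synthesizer.tts signature visible in the traceback above. The file name target_speaker.wav is a placeholder for a clip of the target voice (not a file from this thread), and this assumes the model computes d-vectors from wav files.

import os

from TTS.utils.synthesizer import Synthesizer

MODEL_PATH = "best_model.pth"
CONFIG_PATH = "config.json"
OUT_PATH = "."

s = Synthesizer(MODEL_PATH, CONFIG_PATH, use_cuda=True)

# reference_wav is the source speech to convert; speaker_wav is a clip of the
# target voice (placeholder file name). With speaker_wav given, the target
# conditioning is a float d-vector rather than an integer speaker id.
wav = s.tts(
    "Hallo ich bin Eric und wie geht es euch?",
    speaker_wav="target_speaker.wav",
    reference_wav="reference.wav",
)
s.save_wav(wav, os.path.join(OUT_PATH, "test.wav"))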

AraKchrUser commented 1 year ago

Hello. I have the same problem; how did you solve it?

fungus75 commented 1 year ago

You have to train on a multi-speaker dataset; then you can use reference_wav. I had only trained on a single-speaker dataset. As soon as I trained on a multi-speaker dataset, it worked.
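
For completeness, a minimal sketch of what the call might look like with a multi-speaker model, again using the Synthesizer.tts signature from the traceback. multispeaker_model.pth and the speaker name are placeholders, not files or names from this thread.

from TTS.utils.synthesizer import Synthesizer

s = Synthesizer("multispeaker_model.pth", "config.json", use_cuda=True)

# Convert reference.wav into the voice of one of the trained speakers.
# "speaker_01" is a placeholder; use a name the multi-speaker model actually knows.
wav = s.tts(
    "Hallo ich bin Eric und wie geht es euch?",
    speaker_name="speaker_01",
    reference_wav="reference.wav",
)
s.save_wav(wav, "converted.wav")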