DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Voice cloning not working #178

Closed uni-saurabh-vyas closed 4 months ago

uni-saurabh-vyas commented 4 months ago

Logs:

torchvision is not available - cannot save figures
Directory '/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/data/ims/test' already exists.
Directory '/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/data/ims/test_kaldi' already exists.
/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/ims/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/mnt/efs/saurabh/saurabh/tools/xtts/src_speakers_hindi3/2ec8b903-380b-4d91-9c5f-9e6e8306a634-A_003387-004535_sp1.1.wav
/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/ims/lib/python3.8/site-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
Now synthesizing: my text to say
/mnt/efs/saurabh/saurabh/tools/xtts/src_speakers_hindi3/EA135614-E35D-4F14-93A5-70FA45293867-32bd8f75-c35f-4a1e-a902-533f36f81a7c-1-275870.wav
Now synthesizing: my text to say
/mnt/efs/saurabh/saurabh/tools/xtts/src_speakers_hindi3/6dee3eed-2b24-40b1-a8f8-465516574b6b-B_032048-033021_sp0.9.wav
Traceback (most recent call last):
  File "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/IMS-Toucan/ims_tts.py", line 63, in <module>
    tts.set_utterance_embedding(speaker_reference)
  File "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/IMS-Toucan/InferenceInterfaces/ToucanTTSInterface.py", line 112, in set_utterance_embedding
    wave = Resample(orig_freq=sr, new_freq=16000).to(self.device)(torch.tensor(wave, device=self.device, dtype=torch.float32))
  File "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/ims/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/ims/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/ims/lib/python3.8/site-packages/torchaudio/transforms/_transforms.py", line 979, in forward
    return _apply_sinc_resample_kernel(waveform, self.orig_freq, self.new_freq, self.gcd, self.kernel, self.width)
  File "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/ims/lib/python3.8/site-packages/torchaudio/functional/functional.py", line 1460, in _apply_sinc_resample_kernel
    waveform = waveform.view(-1, shape[-1])
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

Code:

import os

import torch

from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface

lang = "eng"
tts = ToucanTTSInterface(device="cuda" if torch.cuda.is_available() else "cpu", tts_model_path="Meta", language=lang)

input_text = "my text to say"

# Loop through the speaker reference audio files in the folder
speaker_reference_folder = "/mnt/efs/saurabh/saurabh/tools/xtts/src_speakers_hindi3"
dst_dir = "/mnt/efs/jeena/to_saurabh/Hindi_phrases_sents/data/ims/test"

for file_name in os.listdir(speaker_reference_folder):
    if file_name.endswith('.wav'):
        speaker_reference = os.path.join(speaker_reference_folder, file_name)

        print(speaker_reference)

        # Set the speaker embedding to clone the voice
        tts.set_utterance_embedding(speaker_reference)

        # Synthesize speech with the cloned voice
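        # Name the output after the reference file so each iteration does not overwrite the previous one
        output_file_name = os.path.join(dst_dir, f"{os.path.splitext(file_name)[0]}_cloned.wav")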

        tts.read_to_file(text_list=[input_text], file_location=output_file_name)

del tts

Any help would be appreciated.

Flux9665 commented 4 months ago

Since the error message says something about a tensor with 0 elements, I suspect that there might be a problem with the audio you loaded. Can you try printing the shape of the audio that goes into the resample function inside of the tts.set_utterance_embedding function?
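A quick way to test this suspicion without editing the toolkit is to scan the reference folder for files that contain no samples. The sketch below is only an illustration: `count_wav_frames` and `find_suspect_wavs` are hypothetical helper names, and it assumes the references are uncompressed PCM WAV files, since Python's standard `wave` module cannot read other encodings and would report those as unreadable.

```python
import os
import wave


def count_wav_frames(path):
    """Return the number of audio frames in a PCM WAV file, or None if unreadable."""
    try:
        with wave.open(path, "rb") as wav_file:
            return wav_file.getnframes()
    except (wave.Error, EOFError, OSError):
        # Header is missing or truncated, or the file is not PCM WAV at all.
        return None


def find_suspect_wavs(folder):
    """List (file_name, frame_count) for .wav files that are empty or unreadable."""
    suspects = []
    for file_name in sorted(os.listdir(folder)):
        if not file_name.endswith(".wav"):
            continue
        frames = count_wav_frames(os.path.join(folder, file_name))
        if not frames:  # None (unreadable) or 0 (no samples)
            suspects.append((file_name, frames))
    return suspects
```

Calling `find_suspect_wavs(speaker_reference_folder)` before the synthesis loop would list any reference that would reach the resampler as a zero-length tensor.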

uni-saurabh-vyas commented 4 months ago

The audio file seems to be there:

Toucan/InferenceInterfaces/ToucanTTSInterface.py(108)set_utterance_embedding()
-> if len(wave.shape) > 1: # oh no, we found a stereo audio!
(Pdb) wave.shape
(83498,)

Channels : 1
Sample Rate : 8000
Precision : 16-bit
Sample Encoding: 16-bit Signed Integer PCM

uni-saurabh-vyas commented 4 months ago

Never mind, I think one of the files was empty or otherwise broken when the loop reached it. Wrapping the call in a try/except block solved the issue.
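The exact change was not posted, so the following is only a sketch of what such a try/except wrapper might look like. `process_references` and `handle_reference` are hypothetical names. The per-file work is passed in as a callable so the loop itself does not depend on the toolkit.

```python
import os


def process_references(folder, handle_reference):
    """Call handle_reference(path) for every .wav in folder, skipping failures.

    Returns a list of (path, error) pairs for the files that raised.
    """
    failed = []
    for file_name in sorted(os.listdir(folder)):
        if not file_name.endswith(".wav"):
            continue
        path = os.path.join(folder, file_name)
        try:
            handle_reference(path)
        except Exception as error:  # deliberately broad: one bad file should not stop the batch
            print(f"Skipping {path}: {error}")
            failed.append((path, error))
    return failed
```

In the script from this thread, `handle_reference` would be a small function that calls `tts.set_utterance_embedding(path)` and then `tts.read_to_file(...)`. Returning the failures makes it easy to see afterwards which reference files were skipped and why.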