[Bug] AttributeError: 'NoneType' object has no attribute 'load_wav' when using tts_with_vc_to_file

pprobst commented 1 year ago

Describe the bug

The fix in #3108 breaks tts_with_vc_to_file, at least with VITS.

See: https://github.com/coqui-ai/TTS/blob/6fef4f9067c0647258e0cd1d2998716565f59330/TTS/api.py#L463

Changing the line from:

self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name, speaker_wav=speaker_wav)

back to its pre-0.19.1 version:

self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name)

solves the issue.

Please take a look at the script below for reproduction.

To Reproduce

Clone the Coqui TTS repository and install the dependencies as specified in the README file. Then, run the following script from TTS's root directory, but replace speaker_wav with any audio file you have at hand:

#!/usr/bin/env python3

import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

tts = TTS("tts_models/pt/cv/vits").to(device)

tts.tts_with_vc_to_file(
    text="A radiografia apresentou algumas lesões no fêmur esquerdo ponto parágrafo",
    speaker_wav="test_audios/1693678335_24253176-processed.wav",
    file_path="test_audios/output.wav",
)

Expected behavior

The output audio file defined in file_path is generated, saying the sentence in text with the voice cloned from speaker_wav.

Logs

> tts_models/pt/cv/vits is already downloaded.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > initialization of language-embedding layers.
/home/probst/.pyenv/versions/coqui-tts/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 > Text splitted to sentences.
['A radiografia apresentou algumas lesões no fêmur esquerdo ponto parágrafo']
Traceback (most recent call last):
  File "/home/probst/Projects/TTS-iara/./test.py", line 15, in <module>
    tts.tts_with_vc_to_file(
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 488, in tts_with_vc_to_file
    wav = self.tts_with_vc(text=text, language=language, speaker_wav=speaker_wav)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 463, in tts_with_vc
    self.tts_to_file(text=text, speaker=None, language=language, file_path=fp.name, speaker_wav=speaker_wav)
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 403, in tts_to_file
    wav = self.tts(text=text, speaker=speaker, language=language, speaker_wav=speaker_wav, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/api.py", line 341, in tts
    wav = self.synthesizer.tts(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/utils/synthesizer.py", line 362, in tts
    speaker_embedding = self.tts_model.speaker_manager.compute_embedding_from_clip(speaker_wav)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/tts/utils/managers.py", line 365, in compute_embedding_from_clip
    embedding = _compute(wav_file)
                ^^^^^^^^^^^^^^^^^^
  File "/home/probst/Projects/TTS-iara/TTS/tts/utils/managers.py", line 342, in _compute
    waveform = self.encoder_ap.load_wav(wav_file, sr=self.encoder_ap.sample_rate)
               ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'load_wav'
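
Reading the traceback, the failure seems to come from the forwarded speaker_wav reaching a speaker manager whose encoder_ap is None. Below is a minimal sketch of that mechanism using stand-in classes only. StubSpeakerManager and synthesize are hypothetical names, not the real TTS classes, and the assumption that the encoder is only touched when speaker_wav is passed is inferred from the traceback and from the workaround above.

```python
class StubSpeakerManager:
    """Stand-in for a speaker manager (hypothetical, not the TTS class).

    encoder_ap stays None when no speaker encoder has been loaded.
    """

    def __init__(self, encoder_ap=None):
        self.encoder_ap = encoder_ap

    def compute_embedding_from_clip(self, wav_file):
        # Same attribute access as the failing line in the traceback.
        return self.encoder_ap.load_wav(wav_file, sr=self.encoder_ap.sample_rate)


def synthesize(text, speaker_manager, speaker_wav=None):
    """Hypothetical synthesis step: uses the encoder only if speaker_wav is given."""
    embedding = None
    if speaker_wav is not None:
        embedding = speaker_manager.compute_embedding_from_clip(speaker_wav)
    return {"text": text, "embedding": embedding}


manager = StubSpeakerManager(encoder_ap=None)

# Pre-0.19.1 style call: speaker_wav is not forwarded, so the encoder is never used.
ok = synthesize("hello", manager)

# 0.19.1 style call: speaker_wav is forwarded and the missing encoder is dereferenced.
try:
    synthesize("hello", manager, speaker_wav="reference.wav")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

If this matches what happens inside the library, it would explain why dropping speaker_wav from the inner tts_to_file call avoids the error for models without a speaker encoder.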

Environment

- 🐸TTS Version: 0.19.1
- PyTorch Version: 2.1.0+cu121
- OS: Artix Linux

Not using GPU.
Installed everything through pip in a virtual environment created with pyenv.

Additional context

No response

erogol commented 1 year ago

@Aya-AlJafari can you look at this one?

TheLocalLab commented 1 year ago

If anyone is still looking through this issue, you might want to take a look at #1440

erogol commented 1 year ago

@Aya-AlJafari any updates?

eginhard commented 1 year ago

@erogol The original issue (#3067) was people trying to use tts.tts_with_vc_to_file() with XTTS, and it was "fixed" in ~~#3108~~#3109. But XTTS has integrated VC, so you can just do tts.tts_to_file(..., speaker_wav="..."); there is no point in passing the output through FreeVC afterwards. IMHO, #3109 should be reverted because it breaks tts.tts_with_vc_to_file() for any model that doesn't have integrated VC, i.e. all the models this method is meant for. Perhaps tts.tts_with_vc_to_file() could raise a clearer error when it's called for models that already support VC.
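
A rough sketch of what such a guard could look like, written as a standalone function so it runs without the library. The function name check_needs_external_vc and the has_integrated_vc flag are assumptions made up for illustration; how the real API would detect integrated VC is not specified in this thread.

```python
def check_needs_external_vc(model_name, has_integrated_vc):
    """Raise a descriptive error if the model already does voice cloning itself.

    Both parameters are hypothetical: the caller is assumed to know whether
    the loaded model has integrated VC.
    """
    if has_integrated_vc:
        raise ValueError(
            f"Model '{model_name}' already supports voice cloning. "
            "Call tts_to_file(..., speaker_wav=...) instead of "
            "tts_with_vc_to_file()."
        )


# A model without integrated VC passes the check silently.
check_needs_external_vc("tts_models/pt/cv/vits", has_integrated_vc=False)

# A model with integrated VC gets an actionable message instead of a late failure.
try:
    check_needs_external_vc("xtts", has_integrated_vc=True)
    guard_message = None
except ValueError as exc:
    guard_message = str(exc)
```

Called at the top of tts_with_vc_to_file(), a check like this would fail early and point the user to the right method.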
