coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] tts.tts_with_vc_to_file cannot use cpu #3797

Open pieris98 opened 1 week ago

pieris98 commented 1 week ago

Describe the bug

Similar to #3787, but this also happens when running the xtts_v2 model with voice cloning (via the FreeVC voice conversion model): using device='cpu' results in the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)` 

To Reproduce

import torch
from TTS.api import TTS

device = "cpu"
print(device)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_with_vc_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)

Expected behavior

Inference should run on the CPU without touching CUDA or raising any CUDA/cuDNN/GPU-related errors.

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 12
      7 # !tts --text "hello world" \
      8 # --model_name "tts_models/en/ljspeech/glow-tts" \
      9 # --out_path output.wav
     11 tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
---> 12 tts.tts_with_vc_to_file(text="Hello world!", speaker='Andrew Chipper',speaker_wav="/home/cherry/dev/coqui/steve_taylor.wav", language="en",file_path="/home/cherry/dev/coqui/outputs/xttsv2_en_output.wav")

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:455, in TTS.tts_with_vc_to_file(self, text, language, speaker_wav, file_path, speaker, split_sentences)
    423 def tts_with_vc_to_file(
    424     self,
    425     text: str,
   (...)
    430     split_sentences: bool = True,
    431 ):
    432     """Convert text to speech with voice conversion and save to file.
    433 
    434     Check `tts_with_vc` for more details.
   (...)
    453             applicable to the 🐸TTS models. Defaults to True.
    454     """
--> 455     wav = self.tts_with_vc(
    456         text=text, language=language, speaker_wav=speaker_wav, speaker=speaker, split_sentences=split_sentences
    457     )
    458     save_wav(wav=wav, path=file_path, sample_rate=self.voice_converter.vc_config.audio.output_sample_rate)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:419, in TTS.tts_with_vc(self, text, language, speaker_wav, speaker, split_sentences)
    415     self.tts_to_file(
    416         text=text, speaker=speaker, language=language, file_path=fp.name, split_sentences=split_sentences
    417     )
    418 if self.voice_converter is None:
--> 419     self.load_vc_model_by_name("voice_conversion_models/multilingual/vctk/freevc24")
    420 wav = self.voice_converter.voice_conversion(source_wav=fp.name, target_wav=speaker_wav)
    421 return wav

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/api.py:157, in TTS.load_vc_model_by_name(self, model_name, gpu)
    155 self.model_name = model_name
    156 model_path, config_path, _, _, _ = self.download_model_by_name(model_name)
--> 157 self.voice_converter = Synthesizer(vc_checkpoint=model_path, vc_config=config_path, use_cuda=gpu)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/utils/synthesizer.py:101, in Synthesizer.__init__(self, tts_checkpoint, tts_config_path, tts_speakers_file, tts_languages_file, vocoder_checkpoint, vocoder_config, encoder_checkpoint, encoder_config, vc_checkpoint, vc_config, model_dir, voice_dir, use_cuda)
     98     self.output_sample_rate = self.vocoder_config.audio["sample_rate"]
    100 if vc_checkpoint:
--> 101     self._load_vc(vc_checkpoint, vc_config, use_cuda)
    102     self.output_sample_rate = self.vc_config.audio["output_sample_rate"]
    104 if model_dir:

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/utils/synthesizer.py:139, in Synthesizer._load_vc(self, vc_checkpoint, vc_config_path, use_cuda)
    137 # pylint: disable=global-statement
    138 self.vc_config = load_config(vc_config_path)
--> 139 self.vc_model = setup_vc_model(config=self.vc_config)
    140 self.vc_model.load_checkpoint(self.vc_config, vc_checkpoint)
    141 if use_cuda:

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/__init__.py:16, in setup_model(config, samples)
     14 if "model" in config and config["model"].lower() == "freevc":
     15     MyModel = importlib.import_module("TTS.vc.models.freevc").FreeVC
---> 16     model = MyModel.init_from_config(config, samples)
     17 return model

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:552, in FreeVC.init_from_config(config, samples, verbose)
    550 @staticmethod
    551 def init_from_config(config: FreeVCConfig, samples: Union[List[List], List[Dict]] = None, verbose=True):
--> 552     model = FreeVC(config)
    553     return model

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:370, in FreeVC.__init__(self, config, speaker_manager)
    368     self.enc_spk = SpeakerEncoder(model_hidden_size=self.gin_channels, model_embedding_size=self.gin_channels)
    369 else:
--> 370     self.load_pretrained_speaker_encoder()
    372 self.wavlm = get_wavlm()

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/models/freevc.py:381, in FreeVC.load_pretrained_speaker_encoder(self)
    379 """Load pretrained speaker encoder model as mentioned in the paper."""
    380 print(" > Loading pretrained speaker encoder model ...")
--> 381 self.enc_spk_ex = SpeakerEncoderEx(
    382     "https://github.com/coqui-ai/TTS/releases/download/v0.13.0_models/speaker_encoder.pt"
    383 )

File ~/miniconda/envs/tts/lib/python3.9/site-packages/TTS/vc/modules/freevc/speaker_encoder/speaker_encoder.py:45, in SpeakerEncoder.__init__(self, weights_fpath, device, verbose)
     42 checkpoint = load_fsspec(weights_fpath, map_location="cpu")
     44 self.load_state_dict(checkpoint["model_state"], strict=False)
---> 45 self.to(device)
     47 if verbose:
     48     print("Loaded the voice encoder model on %s in %.2f seconds." % (device.type, timer() - start))

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/module.py:1173, in Module.to(self, *args, **kwargs)
   1170         else:
   1171             raise
-> 1173 return self._apply(convert)

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/module.py:779, in Module._apply(self, fn, recurse)
    777 if recurse:
    778     for module in self.children():
--> 779         module._apply(fn)
    781 def compute_should_use_set_data(tensor, tensor_applied):
    782     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    783         # If the new tensor has compatible tensor type as the existing tensor,
    784         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    789         # global flag to let the user control whether they want the future
    790         # behavior of overwriting the existing tensor or not.

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:222, in RNNBase._apply(self, fn, recurse)
    217 ret = super()._apply(fn, recurse)
    219 # Resets _flat_weights
    220 # Note: be v. careful before removing this, as 3rd party device types
    221 # likely rely on this behavior to properly .to() modules like LSTM.
--> 222 self._init_flat_weights()
    224 return ret

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:158, in RNNBase._init_flat_weights(self)
    154 self._flat_weights = [getattr(self, wn) if hasattr(self, wn) else None
    155                       for wn in self._flat_weights_names]
    156 self._flat_weight_refs = [weakref.ref(w) if w is not None else None
    157                           for w in self._flat_weights]
--> 158 self.flatten_parameters()

File ~/miniconda/envs/tts/lib/python3.9/site-packages/torch/nn/modules/rnn.py:209, in RNNBase.flatten_parameters(self)
    207 if self.proj_size > 0:
    208     num_weights += 1
--> 209 torch._cudnn_rnn_flatten_weight(
    210     self._flat_weights, num_weights,
    211     self.input_size, rnn.get_cudnn_mode(self.mode),
    212     self.hidden_size, self.proj_size, self.num_layers,
    213     self.batch_first, bool(self.bidirectional))

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
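
Reading the trace bottom-up: use_cuda=False is passed correctly through Synthesizer._load_vc, but FreeVC.load_pretrained_speaker_encoder constructs the speaker encoder without any device argument, so the encoder picks a device on its own. A minimal runnable sketch of the suspected mechanism (a hypothetical stand-in class with simplified defaults inferred from the traceback, not the actual TTS source):

import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    # Hypothetical stand-in for the FreeVC speaker encoder; simplified,
    # not the real TTS implementation.
    def __init__(self, device=None):
        super().__init__()
        self.lstm = nn.LSTM(input_size=40, hidden_size=256, num_layers=3)
        if device is None:
            # Auto-selects CUDA whenever it is visible, ignoring the CPU
            # choice made higher up the stack; .to() then flattens the
            # LSTM weights via cuDNN, which is where the error is raised.
            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.to(device)  # corresponds to speaker_encoder.py line 45 above

encoder = SpeakerEncoderSketch()  # lands on the GPU on any CUDA machine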

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3060 Laptop GPU"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.3.1+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "",
        "python": "3.9.0",
        "version": "#1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)"
    }
}

Additional context

Note: Even though I do have CUDA and an NVIDIA GPU on this laptop, I want to use the CPU because the GPU doesn't have enough VRAM for the model I want to run.
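
Until the device handling is fixed, one possible workaround is to hide the GPU from PyTorch entirely before it is imported, so that torch.cuda.is_available() returns False and any device auto-detection falls back to the CPU. A sketch (untested here, and it assumes nothing else in the process needs the GPU):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must be set before torch is imported

from TTS.api import TTS  # imports torch internally

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cpu")
tts.tts_with_vc_to_file(
    text="Hello world!",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)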

eginhard commented 5 days ago

The XTTS model natively supports voice cloning, so just use the following (and pass only one of speaker and speaker_wav, depending on which you need):

from TTS.api import TTS

device = "cpu"
print(device)

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

tts.tts_to_file(
    text="Hello world!",
    speaker="Andrew Chipper",
    speaker_wav="/path/to/voice_sample.wav",
    language="en",
    file_path="/path/to/outputs/xttsv2_en_output.wav",
)

This should run correctly on the CPU. The with_vc variant passes the already cloned output through an additional voice conversion model (FreeVC), which isn't necessary here and probably leads to worse results.
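
For cases where the extra conversion step is actually wanted, tts_with_vc_to_file is roughly equivalent to running the two models explicitly, along these lines (a sketch based on the tts_with_vc code visible in the traceback above; paths are placeholders):

import os
import tempfile

from TTS.api import TTS

device = "cpu"

# Step 1: XTTS synthesis into a temporary file -- this step alone already
# clones the voice in speaker_wav.
fp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
fp.close()
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(text="Hello world!", speaker_wav="/path/to/voice_sample.wav",
                language="en", file_path=fp.name)

# Step 2: FreeVC voice conversion of the synthesized audio -- the step that
# currently fails on CPU because of the device bug in this issue.
vc = TTS("voice_conversion_models/multilingual/vctk/freevc24").to(device)
vc.voice_conversion_to_file(source_wav=fp.name,
                            target_wav="/path/to/voice_sample.wav",
                            file_path="/path/to/outputs/converted.wav")
os.remove(fp.name)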

pieris98 commented 5 days ago

Hey Enno, thanks a lot for the pointer. I didn't realise that some models have voice cloning built in rather than needing tts.tts_with_vc_to_file().

I was then trying to run the model in tts-server and noticed issue #3369, so I just wanted to point it out, as it seems more important to solve in the codebase.