coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Feature request] Speed up voice cloning from the same speaker #3847

Closed · henghamao closed this issue 1 month ago

henghamao commented 1 month ago

🚀 Feature Description

Thanks for this project. We gave it a try and the results are pretty good. However, there is one major issue: voice cloning is slow, especially when running inference on CPU. Since we may need to generate speech several times from the same speaker, could we speed up the process?

Solution

Here is how we use the code:

    from TTS.api import TTS

    # Init TTS with the multilingual XTTS v2 model
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    # Text to speech to a file, cloning the voice from speaker_wav
    tts.tts_to_file(text=words, language="zh-cn", file_path=out, speaker_wav="data/audio/sun1.wav")

Since we might need to clone the same speaker and generate speech several times, is it possible to speed up the process? Could we export some intermediate results or a fine-tuned model, and reuse or reload them the next time? Ideally, the generation speed would be close to that of a single pre-loaded model.

Alternative Solutions

Additional context

greg2705 commented 1 month ago

Hello, to generate the voice you can compute gpt_cond_latent and speaker_embedding and pass them to the model (they are specific to the speaker_wav):

    # Compute the speaker-specific latents once from the reference audio
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=path_audio)

    # Reuse the latents for every generation from the same speaker
    out = model.inference(
        text,
        language,
        gpt_cond_latent,
        speaker_embedding,
    )

Assuming you are working with xtts_v2, you can save gpt_cond_latent and speaker_embedding to gain some inference time later.
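
Here is a minimal sketch of caching those latents on disk, assuming you already have a loaded Xtts model object and using plain torch.save/torch.load for serialization (the latents.pt file name and the reference wav path are just placeholders):

    import torch

    # Compute the speaker-specific conditioning latents once (this is the cloning step)
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=["data/audio/sun1.wav"]
    )

    # Cache them so later runs can skip processing the reference audio again
    torch.save(
        {"gpt_cond_latent": gpt_cond_latent, "speaker_embedding": speaker_embedding},
        "latents.pt",
    )

    # In a later run, reload the cached latents instead of recomputing them
    cached = torch.load("latents.pt")
    out = model.inference(
        text,
        language,
        cached["gpt_cond_latent"],
        cached["speaker_embedding"],
    )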

henghamao commented 1 month ago

Thanks. The method looks good. But what is the type of 'model'? We tried to find which 'model' exposes get_conditioning_latents(), but did not succeed.

    from TTS.tts.models import xtts as model   # is this the right model?

    model.inference(...)
greg2705 commented 1 month ago

Personally, I use the TTS Model API because I find it more convenient and flexible. You can find the documentation here: https://docs.coqui.ai/en/dev/models/xtts.html#tts-model-api.
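
Roughly, the manual loading path from that documentation looks like the sketch below (a minimal example, assuming you have downloaded the XTTS v2 checkpoint locally; the /path/to/xtts/ paths are placeholders):

    import torch
    import torchaudio
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    # Load the XTTS v2 config and checkpoint from a local directory
    config = XttsConfig()
    config.load_json("/path/to/xtts/config.json")
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", eval=True)
    model.cuda()  # optional, keep on CPU if no GPU is available

    # 'model' is now an Xtts instance exposing get_conditioning_latents() and inference()
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=["data/audio/sun1.wav"]
    )
    out = model.inference("Hello world", "en", gpt_cond_latent, speaker_embedding)

    # out["wav"] holds the generated waveform at 24 kHz
    torchaudio.save("output.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)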

henghamao commented 1 month ago

Thanks! Using this method, we can compute gpt_cond_latent and speaker_embedding and reuse them. However, we found that computing the speaker embedding is fast and the bulk of the time cost is in model.inference(). In addition, model.inference() does not handle long token sequences the way tts.tts_to_file() does, so we might need to segment the text ourselves. Anyway, the problem is solved. To achieve real-time voice generation we might need to try streaming.

greg2705 commented 1 month ago

Actually, model.inference() can handle long token sequences if you pass the argument enable_text_splitting=True. For streaming you can use model.inference_stream(). With xtts_v2, apart from using a GPU, there is nothing implemented that really speeds up generation.
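
For reference, a minimal streaming sketch, assuming the model and the conditioning latents computed above (the chunk handling and output path are only illustrative):

    import torch
    import torchaudio

    # inference_stream() yields audio chunks as they are generated
    chunks = model.inference_stream(
        text,
        language,
        gpt_cond_latent,
        speaker_embedding,
    )

    wav_chunks = []
    for i, chunk in enumerate(chunks):
        # Each chunk is a 1-D tensor of samples; play or buffer it as it arrives
        print(f"Received chunk {i} with {chunk.shape[-1]} samples")
        wav_chunks.append(chunk)

    # Concatenate the chunks into the final 24 kHz waveform
    wav = torch.cat(wav_chunks, dim=0)
    torchaudio.save("stream_out.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)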