Closed: henghamao closed this issue 1 month ago.
Hello, to generate voice you can compute the gpt_cond_latent and speaker_embedding that you then give to the model (they are specific to the speaker_wav):
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=path_audio)
out = model.inference(
    text,
    language,
    gpt_cond_latent,
    speaker_embedding,
)
Assuming you are working with xtts_v2, you can save gpt_cond_latent and speaker_embedding to gain some inference time later.
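One way to do that saving (a sketch, not the project's own code: the cache directory name and helper functions below are made up, and `model` is assumed to be an already loaded XTTS v2 instance) is to store the two tensors with torch.save, keyed by the reference wav path, and torch.load them on later runs:

```python
import hashlib
import os


def latents_cache_path(speaker_wav: str, cache_dir: str = "latent_cache") -> str:
    """Derive a stable cache file name from the reference wav path."""
    digest = hashlib.md5(speaker_wav.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{digest}.pt")


def get_or_compute_latents(model, speaker_wav: str):
    """Reuse cached conditioning latents when available, else compute and cache them."""
    import torch  # local import so latents_cache_path stays usable without torch

    path = latents_cache_path(speaker_wav)
    if os.path.exists(path):
        data = torch.load(path)
        return data["gpt_cond_latent"], data["speaker_embedding"]

    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[speaker_wav]
    )
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save(
        {"gpt_cond_latent": gpt_cond_latent, "speaker_embedding": speaker_embedding},
        path,
    )
    return gpt_cond_latent, speaker_embedding
```

With this, only the first call for a given speaker pays the conditioning cost; subsequent runs just load two tensors from disk.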
Thanks. The method looks good. But what is the data type of 'model'? We tried to find a 'model' that has 'get_conditioning_latents()', but did not succeed.
from TTS.tts.models import xtts as model  # is this the right model?
model.inference(...)
Personally, I use the TTS Model API because I find it more convenient and flexible. You can find the documentation here: https://docs.coqui.ai/en/dev/models/xtts.html#tts-model-api.
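For reference, loading through that API looks roughly like the sketch below (the model directory path is a placeholder for wherever your XTTS v2 download lives, and `xtts_config_path`/`load_xtts` are hypothetical helper names). The resulting `model` is an instance of `TTS.tts.models.xtts.Xtts`, which is the object that carries `get_conditioning_latents()`:

```python
import os


def xtts_config_path(model_dir: str) -> str:
    """config.json sits next to the checkpoint files inside the XTTS download dir."""
    return os.path.join(model_dir, "config.json")


def load_xtts(model_dir: str):
    """Load XTTS v2 from a local download; returns a TTS.tts.models.xtts.Xtts."""
    # TTS/torch imports kept local so this file imports even without them installed.
    import torch
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    config = XttsConfig()
    config.load_json(xtts_config_path(model_dir))
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir=model_dir, eval=True)
    if torch.cuda.is_available():
        model.cuda()
    return model
```

Usage would then be `model = load_xtts("/path/to/xtts_v2")`, after which `model.get_conditioning_latents(...)` and `model.inference(...)` are available as in the snippet above.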
Thanks! Using that method, we could calculate gpt_cond_latent and speaker_embedding and reuse them. However, we found that the speaker embedding calculation is fast and the majority of the time cost comes from model.inference(). In addition, model.inference() does not handle long token sequences the way tts.tts_to_file() does, so we might need to do the segmentation ourselves. Anyway, the problem is solved. To achieve real-time voice generation we might need to try streaming.
Actually model.inference() can handle long token sequences if you pass enable_text_splitting=True. For streaming you can use model.inference_stream(). With xtts_v2, apart from using a GPU, nothing is implemented that really speeds up generation.
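A streaming sketch, assuming the model and latents from the earlier snippets (the `stream_speech` and `seconds_of_audio` helper names are made up): inference_stream yields waveform chunks as they are generated, and since XTTS v2 outputs 24 kHz audio you can track how many seconds of playable audio have arrived so far:

```python
XTTS_SAMPLE_RATE = 24000  # XTTS v2 generates audio at 24 kHz


def seconds_of_audio(num_samples: int, sample_rate: int = XTTS_SAMPLE_RATE) -> float:
    """How much playable audio a given number of samples represents."""
    return num_samples / sample_rate


def stream_speech(model, text, language, gpt_cond_latent, speaker_embedding):
    """Collect chunks from inference_stream. In a real-time app you would play
    each chunk as soon as it arrives instead of waiting for the full utterance."""
    import torch  # local import so seconds_of_audio stays usable without torch

    chunks = []
    for chunk in model.inference_stream(
        text, language, gpt_cond_latent, speaker_embedding
    ):
        chunks.append(chunk)
        print(f"received {seconds_of_audio(chunk.shape[-1]):.2f}s of audio")
    return torch.cat(chunks, dim=0)
```

The latency win comes from acting on the first chunk immediately; concatenating at the end is only for saving the full clip afterwards.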
Feature Description
Thanks for the project. We had a try and the result is pretty good. However, there is one major issue: voice cloning is slow, especially when inferring on CPU. We might need to generate voice several times from the same speaker; could we speed up the process?
Solution
Here is how we use the code:
Since we might need to clone the same speaker and generate voice several times, is it possible to speed up the process? Could we export some intermediate results or a fine-tuned model, and reuse or reload them next time? We would expect the voice generation speed to be as fast as using a single model.