I'm doing TTS streaming locally, with the input text also arriving progressively over a stream (text-streaming).
I know inference_stream is theoretically enough for this case, except that the setup at the beginning is repeated on every call. Repeating it isn't so bad, but it would be nicer to skip it since it's not necessary:
language = language.split("-")[0] # remove the country code
length_scale = 1.0 / max(speed, 0.05)
gpt_cond_latent = gpt_cond_latent.to(self.device) # nicer to be able to skip when doing text-streaming
speaker_embedding = speaker_embedding.to(self.device) # nicer to be able to skip when doing text-streaming
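To make the first two setup lines concrete, here is a small standalone sketch of what they compute. The helper name prepare_stream_params is hypothetical and not part of the model; it only mirrors the two lines quoted above.

```python
def prepare_stream_params(language: str, speed: float) -> tuple[str, float]:
    """Hypothetical helper mirroring the two setup lines quoted above."""
    language = language.split("-")[0]  # drop the country code, e.g. "en-US" -> "en"
    length_scale = 1.0 / max(speed, 0.05)  # clamp speed so we never divide by ~0
    return language, length_scale


print(prepare_stream_params("en-US", 2.0))  # ('en', 0.5)
print(prepare_stream_params("pt", 4.0))     # ('pt', 0.25)
```

Both values depend only on the call arguments, not on the text, which is why they only need to be computed once per stream.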
So I've added inference_stream_text (maybe not the best name; let me know if you prefer another) specifically for text-streaming. For example:
def text_streaming_generator():
    yield "It took me quite a long time to develop a voice and now that I have it I am not going to be silent."
    yield "Having discovered not just one, but many voices, I will champion each."

print("Inference with text streaming...")
text_gen = text_streaming_generator()
inf_gen = model.inference_stream_text(
    # note `text` param not provided as it will be streamed
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for text in text_gen:
    # Add text progressively
    model.inference_add_text(text, enable_text_splitting=True)
    # Iterate the generator directly (no enumerate) so `chunk` is the audio chunk itself
    for chunk in inf_gen:
        if chunk is None:
            break  # all chunks generated for the current text
        print(f"Received chunk {len(wav_chunks)} of audio length {chunk.shape[-1]}")
        wav_chunks.append(chunk)

# Call finalize to discard the inference generator
model.inference_finalize_text()
IMO this also makes for a nicer interface when doing text-streaming, but I'll leave it to you to decide :)
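For anyone who wants to see the control flow in isolation, here is a toy sketch of the pattern: one-time setup, a queue of pending text, and a None sentinel once the queue is drained. It is not the actual implementation in this PR. The class TextStreamingSketch, its string "chunks", and the setup_calls counter are all made up for illustration, and the method signatures are simplified.

```python
from collections import deque
from typing import Deque, Iterator, Optional


class TextStreamingSketch:
    """Toy stand-in for the model; illustrates the control flow only."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self._finalized = False
        self.setup_calls = 0  # counts how often the one-time setup runs

    def inference_stream_text(self, language: str) -> Iterator[Optional[str]]:
        # One-time setup: runs once, when the generator is first advanced.
        self.setup_calls += 1
        language = language.split("-")[0]
        while not self._finalized:
            while self._pending:
                text = self._pending.popleft()
                # Placeholder "synthesis": one fake chunk per word.
                for word in text.split():
                    yield f"[{language}] {word}"
            # Queue drained: signal the caller that it can add more text.
            yield None

    def inference_add_text(self, text: str) -> None:
        self._pending.append(text)

    def inference_finalize_text(self) -> None:
        # The generator exits the next time it is advanced.
        self._finalized = True


model = TextStreamingSketch()
inf_gen = model.inference_stream_text("en-US")
chunks = []
for text in ["first sentence", "second sentence"]:
    model.inference_add_text(text)
    for chunk in inf_gen:
        if chunk is None:
            break  # current text fully consumed; the generator stays alive
        chunks.append(chunk)
model.inference_finalize_text()

print(chunks)             # ['[en] first', '[en] sentence', '[en] second', '[en] sentence']
print(model.setup_calls)  # 1
```

Breaking out of the inner loop does not close the generator as long as inf_gen is still referenced, so the setup at the top runs once no matter how many pieces of text are added.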
Cheers! 🍻