erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

Streaming TTS in SillyTavern/Ollama #186

Closed: baconsplit closed this issue 4 months ago

baconsplit commented 4 months ago

Describe the request I am looking for a way to have AllTalk generate a streaming TTS voice while my LLM in SillyTavern is still generating the output text. Is that possible? It would reduce the total time for text generation plus TTS. Right now AllTalk waits for my LLM/SillyTavern to finish generating text, and only then begins to generate a voice.

To Reproduce Have SillyTavern with a local LLM via Ollama and a local AllTalk setup, hit "generate", and watch how the text generation is output as a stream but the TTS only begins generating after the text stream is done.

Desktop (please complete the following information): AllTalk was updated: 25.04.24 Custom Python environment: default installation on Windows Text-generation-webUI was updated: using Ollama

Additional context Sorry if this is the wrong place to ask for help.

erew123 commented 4 months ago

Hi @baconsplit

I've tried to write this 3x as I've had multiple thoughts on it. Currently the answer is no. The reasons are:

1) SillyTavern is what decides when to send text to an extension/TTS engine, and currently it will only play back that 1x request it sent. So SillyTavern would first have to be capable of streaming the text to the TTS engine as it's streamed into ST from the LLM.

2) Although the Coqui engine calls the resulting TTS generation "streaming", it still has to be handed the 1x block of text in one lump. In theory, you could generate TTS for the first sentence from the LLM and then keep appending the next bits as they are streamed in. But again, there is no way (as far as I am aware) for ST to send a stream of multiple chunks to a TTS extension, and of course the same goes for any of the built-in TTS engines.
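To illustrate the "generate the first sentence, then keep adding the next bits" idea, here is a minimal sketch of sentence-level chunking of a streamed LLM reply. It is not AllTalk or SillyTavern code; the fragment list and the idea of dispatching each yielded sentence to a TTS call are assumptions for illustration only.

```python
import re

# A sentence is considered complete when ., !, or ? is followed by whitespace.
SENTENCE_END = re.compile(r'([.!?])\s')

def stream_sentences(token_stream):
    """Accumulate streamed text fragments and yield complete sentences.

    Each yielded sentence could, in principle, be handed to a TTS engine
    while the LLM is still generating the rest of the reply.
    """
    buffer = ""
    for fragment in token_stream:
        buffer += fragment
        # Flush every complete sentence currently in the buffer.
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            end = match.end(1)
            yield buffer[:end].strip()
            buffer = buffer[end:]
    # Emit whatever trailing text remains once the stream closes.
    if buffer.strip():
        yield buffer.strip()

# Hypothetical fragments as they might arrive from a streaming LLM.
fragments = ["Hello the", "re! How a", "re you? I am f", "ine."]
print(list(stream_sentences(fragments)))
# → ['Hello there!', 'How are you?', 'I am fine.']
```

The hard part is not the chunking itself but the plumbing: ST would need to forward fragments like these to the extension as they arrive, rather than one finished block.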

So it would require a code change on the ST side before I could even think about trying to implement something like this.

Sorry.

baconsplit commented 4 months ago

Alright, understood. I already suspected a limitation in ST itself, because under the AllTalk option there is no way to enable streaming. But I thought asking here might not hurt.

I'll keep an eye on ST's development then; AllTalk is working here as expected. It's pretty good btw :)

erew123 commented 4 months ago

No harm in asking! :)

If I had no coding to do myself, I might take a look at the ST code, but it's not code I know in any way, and I am deep, deep in a lot of code currently for v2 of AllTalk, so I'm not looking at anything else for a long, long time.

Thanks