Real Time Streaming - Githubissues

mercuryyy commented 11 months ago

Is it possible, when a TTS command is executed, to stream the results in chunks to something like a Temp_stream.wav file that is playable immediately, while it is still being created?

So for example, if I am synthesizing 100 words and it takes 4 seconds, I want the audio to start playing at the point the TTS command is executed rather than after the whole file is finished. That would essentially be real time.

Piper does this - https://github.com/rhasspy/piper

    echo 'This sentence is spoken first. This sentence is synthesized while the first sentence is spoken.' | \
      ./piper --model en_US-lessac-medium.onnx --output-raw | \
      aplay -r 22050 -f S16_LE -t raw -

The models on Coqui TTS are better but take longer to execute. If we can stream, that wouldn't matter and we could still do real time.
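One way to get a .wav file that can be opened while it is still growing is to write the header up front with placeholder sizes and then append audio as it arrives. The sketch below is only an illustration of that idea: the function names are hypothetical, the default format (22050 Hz, mono, 16-bit) is copied from the Piper example rather than from any Coqui model, and whether a given player tolerates the 0xFFFFFFFF size fields varies.

```python
import struct


def streaming_wav_header(sample_rate=22050, channels=1, bits_per_sample=16):
    """Build a 44-byte PCM WAV header with the size fields set to the maximum.

    The final length is unknown while streaming, so the RIFF and data sizes
    are filled with 0xFFFFFFFF as a placeholder.
    """
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    return (
        b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"
        + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                byte_rate, block_align, bits_per_sample)
        + b"data" + struct.pack("<I", 0xFFFFFFFF)
    )


def write_wav_stream(path, pcm_chunks, **fmt):
    """Write the header first, then append and flush each PCM chunk as it arrives."""
    with open(path, "wb") as f:
        f.write(streaming_wav_header(**fmt))
        f.flush()
        for chunk in pcm_chunks:  # each chunk is raw little-endian PCM bytes
            f.write(chunk)
            f.flush()             # make the new audio visible to readers right away
```

Here `pcm_chunks` can be any generator that yields bytes, so the file grows at the same pace as synthesis.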

Sascha353 commented 11 months ago

Do you mean audio streaming decoupled from the main text-generation-webui wav handling? A TTS-engine-independent solution inside the webui would be best, but as a workaround it could be implemented in alltalk_tts or any TTS extension, without returning any audio chunks to the webui and instead streaming them directly with a library like sounddevice. One disadvantage would be that the user has no control to pause, stop or continue the playback.
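A rough sketch of that workaround, assuming the TTS engine can hand over 16-bit mono PCM as byte chunks. `play_chunks` and its `open_stream` parameter are hypothetical names; the only sounddevice call used is `RawOutputStream`, and the default sample rate is a placeholder that has to match whatever the model really outputs.

```python
def play_chunks(chunks, sample_rate=22050, open_stream=None):
    """Play int16 mono PCM byte chunks as they arrive and return the bytes played.

    open_stream is injectable (useful for tests); by default it opens a
    sounddevice.RawOutputStream, which needs `pip install sounddevice`.
    """
    if open_stream is None:
        import sounddevice as sd  # imported lazily so this module loads without it

        def open_stream():
            return sd.RawOutputStream(samplerate=sample_rate, channels=1,
                                      dtype="int16")

    played = 0
    with open_stream() as stream:    # entering the context starts the stream
        for chunk in chunks:
            stream.write(chunk)      # blocks while the device buffer is full
            played += len(chunk)
    return played
```

As noted above, nothing here gives the user pause, stop or resume; that would have to be built separately.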

mercuryyy commented 11 months ago

I can implement it later into text-generation-webui. The main thing I am trying to achieve is being able to generate an instantly playable .wav file that can be streamed in chunks, so we can achieve real-time TTS.

The main thing is streaming the raw audio to stdout as it's produced.
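To make "raw audio to stdout" concrete, here is a minimal sketch that converts float samples to signed 16-bit little-endian PCM and flushes every chunk. It assumes the engine yields lists of floats in the range [-1.0, 1.0]; both function names are made up for illustration.

```python
import struct
import sys


def floats_to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]  # clip, then scale
    return struct.pack("<%dh" % len(ints), *ints)


def stream_raw(chunks, out=None):
    """Write each chunk to a binary stream, flushing so a player downstream gets it at once."""
    out = out if out is not None else sys.stdout.buffer
    for samples in chunks:
        out.write(floats_to_s16le(samples))
        out.flush()
```

A script built around `stream_raw` could then be piped into `aplay -f S16_LE -t raw -` the same way as the Piper command, with `-r` set to the model's actual output rate.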

erew123 commented 11 months ago

Streaming is possible with this https://github.com/KoljaB/RealtimeTTS though that is another step down the line. My current workload is re-tidying all the documentation, both on this github and within the app, and catching a few minor bugs/issues.

Then I'm working on the new API for 3rd party/standalone, which is 70-80% completed.

From there, I'll look at options for other TTS engines and features such as the above. However, it's worth noting there is a memory overhead for this, and there will need to be coding around certain things like the LowVRAM option. The two are not incompatible as such, but you would be shuffling the TTS model between VRAM and system RAM all the time, resulting in zero gain and probably a lot of complaints around speed.

Sascha353 commented 11 months ago

There are a few more things to consider:

How and when does text input reach the TTS engine: To receive answers as fast as possible we should start here. TG-webui can be used in streaming mode or normal mode. Normal mode is what is currently used by all TTS engines AFAIK, but it is obviously not the best option in terms of speed, as synthesis starts only after the whole reply has been generated. Streaming mode would feed the TTS engine individual words, which can't be used to generate a coherent sentence. So if we talk about instant or real-time TTS, we are talking about sentence-by-sentence and not word-by-word streaming. There are two options here: add "sentence-streaming" in TG-webui, or add a feature to a TTS extension which gathers individual words from TG-webui in streaming mode and waits until at least one sentence has been generated. For each complete sentence it calls the TTS engine for synthesis. Sentences could also be very short, so there must be a little more logic to it, such as waiting for a certain number of characters/tokens.
I'm not sure if RealtimeTTS is capable of doing this. I know it can split a whole paragraph into sentences, but does it also work with word-by-word streaming from the text generation?
Parallelisation: The "word-listener", TTS synthesis and playback of audio must be done in parallel.
xtts actually has a native streaming mode, which I have not tested yet and which they themselves did not use in their Voice chat space. In that space they do pretty much what should be the fastest way of getting audio results from the TTS; they also utilized gradio so that the streamed audio can still be controlled.
It's true that this would be a feature which mostly benefits systems with spare VRAM, as it would do text generation and speech synthesis in parallel and all the models must be loaded. @erew123 do you think it could be an optional feature for people who most likely wouldn't use the LowVRAM feature anyway, or is it out of scope due to the LowVRAM "incompatibility"?
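The sentence gathering and the parallel stages described above could be sketched roughly as follows. Everything here is hypothetical: `sentences_from_tokens`, `run_pipeline`, the `min_chars` threshold and the `synthesize`/`play` callables are placeholders, not part of TG-webui, AllTalk or RealtimeTTS. The sentence test is a naive punctuation check, so abbreviations and decimals would need extra handling.

```python
import queue
import re
import threading

SENTENCE_END = re.compile(r"[.!?]\s*$")


def sentences_from_tokens(tokens, min_chars=20):
    """Group streamed text fragments into sentence-sized pieces.

    A piece is emitted once the buffer ends in '.', '!' or '?' and holds at
    least min_chars characters, so very short sentences merge with the next one.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        if len(buf.strip()) >= min_chars and SENTENCE_END.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever is left when generation ends


def run_pipeline(tokens, synthesize, play, min_chars=20):
    """Listen for text on this thread while synthesis and playback run on their own threads."""
    text_q, audio_q = queue.Queue(), queue.Queue()
    done = object()  # sentinel that tells a worker to stop

    def synth_worker():
        while (text := text_q.get()) is not done:
            audio_q.put(synthesize(text))
        audio_q.put(done)

    def play_worker():
        while (audio := audio_q.get()) is not done:
            play(audio)

    workers = [threading.Thread(target=synth_worker),
               threading.Thread(target=play_worker)]
    for w in workers:
        w.start()
    for sentence in sentences_from_tokens(tokens, min_chars):
        text_q.put(sentence)
    text_q.put(done)
    for w in workers:
        w.join()
```

With one worker per stage the sentence order is preserved, and sentence N+1 can be synthesized while sentence N is still playing, which is where the extra VRAM requirement comes from.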

mercuryyy commented 11 months ago

@Sascha353 great overview, and great find on https://docs.coqui.ai/en/dev/models/xtts.html#streaming-manually - it should be easy to implement into the addon.

erew123 commented 11 months ago

@Sascha353 @mercuryyy "xtts actually has a native streaming mode which I did not test yet" - I've not tested it either, but had spotted it. I am curious how well it will perform compared to the other one I suggested, and how it deals with sentence breakdown.

"Should be easy to implement into the addon." - Yes and no. All the other code will need caveats around it, e.g. making sure that low VRAM is disabled when people use such a mode. It's probable that it won't work with the narrator function, depending on how it would stream split sentences, so it would need testing and then potentially code to flip that off and notify the user. Then of course I'll need to document it because, if I don't, I'll be getting all the questions about why X isn't working correctly etc.

"is out of scope, due to the LowVram 'incompatibility'?" - Technically speaking, not out of scope. However, I've only built AllTalk in the last few weeks. There's been good adoption, but I've also been fighting some fires here and there, dealing with a couple of minor hiccups, and helping the less technical people who have struggled with some things. Hence my focus on getting AllTalk very stable in its current form, with very clear documentation and good troubleshooting (I just added a basic diagnostic utility today, cleaned up the whole built-in documentation, and rewrote the whole github front page). I think I've spent about 14 hours on documentation over the last day or so. My next goal is to complete and document the JSON requests API for 3rd party apps. From there, potentially other TTS engines and things such as streaming.

erew123 commented 11 months ago

@Sascha353 @mercuryyy Let me ask you both a question as this also has considerations. Where would you both want the streaming output to be played? e.g. within Text-generation-webui's interface as it generates content? Over the API and back to your own player of some kind? Over the API and through a built in Python based player that runs within the AllTalk Python process?

mercuryyy commented 11 months ago

@erew123 First choice would be "Over the API and back to your own player of some kind", sort of a live-streamed .wav. I was playing with the TTS built-in options and got it to somewhat work outside of AllTalk; it is not bad at all.

Sascha353 commented 11 months ago

I tend to aim for the "best" option first and reduce the scope if needed, based on feasibility and resources. In my opinion, it would be best if audio output is sent to TG-webui, as there is already a player in gradio where the user can interact with the audio file/stream. This makes it generally more accessible and understandable for the user, as no other output/player is introduced. I know the gradio player is capable of supporting a wav stream (utilized in the Coqui space). However, I don't think TG-webui is ready to receive and handle audio chunks, as receiving and working with one full wav file coming from the TTS engine is obviously vastly different from handling a stream of incoming audio chunks. I described some of the challenges already in my FR here.

As a proof of concept, probably the easiest approach would be to stream directly using a library like sounddevice or PyAudio. In that case the TTS extension should have auto play disabled, so that the audio is not played once by the streaming feature and then again in the webui after the full wav is generated and transferred.

It's just my opinion but I would not introduce another UI to control the stream. The user is working inside the webui and usability and immersion drops if you have to switch apps, windows, tabs etc.

erew123 / alltalk_tts

Real Time Streaming #6