erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features: a settings page, low-VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

[Feature Request]: Package streaming End-to-End STT to TTS #218

Closed · Katehuuh closed this issue 2 months ago

Katehuuh commented 2 months ago

I’ve seen the streaming TTS PR. Like the simple STT-to-TTS loop available in SillyTavern, it doesn't require any action from the user. I thought you could add my whisper script: I’ve made a fast STT that runs alongside TTS (alltalk_tts), combined (or optionally) with my ooba extension's fast STT script, to make a packaged streaming end-to-end STT-to-TTS, so the user can answer naturally without having to Record/press Enter as with the default whisper_stt extension.

While it works fine with the auto Enter-key workaround, I did not find a way to trigger Generate via JS in Gradio streaming mode.

erew123 commented 2 months ago

Hi @Katehuuh

I have been considering adding whisper into AllTalk in a couple of ways, so this could quite well fit into that :)

So let me just ask a couple of questions on this:

1) If I am understanding correctly, this would be an always-on microphone scenario (or we could make it a checkbox for "keep the microphone on when this checkbox is selected"), and you can just naturally interact via speech, with it auto-submitting the STT generation back into text-gen-webui. Have I got that correct as a loose understanding?

2) I see you have tested on Windows, so I would need to test on Linux; and if I can find someone who has a Mac, I can get them to test.

3) As we aren't using the streaming TTS just yet (waiting to see if it gets approved), we may have to figure out how this all interacts. I'm not yet sure how easy it is to stop/cancel the streaming TTS generation from Text-gen-webui. In AllTalk v2 I am building in a way of stopping TTS generation (if the text has already been sent to AllTalk), but I have no idea how Text-gen-webui can be sent a "stop sending the text over for TTS".

Where this all gets complicated is multi-threaded requests within Python and access to the GPU cores. Meaning that if the LLM is controlling all the tensor cores of a GPU, it may not be happy also trying to generate TTS on those cores at the same time... I'll have to think on this and look at it when we can play with the streaming generation. I guess I'm mostly putting this number 3 in here for my own reference/thoughts for when I get to look at this again.

4) Would you just be able to explain this a little bit more for me: "While it works fine with auto enter key workaround, I did not find way to Generate JS in Gradio streaming." I'm assuming you are saying that this function:

import keyboard  # the `keyboard` module provides the auto-Enter workaround

def generate_transcribe():
    keyboard.send("enter")

was the only way to commit the generated STT to the chat? I know that text-gen-webui has recently moved from Gradio 3.5.2 to 4.28 (I think), so maybe there are some better options within that version. What would be the benefit of moving to "Generate JS in Gradio streaming"?

Sorry for the questions, I'm just trying to get this fixed in my head! And thanks for offering your code! :)

Katehuuh commented 2 months ago

I’ve modified the default whisper_stt extension to create ooba-insanely-fast-whisper. I suggest you do the same: start from the simple default whisper_stt in alltalk_tts, both for its simplicity and because what I have is only a workaround.

> 1. If I am understanding correctly, this would be an always on microphone scenario (or we could make it a checkbox for "keep the microphone on when this checkbox is selected"), and you can just naturally interact via speech, with it auto submitting the STT generation back into text-gen-webui. Have I got that correct as a loose understanding?

Mostly. The "always on microphone" part is handled by Gradio (see its Real Time Speech Recognition guide): change `audio = gr.Audio(source="microphone")` to `audio = gr.Audio(source="microphone", streaming=True)`.

The «loop» then repeats these steps:

- Use _Silero VAD_'s `speech_prob` to detect `"silence"`, even with background noise or the voice from the **alltalk_tts** TTS.
- Use [insanely-fast-whisper](https://github.com/Vaibhavs10/insanely-fast-whisper?tab=readme-ov-file#how-to-use-whisper-without-a-cli) as the STT, for speed (flash_attn_2).
- Combine chunks; a non-final chunk may end mid-sentence. If it is the last chunk, then `Generate`.
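To make that loop concrete, here is a minimal sketch of the accumulate-until-silence logic in plain Python. `speech_prob`, `transcribe`, and `generate` are hypothetical stand-ins for Silero VAD, insanely-fast-whisper, and the Generate submit respectively; none of these names come from the actual extension.

```python
SILENCE_THRESHOLD = 0.5   # speech probability below this counts as silence
SILENCE_CHUNKS = 3        # consecutive silent chunks that end an utterance

def stt_loop(chunks, speech_prob, transcribe, generate):
    """Accumulate audio chunks until sustained silence, then submit.

    chunks      -- iterable of audio chunks (e.g. numpy arrays)
    speech_prob -- chunk -> probability of speech (Silero VAD stand-in)
    transcribe  -- list of chunks -> text (insanely-fast-whisper stand-in)
    generate    -- callback that submits the finished text to the chat
    """
    buffer, silent = [], 0
    for chunk in chunks:
        if speech_prob(chunk) < SILENCE_THRESHOLD:
            silent += 1          # background noise / TTS pause: count it
        else:
            silent = 0           # speech resumed: reset the silence counter
            buffer.append(chunk)
        # Sustained silence after some speech: treat it as the last chunk.
        if buffer and silent >= SILENCE_CHUNKS:
            generate(transcribe(buffer))
            buffer, silent = [], 0
```

With this shape, a chunk that merely pauses mid-sentence stays in the buffer, and only a run of silent chunks triggers the submit, which matches the "combine chunks, Generate on the last one" behaviour described above.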
> 2. I see you have tested on Windows, so I would need to test Linux? and if I can find someone who has a mac, I can get them to test.

All modules are cross-platform and should work on Linux/Mac.

> 4. Would you just be able to explain this a little bit more for me: "While it works fine with auto enter key workaround, I did not find way to Generate JS in Gradio streaming."

Yes. The default whisper_stt uses JS to click Generate via Gradio:

    audio.stop_recording(
        auto_transcribe, [audio, auto_submit, whipser_model, whipser_language], [shared.gradio['textbox'], audio]).then(
        None, auto_submit, None, js="(check) => {if (check) { document.getElementById('Generate').click() }}")

With `streaming=True`, I couldn’t use `auto_submit`:

    None, auto_submit, None, _js="(False) => { console.log('Check:', check); if (check) { document.getElementById('Generate').click(); }}");

so for `check` I simply disabled it (`False`). Instead I opted for the `keyboard` module, a workaround that does not work with a `share=True` shared URL on other devices like phones, or if you just click away from the chat field.
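Note that the failing snippet is a JavaScript scoping problem: the callback names its parameter `False` but reads `check`, which is therefore undefined. A pure-Python stand-in for the corrected callback logic, where `generate_click` is an illustrative placeholder for `document.getElementById('Generate').click()`:

```python
clicked = []

def generate_click():
    # Placeholder for document.getElementById('Generate').click()
    clicked.append(True)

def auto_submit_callback(check):
    # Mirrors a fixed js="(check) => { if (check) { ... } }" callback:
    # click Generate only when the auto-submit checkbox is ticked.
    if check:
        generate_click()
    return check

auto_submit_callback(False)  # checkbox off: nothing happens
auto_submit_callback(True)   # checkbox on: Generate is "clicked" once
```

The same rename (parameter `False` to `check`) would make the JS version above receive the actual checkbox value instead of an undefined name.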

erew123 commented 2 months ago

Hi @Katehuuh

Thanks for the reply. What I'm going to do is put a link to this in the Feature Requests. I'm so deep into working on v2 of AllTalk that I think it's something I will try to put in there, as I'm hoping to have a beta out soon.

Feature requests

I may well get back to you if I get stuck somewhere along the lines.

Thanks