Thanks mate, you just made my day! Can I support in any way with this? I'm new to these things, but I made your modifications locally and now I can stream the audio into AWS Transcribe and save precious response seconds on my AI voice agent!
> Can I support in any way with this?
Just looking for a maintainer's attention at this point, I think!
cc @Uberi 🔔
Please take a look, and let me know if you have any questions or feedback! I've been running this on Ubuntu, macOS, and Raspbian for the last couple of weeks and would love to get off the fork!
Hey @clusterfudge!
Looks great, thanks for the detailed writeup and testing!
Adds support for receiving captured audio one chunk at a time, while continuing to use the existing wakeword and audio-energy detection code.
Notably, Coqui.ai/DeepSpeech (the Python STT package) supports a streaming interface, which greatly improves interaction latency for continuous-listening applications. Even for non-streaming interfaces, this implementation allows eager encoding (for example, converting to numpy buffers, or even precomputing transformer KVs), or simply an earlier start to transmission (when using websockets or other chunked transfer mechanisms).
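To make the streaming use case concrete, here is a minimal sketch of feeding chunked audio into DeepSpeech's streaming decoder. The `listen_stream()` generator is a hypothetical stand-in for this PR's chunked-capture API (the real name and signature may differ); the DeepSpeech calls (`createStream`, `feedAudioContent`, `intermediateDecode`, `finishStream`) are that package's actual streaming interface.

```python
import numpy as np
import speech_recognition as sr
from deepspeech import Model  # the Coqui STT Model is API-compatible

model = Model("deepspeech-0.9.3-models.pbmm")
recognizer = sr.Recognizer()

with sr.Microphone(sample_rate=16000) as source:
    recognizer.adjust_for_ambient_noise(source)
    stream = model.createStream()
    # Hypothetical generator: yields raw PCM chunks as they are captured,
    # still gated by the existing energy-threshold speech detection.
    for chunk in recognizer.listen_stream(source):
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
        print("partial:", stream.intermediateDecode())
    print("final:", stream.finishStream())
```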
Note: this is a minimal extraction from a larger edit in a side project. There, I ended up carving up large chunks of the recognizer to make it more observable (e.g., triggering events on speech-detection start/stop in addition to yielding audio, plus real-time events for the audio-energy threshold and its detected value). This is a much smaller edit, but I have not vetted it as thoroughly. I am in the process of adopting this change directly into a new project that uses self-hosted Whisper over HTTP.
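For the Whisper-over-HTTP case, the same chunked capture lets transmission start before the utterance ends. A rough sketch, again assuming the hypothetical `listen_stream()` generator and a made-up local `/transcribe` endpoint; `requests` sends a generator body with `Transfer-Encoding: chunked`, so upload begins with the first captured chunk:

```python
import requests
import speech_recognition as sr

recognizer = sr.Recognizer()

def audio_chunks(source):
    # Hypothetical chunked-capture API from this PR; each chunk is
    # uploaded as soon as it is yielded.
    for chunk in recognizer.listen_stream(source):
        yield chunk

with sr.Microphone(sample_rate=16000) as source:
    recognizer.adjust_for_ambient_noise(source)
    resp = requests.post("http://localhost:9000/transcribe",
                         data=audio_chunks(source))
    print(resp.json())
```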