QuantiusBenignus / blurt

Gnome shell extension for accurate speech to text input in Linux using whisper.cpp. Input text from speech anywhere.
https://extensions.gnome.org/extension/6742/blurt/
GNU General Public License v3.0

Feature request: Wyoming API #4

Closed: ser closed this issue 9 months ago

ser commented 9 months ago

It would be cool if the extension could communicate with a local Faster Whisper instance via the Wyoming protocol API:

https://github.com/rhasspy/wyoming-faster-whisper

The advantage is that voice recognition could work on cheap GNOME clients, with one more capable machine on the local network doing the transcription.

QuantiusBenignus commented 9 months ago

Hi @ser, thanks for the suggestion. The idea is nice, but it seems to be somewhat of an edge use case. Hardware like the Raspberry Pi can run GNOME but is often run headless, with no mouse, keyboard, etc. And while I have a mic hat on mine, these boards typically do not have a built-in microphone. I looked at the protocol, and while I understand why it is being used, I think there are other, leaner options for blasting audio data over a LAN to a (server or not) instance of [faster]whisper[.cpp] and then getting the text result back. Maybe RTP or another low-latency, lightweight approach.

Since the idea behind this very simple extension is for it to remain simple, I would rather not add features that, IMHO, will see limited use. Still, I will keep this issue open for some time and put some thought into a possible lightweight solution.

Actually, for this use case, I would recommend starting from something like cliblurt, which uses minimal resources (the GUI is optional) and is not GNOME-only (it should work under XFCE4, for example).

ser commented 9 months ago

It is not only Pis: 90% of my computers, older PCs and laptops, cannot handle speech recognition in a sensible time. In other words, do you plan to add an API, or have you decided to keep everything local?

BTW, this local stack is very complex, to be honest. Using a client-server architecture would simplify things a lot, even on the same machine.

QuantiusBenignus commented 9 months ago

Valid points. It may not be such an edge case after all. I am going to create an option to choose between a local whisper.cpp and sending the audio data to a server for transcription. This will be a call to a whisper.cpp server simply because the data-transfer format is simpler.

If you would like, you can then use that as a base to craft an appropriate "multipart/form-data" curl request to conform to the Wyoming protocol and call the referenced faster-whisper server.

Setting up this little hack will likely remain complex, since it is not a monolithic app but rather uses the built-in tools and flexibility of the Linux system. An installation script will help automate things a bit; we will see.

QuantiusBenignus commented 9 months ago

Hi @ser, the extension can now be set up to transcribe over the network using a whisper.cpp server. Please see here for details.

Talking to a faster-whisper server should be possible to implement in a similar fashion. With a lot more work, this can of course be all written in GJS to work from GNOME shell, but it will waste a lot more CPU cycles and memory. The command line shell remains unbeatable for speed and flexibility.
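To illustrate the kind of network call discussed above, here is a minimal Python sketch of a "multipart/form-data" upload to a whisper.cpp server. It is not the extension's actual code (which, per the thread, relies on command-line tools). The URL, the `/inference` path, the `file`, `temperature` and `response_format` field names, and the `text` key in the JSON reply are assumptions based on the whisper.cpp server example and may differ in your build. The helper names `build_multipart` and `transcribe` are hypothetical.

```python
import json
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Encode text fields plus one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    # The audio file goes in its own part, followed by the closing boundary.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\n'
         "Content-Type: audio/wav\r\n\r\n").encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(wav_path, url="http://127.0.0.1:8080/inference", timeout=60):
    """POST a WAV file to an (assumed) whisper.cpp server and return the text."""
    with open(wav_path, "rb") as fh:
        audio = fh.read()
    body, content_type = build_multipart(
        {"temperature": "0.0", "response_format": "json"},  # assumed field names
        "file", "speech.wav", audio,
    )
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        # Assumes the server replies with a JSON object containing "text".
        return json.loads(resp.read().decode("utf-8"))["text"]
```

On a thin client, something like `transcribe("/tmp/recording.wav", url="http://server.lan:8080/inference")` would send the recording to the more capable machine; the host name and file path here are placeholders.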

ser commented 9 months ago

Fantastic! I am now investigating how many resources a whisper.cpp server would take in addition to my current Faster Whisper setup.

ser commented 9 months ago

So in the end I decided to write a Wyoming server that also uses the Whisper API, to avoid having to run two STT services: https://github.com/ser/wyoming-whisper-api-client
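For context on what a Wyoming bridge has to handle, here is a rough Python sketch of the event framing. It is based on my reading of the Wyoming project README and is not taken from the repository linked above: each event is assumed to be a single-line JSON header (`type`, optional `data`, `data_length`, `payload_length`) terminated by a newline, followed by optional extra JSON data and a raw binary payload, sent over a plain stream rather than as an HTTP form upload. The event names `audio-start`, `audio-chunk` and `audio-stop`, and the `rate`/`width`/`channels` fields, are likewise assumptions; the function names are hypothetical.

```python
import io
import json


def write_event(stream, event_type, data=None, payload=b""):
    """Write one event: a JSON header line, then the raw payload (if any)."""
    header = {"type": event_type}
    if data:
        header["data"] = data
    if payload:
        header["payload_length"] = len(payload)
    stream.write(json.dumps(header).encode("utf-8") + b"\n")
    if payload:
        stream.write(payload)


def read_event(stream):
    """Read one event; return (type, data, payload) or None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    header = json.loads(line)
    data = header.get("data") or {}
    extra_len = header.get("data_length") or 0
    if extra_len:
        # Optional additional JSON data that follows the header line.
        data.update(json.loads(stream.read(extra_len)))
    payload = stream.read(header.get("payload_length") or 0)
    return header["type"], data, payload


def frame_utterance(pcm, rate=16000, width=2, channels=1):
    """Frame raw PCM bytes as a start/chunk/stop event sequence."""
    fmt = {"rate": rate, "width": width, "channels": channels}
    buf = io.BytesIO()
    write_event(buf, "audio-start", fmt)
    write_event(buf, "audio-chunk", fmt, payload=pcm)
    write_event(buf, "audio-stop")
    return buf.getvalue()
```

With a real server, the same functions could be pointed at a socket file object (for example `sock.makefile("rwb")`) instead of an in-memory buffer. A bridge of the kind described above would then forward the collected audio to an HTTP transcription endpoint and send the resulting text back as a reply event.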