alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

Support opus in websocket #181

Closed StuartIanNaylor closed 2 years ago

StuartIanNaylor commented 2 years ago

WebSockets are a pretty good carrier for a mic stream, as it's easy to distinguish binary audio data from any text meta/control data; it's wonderfully simple for that. One thing, though, that would be a relatively easy bolt-on after chunking the audio is to encode and decode with a compression codec. Opus comes to mind, perhaps with a parameter to say whether raw PCM or encoded audio is being sent. If you do employ distributed mics, raw audio bandwidth adds up pretty quickly.
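To make the point above concrete, here is a minimal sketch in Python of the three ideas in that comment: chunking PCM into frames, telling binary audio apart from text control messages, and estimating raw bandwidth. The sample rate, sample width, frame length, Opus bitrate and the `{"codec": "opus"}` control message are all assumptions chosen for illustration, not anything defined by vosk-server. The sketch also assumes the WebSocket library hands binary frames over as `bytes` and text frames as `str`, which is what Python WebSocket libraries commonly do.

```python
import json

SAMPLE_RATE = 16000     # Hz, assumed; a common rate for speech models
BYTES_PER_SAMPLE = 2    # 16-bit mono PCM, assumed
FRAME_MS = 20           # assumed chunk length; a typical Opus frame size
OPUS_BITRATE = 24000    # bit/s, assumed target for wideband speech


def chunk_pcm(pcm: bytes, frame_ms: int = FRAME_MS):
    """Yield fixed-size PCM frames; a trailing partial frame is dropped."""
    frame_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def classify(message):
    """Binary frames are audio; text frames are parsed as JSON control data."""
    if isinstance(message, (bytes, bytearray)):
        return ("audio", bytes(message))
    return ("control", json.loads(message))


def raw_bitrate(mics: int = 1) -> int:
    """Raw PCM bitrate in bit/s for the given number of mono mic streams."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * 8 * mics


if __name__ == "__main__":
    mics = 8
    print("raw PCM :", raw_bitrate(mics), "bit/s")
    print("encoded :", OPUS_BITRATE * mics, "bit/s (at the assumed Opus bitrate)")
```

Under these assumptions a single mic sends 256 kbit/s of raw PCM, so eight mics come to roughly 2 Mbit/s, against roughly 0.19 Mbit/s at the assumed encoded bitrate. The encoder itself is deliberately left out; it would sit between `chunk_pcm` and the WebSocket send.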

nshmyrev commented 2 years ago

For mic-like apps we actually recommend the WebRTC server; it is much more responsive and it already supports encoded media (Opus).

StuartIanNaylor commented 2 years ago

Yeah, WebRTC is great and has all the meta channels built in, but it could be a bit 'fat' for microcontroller platforms. The newer ESP32-S3 is actually pretty capable of running a KWS, if you look at what they have done with https://github.com/espressif/esp-box. They do even more there, but for ASR/TTS I think a central server is the better idea. The ESP32-S3 is currently new and really hasn't hit economies of scale, so it's no bother, as the much more capable Pi Zero 2 is my preferred platform. Maybe we might get more support for WebRTC on ESP32, though there it wasn't Opus but AMR-WB, as it already supports an encoder for that. I actually want to run https://github.com/badaix/snapcast or shairplay for audio-out delivery, so that my smart assistant is part of a wireless audio system and interoperable, which runs so well on a Pi. A port of that to the ESP32-S3 might never happen, so it might never be a thing for me anyway.

I am currently looking at developing wireless distributed mics (arrays) that work in a zone system, mirroring how wireless audio would be organised. I've been thinking for a while that distributed mics should be like any HMI (keyboard, screen): agnostic of central servers, with a bridge client/server to pass audio on. So really Vosk will never see the WebSockets on the ESP32, just the server-side connection of the distributed mic/KWS system. I just saw the WebSocket example, noticed it didn't have a codec, which is relatively easy to implement, and thought I would mention it.
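The bridge idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `MicBridge`, its methods, the mic IDs and the zone names are all hypothetical, and the decoder and upstream sender are passed in as plain callables so that no particular codec or server API is implied.

```python
from typing import Callable, Dict


class MicBridge:
    """Hypothetical bridge: takes encoded frames from mics, forwards PCM upstream."""

    def __init__(self,
                 decode: Callable[[bytes], bytes],
                 send_upstream: Callable[[str, bytes], None]):
        self.decode = decode                # e.g. an Opus or AMR-WB decoder (assumed)
        self.send_upstream = send_upstream  # e.g. a send on the server-side connection
        self.zones: Dict[str, str] = {}     # mic id -> zone name

    def register(self, mic_id: str, zone: str) -> None:
        """Assign a mic to a zone."""
        self.zones[mic_id] = zone

    def on_frame(self, mic_id: str, encoded: bytes) -> None:
        """Decode one frame and pass it on, tagged with the mic's zone."""
        zone = self.zones.get(mic_id)
        if zone is None:
            return  # frames from unregistered mics are ignored
        self.send_upstream(zone, self.decode(encoded))


if __name__ == "__main__":
    sent = []
    # Identity "decoder" stands in for a real codec in this sketch.
    bridge = MicBridge(decode=lambda b: b,
                       send_upstream=lambda zone, pcm: sent.append((zone, pcm)))
    bridge.register("mic-1", "kitchen")
    bridge.on_frame("mic-1", b"\x01\x02")
    print(sent)
```

With this split the recognition server only ever sees the bridge's upstream connection, so the codec used between the mics and the bridge can change without touching the server.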