alphacep / vosk-server

WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries
Apache License 2.0

[websocket server] Sample rate impact : 8kHz vs 16kHz vs user-supplied #242

Open GuillaumeV-cemea opened 11 months ago

GuillaumeV-cemea commented 11 months ago

Hi,

I've been using vosk-server, specifically the websocket server with the Dockerfile, for a while now, at a 16 kHz sample rate (I don't remember exactly why, to be honest). I'm looking into developing a web extension to send raw audio data to the websocket server, and I've noticed that most (if not all) of the examples use an 8 kHz sample rate.

Is there any benefit to using 8 kHz instead of 16 kHz (or any other sample rate), as long as I supply Kaldi's model with the correct sample rate, of course?

I'm asking because the websocket server allows runtime configuration of sample_rate (by sending a config message), and from my limited testing this works perfectly fine. For example, asking my browser to downsample the user's mic to 8 kHz and sending that to vosk-server gives me the same result as using whatever my browser's base sample rate is (usually 48 kHz) and sending it directly to vosk-server.
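A minimal sketch of the message sequence described above. It assumes the server accepts a JSON text frame of the form {"config": {"sample_rate": N}} before the audio and {"eof": 1} at the end, as the sample clients in the vosk-server repository appear to do. The helper names and the 8000-byte chunk size are illustrative, not part of any API, and no network connection is opened here.

```python
import json
from typing import Iterator, Union


def build_config_message(sample_rate: int) -> str:
    """JSON text frame that declares the sample rate of the audio that follows."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return json.dumps({"config": {"sample_rate": sample_rate}})


def iter_chunks(pcm: bytes, chunk_bytes: int = 8000) -> Iterator[bytes]:
    """Split raw PCM into fixed-size binary frames (chunk size is an assumption)."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def session_messages(pcm: bytes, sample_rate: int) -> Iterator[Union[str, bytes]]:
    """Yield frames in sending order: config, audio chunks, end-of-stream marker."""
    yield build_config_message(sample_rate)
    yield from iter_chunks(pcm)
    yield json.dumps({"eof": 1})
```

With a websocket client library, each yielded item would be sent as one frame (str as text, bytes as binary), reading the server's reply after every audio chunk.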

So if I could avoid any kind of client-side downsampling (which is difficult because only Chrome does it natively, so I would have to come up with another solution for Firefox) and just send whatever input data I have to vosk-server, it would be much easier.

Cheers,

nshmyrev commented 11 months ago

For in-browser recognition it is much better to use the WebRTC server: it uses the Opus codec and is much more responsive. 8 kHz is significantly less accurate; if the browser records wideband audio, it is recommended to use wideband.

GuillaumeV-cemea commented 11 months ago

I avoided WebRTC because it seemed much more difficult to set up (I'll need a STUN/TURN server, if I understand correctly, and a bunch of port forwarding), and websockets seemed much easier (since I'm already using them in production).

What's the benefit of the Opus codec? From what I understand, the WebRTC server will have to convert it to WAV before sending it to Kaldi (since Kaldi only works on WAV), so from a quality point of view it shouldn't be different.

What do you mean by much more responsive? The delay between the user talking and the actual voice recognition?

"8khz significantly less accurate, if browser records wideband audio it is recommended to use wideband."

If I understand correctly, it's better to record at a high (as high as possible) sampling rate in the browser and send that directly to vosk-server, rather than downsampling it to 16 kHz (or whatever you chose for vosk-server) before sending it?

Does the same apply to "regular" audio files? For example, right now I'm parsing many kinds of audio files (different formats, different sources) and using ffmpeg to convert them to 16 kHz WAV audio before sending them to vosk-server. Would I benefit from converting them to WAV audio at a higher sampling rate (let's say 44.1 kHz or 48 kHz) and sending that to vosk-server?

nshmyrev commented 11 months ago

Opus compresses the data, so instead of sending 1 kB of WAV you send 100 bytes of Opus. It also works over UDP, so it doesn't wait for a packet round trip. Over TCP, if network ping latency is 100 ms, you will have a 200 ms packet round-trip delay.
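To put rough numbers on the size difference mentioned above, here is a small arithmetic sketch. It assumes 16-bit mono PCM for the uncompressed stream; the 24 kbit/s Opus bitrate is a hypothetical speech setting chosen for illustration, not a value taken from vosk-server.

```python
def pcm_bytes_per_second(sample_rate: int, sample_width: int = 2, channels: int = 1) -> int:
    """Uncompressed PCM data rate: samples/s * bytes per sample * channels."""
    return sample_rate * sample_width * channels


def compression_ratio(sample_rate: int, codec_bits_per_second: int) -> float:
    """How many times smaller the encoded stream is than 16-bit mono PCM."""
    pcm_bits_per_second = pcm_bytes_per_second(sample_rate) * 8
    return pcm_bits_per_second / codec_bits_per_second


ASSUMED_OPUS_BITRATE = 24_000  # hypothetical bitrate in bits per second

for rate in (8000, 16000, 48000):
    raw = pcm_bytes_per_second(rate)
    ratio = compression_ratio(rate, ASSUMED_OPUS_BITRATE)
    print(f"{rate} Hz: {raw} bytes/s raw PCM, about {ratio:.1f}x larger than the assumed Opus stream")
```

Under these assumptions, 16 kHz PCM is 32000 bytes per second, roughly ten times the assumed Opus stream, which is in line with the 1 kB versus 100 bytes figure quoted above.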

Opus decoding is done within the Vosk server; you can check the code.

You don't need STUN if your server is public. Many services use WebRTC, like Zoom and others.

Vosk models are 16 kHz, so you won't benefit from converting to a 48 kHz sampling rate. In the future we might release 48 kHz models; then it will be better to send 48 kHz.
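For the file-conversion case discussed in this thread, a sketch of decoding arbitrary audio files to 16 kHz mono 16-bit PCM with ffmpeg could look like the following. The function names are hypothetical, and it assumes an `ffmpeg` binary on the PATH; the flags used (`-ar`, `-ac`, `-f s16le`, `-` for stdout) are standard ffmpeg options.

```python
import subprocess


def ffmpeg_pcm_args(path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that writes raw mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "quiet",
        "-i", path,               # any input format ffmpeg can read
        "-ar", str(sample_rate),  # resample to the model's rate
        "-ac", "1",               # downmix to mono
        "-f", "s16le", "-",       # headerless signed 16-bit little-endian on stdout
    ]


def decode_to_pcm(path: str, sample_rate: int = 16000) -> bytes:
    """Run ffmpeg and return the decoded PCM bytes (raises if ffmpeg fails)."""
    result = subprocess.run(
        ffmpeg_pcm_args(path, sample_rate), capture_output=True, check=True
    )
    return result.stdout
```

The returned bytes can be sent to the server in chunks with sample_rate set to the same value that was passed to ffmpeg.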
