alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License
1.07k stars · 342 forks

EoS (End of speech) handling #205

Open perfectlegato opened 4 years ago

perfectlegato commented 4 years ago

We currently deploy our own customized ASR that recognizes single-word vocabularies for our client.

On the client side we have deployed our own SDK along with webrtc-vad (from Chromium) to detect EoS and send it over to our server (so we have client-side VAD rather than server-side VAD).
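For context, a stripped-down sketch of what our client-side EoS detection looks like (simplified; the frame source, thresholds, and helper name here are just illustrative, and it uses the py-webrtcvad bindings rather than our SDK):

```python
# Minimal sketch of client-side EoS detection with py-webrtcvad.
# Assumes 16 kHz, 16-bit mono PCM and 30 ms frames; the frame source
# and the trailing-silence threshold are placeholders.
import webrtcvad

SAMPLE_RATE = 16000        # webrtcvad accepts 8000/16000/32000/48000 Hz
FRAME_MS = 30              # frames must be 10, 20 or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples
TRAILING_SILENCE_MS = 500  # how much silence we treat as end of speech

def wait_for_eos(frames, aggressiveness=3):
    """Consume raw PCM frames and return True once trailing silence exceeds the threshold."""
    vad = webrtcvad.Vad(aggressiveness)    # 0 = least aggressive, 3 = most
    silence_ms = 0
    seen_speech = False
    for frame in frames:                   # each frame is FRAME_BYTES of raw PCM
        if vad.is_speech(frame, SAMPLE_RATE):
            seen_speech = True
            silence_ms = 0
        elif seen_speech:
            silence_ms += FRAME_MS
            if silence_ms >= TRAILING_SILENCE_MS:
                return True                # end of speech: tell the server
    return False                           # stream ended without a clear EoS
```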

But this VAD is really not stable enough for production, so its performance is quite inconsistent.

In the case where it doesn't detect an EoS, the socket connection obviously keeps running and eventually times out.

But if you look at the partial results, the ASR is actually decoding correctly with very high accuracy; it just never receives the EoS, so it can't complete decoding and send the final results.

Would it be better to:

1) Just treat the missing EoS as an error type and return that to the client? or

2) Put in an artificial EoS timer, so that if the server side doesn't detect any EoS after a certain period (say 3 secs), we tag an EoS automatically (a rough sketch is below). This would, however, require some changes to the client's Android app, because the microphone and listener are sometimes not closed properly.
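For option 2, what I have in mind is roughly the following on the server side (just a sketch; the handle_eos hook stands in for whatever the worker already does when a real EoS arrives):

```python
# Rough sketch of option 2: a watchdog that injects an artificial EoS
# if no real EoS arrives within EOS_TIMEOUT seconds of the last audio chunk.
import threading

EOS_TIMEOUT = 3.0  # seconds without new audio before we force an EoS

class EosWatchdog:
    def __init__(self, handle_eos):
        self._handle_eos = handle_eos   # callback that finalizes decoding
        self._timer = None
        self._lock = threading.Lock()

    def audio_received(self):
        """Call on every incoming audio chunk to push the deadline forward."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(EOS_TIMEOUT, self._fire)
            self._timer.daemon = True
            self._timer.start()

    def eos_received(self):
        """Call when a real EoS arrives so the watchdog doesn't fire."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None

    def _fire(self):
        self._handle_eos()  # treat the timeout as an EoS
```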

Note: We are not using Kaldi's endpoint detection because we have found that it doesn't work well on short utterances (single-word vocabulary, e.g. "apple", "brown", "banana", etc.)

Would love to hear others' opinions on handling EoS. Thank you in advance

luvwinnie commented 4 years ago

Hello, did you figure out how to handle the decoder? I have the same problem: the client is already silent according to the VAD, but the decoder keeps waiting for streaming from the client, so it takes time to get the final_result. Do you know any way to make sure the decoder produces a final result instantly?

perfectlegato commented 4 years ago

Yeah, so the decoder only produces the final result when it receives the EoS.

If your VAD is not functioning well, then you have to do it artificially.

You can implement a manual fallback conditioned on a timer, based on how long you expect the user to speak. For example, if you don't generate any EoS within 5 seconds, you send one to the decoder yourself so you can receive the final results.

So this depends entirely on your use case scenario
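Concretely, something like this on the client (just a sketch, assuming the websocket-client package, the server's usual convention of finalizing when it gets an "EOS" text frame, and results carrying a final flag; read_frame and vad_says_eos are placeholders for your audio source and VAD):

```python
# Stream audio to the server and, if the VAD never fires, force an EoS
# after a fixed deadline so the decoder returns the final result.
import json
import time
import websocket   # pip install websocket-client

URL = "ws://localhost:8888/client/ws/speech"   # adjust to your deployment
MAX_UTTERANCE_SECONDS = 5.0                    # fallback deadline

def stream_with_fallback_eos(read_frame, vad_says_eos):
    """Stream PCM frames; send "EOS" when the VAD fires or the deadline passes."""
    ws = websocket.create_connection(URL)
    started = time.time()
    try:
        while True:
            frame = read_frame()               # raw PCM chunk, or None when done
            if frame is None or vad_says_eos():
                break
            ws.send_binary(frame)
            if time.time() - started > MAX_UTTERANCE_SECONDS:
                break                          # no EoS from the VAD: force one
        ws.send("EOS")                         # ask the decoder to finalize
        # Simplified: keep reading until the server marks a result as final.
        while True:
            msg = json.loads(ws.recv())
            if msg.get("result", {}).get("final"):
                return msg
    finally:
        ws.close()
```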


luvwinnie commented 4 years ago

@lonelyspoon thank you for the reply!! I use webrtc to detect speech, and on the decoder side, if I receive about 2 seconds of silence I send an EoS to end the decoder and get the result. But here I have another problem: if I don't speak at all, my VAD keeps sending silence. Shouldn't the decoder output be silence too, i.e. a silence token like "<sp>"? Currently, although I keep sending silence to the decoder, it sometimes decodes a filler word or something similar.

I would like the final result of my decoder to be "<sp>" when only silence is fed to it. Is there any way to do that?