evancohen / sonus

:speech_balloon: /so.nus/ STT (speech to text) for Node with offline hotword detection
MIT License

[Question] How to speed up sonus? #77

Closed: abeulich closed this issue 6 years ago

abeulich commented 6 years ago

Hi, first of all, I'm very happy with how sonus works, but I'd like to make it a bit faster if possible. Thinking about it now, I'm not sure whether the "delay" I'm noticing actually comes from the processing time in the cloud speech recognition service.

Anyway, is there a way to make sonus consider an utterance finished more quickly? In my use case I say the hotword, it gets recognized, I play a confirmation sound, and after that I say a simple voice command for my home automation.

I have a feeling sonus waits a bit too long before it considers my voice command finished. Can this timeout be shortened?

Thanks in advance, Alex

evancohen commented 6 years ago

Unfortunately this isn't something that's controlled by Sonus and is dependent on which cloud speech recognizer you use. In the case of Google Cloud Speech, we wait for the isFinal flag before sending final results.

One really hacky thing you could try (which I wouldn't really recommend) is attaching a timer to your partial results: if the partial results haven't changed for some interval (and the recognized words are above a confidence threshold), you process those results.
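A minimal sketch of that timer idea, for illustration only. It assumes the object you listen on emits `partial-result` and `final-result` events whose payload is a transcript string; the helper name `watchPartialResults`, the 700 ms default, and the injectable `timers` argument are all made up here. The confidence check is left out because this sketch only sees the transcript text.

```javascript
const { EventEmitter } = require('events')

// Calls onStable(transcript) once the latest partial result has stayed
// unchanged for `stableMs` milliseconds. `timers` is injectable so the
// logic can be exercised without real delays (an assumption of this sketch).
function watchPartialResults (emitter, onStable, stableMs = 700, timers = { setTimeout, clearTimeout }) {
  let last = null
  let handle = null

  emitter.on('partial-result', (transcript) => {
    if (transcript === last) return // unchanged: let the running timer continue
    last = transcript
    if (handle !== null) timers.clearTimeout(handle)
    handle = timers.setTimeout(() => {
      handle = null
      onStable(last)
    }, stableMs)
  })

  emitter.on('final-result', () => {
    // The recognizer finished on its own, so cancel the pending shortcut.
    if (handle !== null) timers.clearTimeout(handle)
    handle = null
    last = null
  })
}
```

If you act on a stable partial result, remember that the final result for the same utterance will usually still arrive afterwards, so the calling code has to avoid running the command twice.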

Another option would be to have a "stop" word that once recognized could fire whatever interim results you have at the time - but again, this feels hacky to me.
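The stop-word idea could look roughly like the sketch below. The function name `commandBeforeStopWord` and the default stop word `over` are hypothetical, and it assumes partial results arrive as plain transcript strings without trailing punctuation.

```javascript
// Returns the command spoken before the stop word, or null if the latest
// word of the partial transcript is not the stop word (case-insensitive).
function commandBeforeStopWord (partial, stopWord = 'over') {
  const words = partial.trim().split(/\s+/).filter(Boolean)
  if (words.length === 0) return null
  const lastWord = words[words.length - 1].toLowerCase()
  if (lastWord !== stopWord.toLowerCase()) return null
  return words.slice(0, -1).join(' ')
}
```

Only the last word is checked, so a stop word that also occurs inside a command (for example "move over there") does not cut the command short while you are still speaking.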

You might be better off asking this question on the Google Cloud Speech repo.

abeulich commented 6 years ago

Hi Evan, many thanks for taking the time to comment on this. I understand it's very likely to be the processing of the captured audio and waiting for the recognition results that makes it feel a bit slow to me.

My understanding of the process (without partial results) is the following though:

  1. sonus waits for the hot word to be recognized offline by snowboy
  2. sonus records audio until it thinks I'm done speaking
  3. sonus sends the recorded audio to the voice cloud service
  4. result comes back
  5. sonus delivers final-result

I was hoping to save some time when my voice command gets recorded by sonus (Step 2). I don't understand how sonus detects that I'm done speaking. I guess it detects pauses and when the pause is long enough it's considered to be the end of what I wanted to say?

Is this assumption correct and is there a place where I could make sonus more "aggressive" when deciding to end the recording?

Many thanks again, Alex

abeulich commented 6 years ago

Just played with the partial-results and realized they are basically coming in while I'm still speaking. So I guess audio already gets sent to the cloud service while I'm speaking (?) and I underestimated how well sonus is optimized. :)

In my code I'm also looking at partial-results now (I was omitting them before) and for most voice commands it makes it faster than waiting for the final-result.
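For anyone reading along, matching partial results against known commands might be sketched like this. The command table, the action names, and the helpers `matchCommand` and `createHandler` are invented for illustration; the only assumption about sonus is that partial and final results arrive as transcript strings.

```javascript
// Hypothetical command table for a home-automation setup.
const COMMANDS = [
  { phrase: 'turn on the light', action: 'light.on' },
  { phrase: 'turn off the light', action: 'light.off' }
]

// Lowercases, strips punctuation, and collapses whitespace.
function normalize (text) {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, ' ').replace(/\s+/g, ' ').trim()
}

// Returns the action of the first command whose phrase occurs in the
// transcript, or null if nothing matches yet.
function matchCommand (transcript, commands = COMMANDS) {
  const heard = normalize(transcript)
  const hit = commands.find(c => heard.includes(normalize(c.phrase)))
  return hit ? hit.action : null
}

// Runs at most one action per utterance: the first matching partial result
// wins, and the final result only acts as a fallback before resetting.
function createHandler (run) {
  let handled = false
  return {
    onPartial (transcript) {
      if (handled) return
      const action = matchCommand(transcript)
      if (action) { handled = true; run(action) }
    },
    onFinal (transcript) {
      if (!handled) {
        const action = matchCommand(transcript)
        if (action) run(action)
      }
      handled = false // ready for the next utterance
    }
  }
}
```

The two handler methods would be registered as the listeners for the partial and final result events. Acting on the first match is only safe when no command phrase is a prefix of a longer one; otherwise the shorter command fires before the longer one has been spoken.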

Still looking forward to any further comments, but you can also close this "issue" if you want.

evancohen commented 6 years ago

Correct, after a hotword is detected your audio is streamed to the cloud service - but only until the cloud service detects the end of the utterance (at which point the final results event is fired and audio stops being streamed).

Beyond improving the speed of the cloud speech recognizer, there's not much that can be done without sacrificing accuracy.

abeulich commented 6 years ago

Thanks again. I'll keep trying to improve the code that processes the results. It's faster already since I started matching partial results against what I'm looking for. :)