evancohen / sonus

:speech_balloon: /so.nus/ STT (speech to text) for Node with offline hotword detection
MIT License
627 stars 79 forks

Sonus in the browser #28

Open evancohen opened 7 years ago

evancohen commented 7 years ago

Continuing our discussion from TalAter/annyang#100

ghost commented 7 years ago

@evancohen I had some issues with Sonus. Recognizing the "Jarvis" word was a bit of a hassle. Sometimes it did, sometimes it didn't, and sometimes it recognized the word without anyone in the house saying anything (that was a funny moment: my girlfriend got scared of my little project when it started talking without being asked anything).

My setup is an RPi with Chrome open (for SpeechRecognition, which effectively gives me unlimited cloud speech API access). Sonus was on the same RPi, but running directly on Linux. I got hotword detection working and then started SpeechRecognition in the browser. The catch is that Chrome and Sonus cannot record at the same time, so I set up a small WebSocket connection between them. When Sonus detected a hotword (Sonus stopped after detecting), it sent a message over the WebSocket, so the browser knew that a hotword had been detected and started SpeechRecognition. After the speech stopped and the commands were processed, the browser started Sonus again via the WebSocket. I think this is a really quick and dirty setup for "Sonus in the browser".
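The handoff described above can be sketched as a small state machine that tracks which side currently owns the microphone. This is only an illustration, not the code from the original setup: the message names `hotword` and `resume` and the `HandoffCoordinator` class are hypothetical, and the WebSocket transport is left out so the sketch stays self-contained.

```typescript
// Hypothetical message names; the original post does not specify a protocol.
type HandoffMessage = { type: "hotword" } | { type: "resume" };

// Which side is currently allowed to record from the microphone.
type MicOwner = "detector" | "browser";

class HandoffCoordinator {
  private owner: MicOwner = "detector";
  readonly log: string[] = [];

  get micOwner(): MicOwner {
    return this.owner;
  }

  // Apply one message. Returns false (and changes nothing) when the
  // message does not fit the current state, e.g. a duplicate hotword.
  handle(msg: HandoffMessage): boolean {
    if (msg.type === "hotword" && this.owner === "detector") {
      // The detector has stopped itself; the browser may start recognition.
      this.owner = "browser";
      this.log.push("detector stopped, browser recognition started");
      return true;
    }
    if (msg.type === "resume" && this.owner === "browser") {
      // The browser finished processing the command; restart the detector.
      this.owner = "detector";
      this.log.push("browser recognition finished, detector restarted");
      return true;
    }
    return false;
  }
}

// One full cycle, including a duplicate hotword message that is ignored.
const coordinator = new HandoffCoordinator();
coordinator.handle({ type: "hotword" });
const duplicateAccepted = coordinator.handle({ type: "hotword" });
coordinator.handle({ type: "resume" });
```

In a real setup each `handle` call would be triggered by a message arriving on the WebSocket, and each transition would start or stop the actual recorder on that side. Ignoring messages that do not match the current state keeps both sides from recording at once.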

In my opinion, Sonus is OK, but speech processing still needs to be done in the cloud for the moment.

evancohen commented 7 years ago

I like that approach, a few suggestions for you:

First, you can record multiple audio streams on the Pi with dsnoop - if I were you I would just use snowboy directly because you are already doing your streaming recognition in the browser, no reason to use Sonus in your scenario (although I wish I could get Sonus to the point that you could).

Second, if you are getting false positives I recommend playing with the recognition sensitivity. Also, short activation phrases tend to be more prone to false positives, so you could also try something like "hey Jarvis".

Anyhow... Getting truly "free" speech recognition is tricky - unless you are using Chrome (which you are) you're not getting it. A few other approaches I've taken in the past:

At the end of the day, if you're super dedicated to it being free you end up having to do a little extra legwork.

What I'd like to see is a snowboy keyword spotter that will run in the browser. Then I could write a simple wrapper to make it and webkitSpeechRecognition work well together. The ball is in the Kitt.ai court right now, let's see what they say.

In the meantime I'm going to see if I can write a keyword spotter that will work in the browser, then use webkitSpeechRecognition for streaming recognition. That way it'll be easy to drop in snowboy if/when they choose to provide browser support.

Nixellion commented 7 years ago

@evancohen You suggested that I use JsSpeechRecognizer, and now I see that I made a huge mistake in disregarding it; I assumed it was also somehow related to pocketsphinx. But now I've come back to it and see that it has just what I need: a keyword that you can train yourself, without any real recognition and phoneme stuff, which isn't really needed for the simple task of recognizing just one or even a few keywords.

Will see how that works. Hopefully I'll be able to switch between it and Chrome's speech recognition.

Thanks!

timaschew commented 6 years ago

@ghost I had the same idea. Have you changed your implementation in the meantime?

Do you have any code snippets that you would like to share?

timaschew commented 6 years ago

@evancohen

> First, you can record multiple audio streams on the Pi with dsnoop - if I were you I would just use snowboy directly because you are already doing your streaming recognition in the browser, no reason to use Sonus in your scenario (although I wish I could get Sonus to the point that you could).

But isn't there a chance that you lose part of the audio?

Let's assume this is a timeline along the x-axis:

speech:    random words until the hotword is triggered snowboy, what is a false friend
snowboy:   ++++++++++++ listening ++++++++++++++++++++++++++++++++++++++++++++X
browser:   ------------- waiting ---------------------------------------------++++++++

If snowboy needs some time (X) to realize that the hotword was spoken and the user keeps speaking (maybe very fast), then the browser will start listening too late and would only get "friend" in this case instead of "what is a false friend".

evancohen commented 6 years ago

@timaschew I've got an experimental implementation that uses a ring buffer for audio on the audio-buffer branch to address that issue. I'm assuming you could create a similar implementation in the browser.
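For illustration, a ring buffer of this kind could look roughly like the sketch below. This is not the code from the audio-buffer branch: the `AudioRingBuffer` class, its method names, and the tiny capacity used in the example are all assumptions chosen to keep the idea visible. The buffer always holds the most recent samples, so when the hotword fires, the audio spoken during the detection delay can be handed to the recognizer before the live stream.

```typescript
// Fixed-capacity ring buffer that keeps only the newest `capacity` samples.
// Hypothetical helper; names and API are assumptions for this sketch.
class AudioRingBuffer {
  private readonly data: Float32Array;
  private writeIndex = 0; // next slot to overwrite
  private length = 0;     // number of valid samples currently stored

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.data = new Float32Array(capacity);
  }

  // Append a chunk of samples, overwriting the oldest ones when full.
  push(chunk: ArrayLike<number>): void {
    for (let i = 0; i < chunk.length; i++) {
      this.data[this.writeIndex] = chunk[i];
      this.writeIndex = (this.writeIndex + 1) % this.data.length;
    }
    this.length = Math.min(this.length + chunk.length, this.data.length);
  }

  // Copy out the buffered samples in chronological order (oldest first).
  snapshot(): Float32Array {
    const capacity = this.data.length;
    const start = (this.writeIndex - this.length + capacity) % capacity;
    const out = new Float32Array(this.length);
    for (let i = 0; i < this.length; i++) {
      out[i] = this.data[(start + i) % capacity];
    }
    return out;
  }
}

// Tiny capacity so the wrap-around is easy to follow; a real buffer would
// hold a second or two of audio at the microphone's sample rate.
const ring = new AudioRingBuffer(4);
ring.push([1, 2, 3]);
ring.push([4, 5, 6]); // the two oldest samples (1 and 2) are overwritten
const recent = ring.snapshot(); // contains 3, 4, 5, 6
```

On a hotword event, the contents of `snapshot()` would be sent to the recognizer first, followed by the live audio, so that words spoken right after the hotword are not dropped.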

Right now I'm in Cambodia with some rather limited resources, but I'd like to help you get Sonus working. Can you file a separate issue with some repro steps to where you're stuck?

timaschew commented 6 years ago

> I've got an experimental implementation that uses a ring buffer for audio

Ah nice.

> Can you file a separate issue with some repro steps to where you're stuck?

Actually, I didn't get stuck; those were just some concerns I had.

evancohen commented 6 years ago

@timaschew happy to answer any questions (or concerns) you have! I am traveling at the moment, so I can't promise I'll respond instantly, but I will get back to you eventually :smile: