Snowboy not detecting my wake phrase - could be because audio analyzed in chunks

troyibm commented 7 years ago

I have a wake phrase (not a single word but two words like "Hey Jackson") and I find that it is really hit or miss with it. In my testing, I am using the exact recording played from the snowboy web page. I think it is because the detection is done on a buffer size and if the wake phrase audio splits between buffers, then it won't be detected. That's my theory on why I will play the same audio over and over and it won't be detected on the first or second try.

I'm building this app using Swift and found that the example in this repo isn't all that helpful. The https://github.com/grimlockrocks/alexa-swift3-sample-app is a better example since I want to do something with the audio after the wake phrase is detected.

troyibm commented 7 years ago

@grimlockrocks you might have some insight into this problem, or at least be able to tell me if I'm off my rocker about what I think the problem is.

grimlockrocks commented 7 years ago

@troyibm In my Alexa sample app, I am actually using the KITT Swift 3 example: https://github.com/Kitt-AI/snowboy/tree/master/examples/iOS/Swift3/SnowboyTest. If you look at the implementation: https://github.com/Kitt-AI/snowboy/blob/master/examples/iOS/Swift3/SnowboyTest/SnowboyWrapper.mm#L38, it basically takes the audio array and compares it with the trained wake word audio array.

I have never tested a two-word phrase, but here are some thoughts that may help improve the accuracy:

When you set a timer to run wake word detection, set a longer recording period, e.g. 2 or 3 seconds, which will increase the chance that your audio file contains all the keywords.
A better solution I think is to add another buffer array in the above .mm file, keep a rolling window of two buffer arrays, merge them together, and then call snowboy->RunDetection. This will help when "hey" is in the first audio file passed in, while "jackson" is in the second audio file passed in.
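The rolling-window idea above can be sketched roughly as follows. This is only an illustration in Python, not the code from SnowboyWrapper.mm: the names `rolling_windows`, `detect_with_overlap` and `toy_detect` are hypothetical, byte strings stand in for audio samples, and the toy detector simply searches for a byte pattern. Note that if the real detector keeps its own state between calls, merged windows would feed the same audio to it twice.

```python
from typing import Callable, Iterable, Iterator


def rolling_windows(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each chunk merged with the chunk that came right before it."""
    previous = b""
    for chunk in chunks:
        yield previous + chunk
        previous = chunk


def detect_with_overlap(chunks: Iterable[bytes],
                        detect: Callable[[bytes], bool]) -> int:
    """Return the index of the first chunk whose merged window triggers
    `detect`, or -1 if nothing is detected."""
    for index, window in enumerate(rolling_windows(chunks)):
        if detect(window):
            return index
    return -1


# Toy stand-in for a detector (assumption): it looks for a byte pattern
# instead of analysing real audio.
PHRASE = b"hey jackson"


def toy_detect(window: bytes) -> bool:
    return PHRASE in window


# "hey" ends the first chunk and "jackson" starts the second one.
chunks = [b"....hey ", b"jackson.", b"........"]

# No single chunk contains the whole phrase...
found_without_overlap = any(toy_detect(c) for c in chunks)
# ...but the merged window of chunks 0 and 1 does.
found_at = detect_with_overlap(chunks, toy_detect)
```

Here `found_without_overlap` is `False` while `found_at` is `1`, which is the case described above where the phrase is split across two buffers.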

I cannot see the implementation of snowboy->RunDetection; the KITT team may be able to help you understand how the data is analyzed.

Hope this helps!

chenguoguo commented 7 years ago

> I think it is because the detection is done on a buffer size and if the wake phrase audio splits between buffers, then it won't be detected.

That is not the case. Snowboy has an internal buffer which takes care of the hotword context. For end users, as long as you can feed the audio to Snowboy chunk by chunk, that should be enough. Internally we don't distinguish a single hotword or hotword phrases.
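To illustrate what "an internal buffer which takes care of the hotword context" can mean for the caller, here is a toy sketch in Python. It is not Snowboy's implementation: `ToyStreamingDetector` and `run_detection` are hypothetical names, and pattern matching on bytes stands in for real audio analysis. The point is only that a stateful detector can carry context across calls, so the caller just feeds contiguous chunks.

```python
class ToyStreamingDetector:
    """Toy stand-in for a stateful detector (assumption, NOT Snowboy's code)."""

    def __init__(self, phrase: bytes):
        if not phrase:
            raise ValueError("phrase must not be empty")
        self.phrase = phrase
        self._context = b""  # tail of earlier audio kept between calls

    def run_detection(self, chunk: bytes) -> bool:
        """Feed one chunk; return True if the phrase completed in this chunk."""
        data = self._context + chunk
        found = self.phrase in data
        # Keep just enough trailing bytes to complete a phrase that
        # started at the end of this chunk.
        keep = len(self.phrase) - 1
        if found or keep == 0:
            self._context = b""
        else:
            self._context = data[-keep:]
        return found


detector = ToyStreamingDetector(b"hey jackson")

# The phrase is split across the first two chunks; the caller does no merging.
results = [detector.run_detection(c)
           for c in (b"....hey ", b"jackson.", b"........")]
```

In this sketch `results` is `[False, True, False]`: the split phrase is still found, provided the chunks are contiguous and no audio is dropped between calls.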

It's possible that the Swift 3 example doesn't cover some corner cases. Could you also give the Python demo a shot, to check whether the issue relates to the model or to the example?

troyibm commented 7 years ago

Thanks for the reply, Sheng. A colleague of mine suggested a ring buffer (https://en.wikipedia.org/wiki/Circular_buffer), which is what the Python example does. I'm not sure how to implement that in Swift, and I haven't looked for an example yet. While a longer recording period would make it more likely that the entire phrase is captured, it would also add a delay for the user.

chenguoguo commented 7 years ago

Did you compile it under swig/Python? Also make sure you use swig 3.0.10 or above.

troyibm commented 7 years ago

It works every time with the Python version, so it doesn't look like the pmdl file is the problem.

chenguoguo commented 7 years ago

So the problem comes from the Swift3 example. I don't have much experience with Swift3; perhaps someone with more experience can jump in?

troyibm commented 7 years ago

The Swift3 example does the following:

It sets a timer that fires every set number of seconds (I have it set to 2).
When the timer fires, it records to a file for a set number of seconds (I was using 1 second, but Sheng suggested longer, so I am using 2 now).
After the recording is finished, it runs the contents of the file through snowboy->RunDetection. If the result is 1, it stops the process. If the result isn't 1, it does nothing, because the timer will fire again and start the process over.

That doesn't seem right at all.

I think it should always be recording and always be reading from the same buffer (the circular buffer I mentioned earlier).
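A minimal sketch of that always-recording idea, in Python for illustration. The class and method names (`RingBuffer`, `extend`, `get`) are illustrative and not taken from any Snowboy source; in a real app the recording callback would call `extend` while a separate detection loop calls `get` and passes the result to the detector.

```python
import collections


class RingBuffer:
    """Fixed-capacity byte buffer: when full, the oldest bytes are dropped."""

    def __init__(self, size: int):
        self._buf = collections.deque(maxlen=size)

    def extend(self, data: bytes) -> None:
        """Append newly recorded bytes (called by the audio callback)."""
        self._buf.extend(data)

    def get(self) -> bytes:
        """Return everything buffered so far and empty the buffer
        (called by the detection loop)."""
        data = bytes(self._buf)
        self._buf.clear()
        return data


ring = RingBuffer(size=8)

# The recorder keeps writing; nothing is lost as long as the reader
# drains the buffer before it overflows.
ring.extend(b"abcd")
first = ring.get()

# If the reader falls behind, only the newest `size` bytes survive.
ring.extend(b"abcdef")
ring.extend(b"ghij")
second = ring.get()
third = ring.get()
```

Here `first` is `b"abcd"`, `second` is `b"cdefghij"` (the two oldest bytes were overwritten), and `third` is empty. The buffer therefore needs to be large enough to hold the audio that arrives between two reads.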

chenguoguo commented 7 years ago

Hmm, yes, that's not optimal. If you are familiar with Swift3, it would be great if you could improve it by adopting some of the logic from the Python demo.

troyibm commented 7 years ago

I think if I were to go through this exercise, I would start with https://github.com/raywenderlich/swift-algorithm-club/tree/master/Ring%20Buffer

I'm not an expert in this, so it will take me some time, and I might not succeed at all.

@grimlockrocks do you want to take a crack at it? It seems the Swift example would be much improved if it used the same technique as the Python example.

troyibm commented 7 years ago

@chenguoguo does the Java/Android example for snowboy use the same logic as the Python demo?

chenguoguo commented 7 years ago

Yes.

Snowboy not detecting my wake phrase - could be because audio analyzed in chunks #179