cmusphinx / pocketsphinx-android

pocketsphinx build for Android
BSD 2-Clause "Simplified" License

pocketsphinx takes a long time (.8 - 1.1 seconds) to send partial results on a wake word detection #25

Closed BeanStalka closed 7 years ago

BeanStalka commented 7 years ago

I am using pocketsphinx to spot one key phrase. I've reduced the dictionary to the two words that the phrase consists of.

Detection rates are excellent since I have dialed in my thresholds per documentation.

I am using Xamarin, so I have wrapped pocketsphinx in a Bound Library to gain access.

The problem is that when I do get a detection, it takes anywhere from 0.3 seconds (which I think is excellent) to 1.1 seconds (not so good).

I would like to get this time down as far as possible, since I am switching from pocketsphinx to another service for the speech recognition.

I am aware that this time will never be 0, but I was hoping that maybe removing some of the files that are read in during StartListening() would help to reduce this.

Any suggestions are welcome, please see my attached implementation.

PocketSphinxWakeWordEngine.txt

nshmyrev commented 7 years ago

Calling Java from Xamarin is probably not a good idea; I would try to work with pocketsphinx through the interop API instead.

BeanStalka commented 7 years ago

I have had no issues binding other libraries via Bound Libraries.

Do you have any other suggestions? Do I need to load the LM? Are there any files in the assets directory that I could avoid loading, since I am using such a small subset of pocketsphinx's capabilities?

nshmyrev commented 7 years ago

> Do I need to load the LM?

No, it is not needed.

> Are there any files in the assets directory that I could avoid loading, since I am using such a small subset of pocketsphinx's capabilities?

It is unlikely to affect your response time.

BeanStalka commented 7 years ago

Do you have an example of someone using the interop API?

If not, thank you for your help and quick responses.

nshmyrev commented 7 years ago

There was a discussion here:

https://sourceforge.net/p/cmusphinx/discussion/help/thread/fb985d4d/

nshmyrev commented 7 years ago

Also, instead of switching quickly, I would try to just wait for the end of the utterance and then, if the keyphrase is detected, forward the whole chunk (you can get it with getRawdata) to the other service. The switch will still be noticeable to users no matter how fast you switch.

BeanStalka commented 7 years ago

Wow, that is a great idea.

That way, if they say "Keyphrase, please turn on the lights", I would send that whole chunk for analysis.

I am assuming that I could call GetRawdata in the ICMURecognizer.OnEndOfSpeech() hook.

I am using the SpeechRecognizer. Does that expose the Decoder so that I can call GetRawdata()?

Any quick broad stroke example would be appreciated. You are the SME on this so any help you give would be amazing.

BeanStalka commented 7 years ago

FYI - I'm attempting to call GetRawdata, but the short[] is empty. My updates:

1.) OnPartialResult detects the wake word and sets a flag.

2.) OnEndOfSpeech checks the flag and stops the recognizer. I am using .Stop(); should I use .Cancel()?

3.) OnResult checks the flag (_keyWordDetected) and then calls GetRawdata.

Seems like I'm missing something...

PocketSphinxWakeWordEngine.txt
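
For readers following along, the three steps above can be sketched as a small state holder. This is only an illustration under assumptions: `WakeWordFlow` and its method names are hypothetical, it does not touch pocketsphinx at all, and the caller is expected to do the actual stop and raw-data calls based on the returned booleans.

```java
/** Hypothetical helper that tracks the wake-word flag across the three callbacks. */
final class WakeWordFlow {
    private final String keyphrase;
    private boolean keywordDetected;

    WakeWordFlow(String keyphrase) {
        this.keyphrase = keyphrase;
    }

    /** Step 1: set the flag when a partial hypothesis contains the keyphrase. */
    void onPartialResult(String hypothesis) {
        if (hypothesis != null && hypothesis.contains(keyphrase)) {
            keywordDetected = true;
        }
    }

    /** Step 2: returns true when the caller should stop the recognizer. */
    boolean onEndOfSpeech() {
        return keywordDetected;
    }

    /** Step 3: returns true when the caller should fetch the raw audio; resets the flag. */
    boolean onResult() {
        boolean fetchRawData = keywordDetected;
        keywordDetected = false;
        return fetchRawData;
    }
}
```

Resetting the flag in the last step keeps a later utterance without the wake word from being forwarded by accident.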

nshmyrev commented 7 years ago

Call d.setRawdataSize(300000) in decoder setup

BeanStalka commented 7 years ago

AWESOME

That works, I now have a full short array.

I am converting it to a byte array and will send it to BING.

My guesses at the format of the chunk from pocketsphinx:

1.) Sample rate of 8000 Hz

2.) 16-bit PCM format

Are these assumptions correct?

Thank you again for all of your help so far! If this works, it effectively alleviates the issue I was having with the user needing to pause after the wake word.

nshmyrev commented 7 years ago

You are welcome. The default sample rate is 16 kHz.
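
As a rough aid for sizing the buffer, here is a minimal sketch of the arithmetic. It assumes the value passed to setRawdataSize counts 16-bit samples rather than bytes, which this thread does not confirm; `RawdataBudget` and its methods are hypothetical names.

```java
/** Hypothetical helper for estimating how much audio a raw-data buffer holds. */
final class RawdataBudget {
    /** Seconds of audio held by sampleCount samples at the given rate. */
    static double secondsOfAudio(int sampleCount, int sampleRateHz) {
        return (double) sampleCount / sampleRateHz;
    }

    /** Bytes needed once the samples are serialized as 16-bit PCM. */
    static long bytesAs16BitPcm(int sampleCount) {
        return 2L * sampleCount;
    }

    public static void main(String[] args) {
        int samples = 300000; // value passed to setRawdataSize in this thread
        int rateHz = 16000;   // default sample rate stated above
        System.out.println(secondsOfAudio(samples, rateHz) + " s, "
                + bytesAs16BitPcm(samples) + " bytes");
    }
}
```

Under that assumption, 300000 samples at 16 kHz is 18.75 seconds of audio, or 600000 bytes as 16-bit PCM.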

BeanStalka commented 7 years ago

I am having a heck of a time getting this to work.

Bing is expecting 16-bit PCM format with a 16 kHz sample rate.

Would this be what the Decoder supplies? Everything I'm running across in the documentation says that it must be an audio format issue.

UPDATE: It turns out I was not parsing the array correctly when sending it up to Bing.

Please find attached the updated code that is working for me.

I cannot tell you how much I appreciate all of your help.

Большое спасибо (Bal'shoye spaseeba): thank you very much.

I owe you one.

BingSpeechToTextEngine.txt
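
The attachment is not reproduced here, so the exact fix is unknown. For anyone hitting the same problem, one common way to turn a short[] of samples into 16-bit PCM bytes looks like the sketch below. It assumes the receiving service wants little-endian byte order (the usual WAV convention); `PcmBytes` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for serializing 16-bit samples as raw PCM bytes. */
final class PcmBytes {
    /** Writes each sample as two bytes, low byte first (assumed byte order). */
    static byte[] toLittleEndianBytes(short[] samples) {
        ByteBuffer buffer = ByteBuffer.allocate(samples.length * 2)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (short sample : samples) {
            buffer.putShort(sample);
        }
        return buffer.array();
    }
}
```

A typical mistake is casting each short to a single byte, which halves the data and discards the high byte of every sample.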

BeanStalka commented 7 years ago

I have another question that I am hoping you can help me with.

If pocketsphinx has been listening for a while before I get the raw data, the rawdata array gets rather large.

Is there a better way to manage this rawdata so that it only contains the audio from immediately after the wake word until the end of the utterance?

Trim the front of the array, as it were.

I was hoping there was a way to flush the array OnPartialResults when the wake word is detected. Or maybe I should just work backwards from the end of the array with timers.

I would love to hear your thoughts on this.

nshmyrev commented 7 years ago

End the utterance and start it again on every endOfSpeech.

BeanStalka commented 7 years ago

UPDATE: I am using this and it seems to have solved the issue:

    void ICMURecognizer.OnBeginningOfSpeech()
    {
        _pocketSphinxRecognizer.Decoder.EndUtt();
        _pocketSphinxRecognizer.Decoder.StartUtt();
    }

Do you foresee any issues with this approach?

nshmyrev commented 7 years ago

It is ok.

BeanStalka commented 7 years ago

I could not find a place for the StartUtt and EndUtt calls that would trim the buffer and give me the data array for what was said.

My suggestion above causes really unstable results.

Can you tell me exactly where I would need to place those method calls so that I can minimize the .SetRawdataSize(3000000) value and also minimize the buffer resetting in the middle of an utterance?

Or how can I reset the buffer in OnBeginningOfSpeech?

nshmyrev commented 7 years ago

In onEndOfSpeech, try calling recognizer.cancel() and recognizer.startListening().

BeanStalka commented 7 years ago

Unfortunately, that won't work.

If there is enough silence to fill the buffer before I say the wake word, then the buffer will contain silence only.

Once EndOfSpeech is called, recognizer.cancel() and recognizer.startListening() will clear the buffer and begin listening again.

If the buffer is full before your utterance, your utterance is not captured by it.

I have begun using the timeout and the OnTimeout method, stopping and starting listening. Unfortunately, this gives me a dead spot when the recognizer stops and starts in OnTimeout.

nshmyrev commented 7 years ago

The buffer is circular; it should contain only the latest audio data.

BeanStalka commented 7 years ago

My goal is to capture everything in the buffer after wake word detection until end of utterance.

Given the fact that the buffer is circular, how would you suggest that I accomplish this feat?

Right now I am:

1.) Setting _junkIndexBeforeWakePhrase = _pocketSphinxRecognizer.Decoder.GetRawdata().Length; during the OnPartialResult callback when the wake phrase is detected.

2.) In OnEndOfSpeech, getting the buffer and slicing out anything before _junkIndexBeforeWakePhrase.

3.) When the buffer gets full, cancelling and then restarting the recognizer.
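
The slicing in step 2 can be sketched as follows. This is a minimal illustration, assuming the buffer has not wrapped or been reset between the moment the index was recorded and the end of speech; `RawdataSlice` and `afterIndex` are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical helper for dropping audio recorded before the wake phrase. */
final class RawdataSlice {
    /** Returns the samples from startIndex to the end; out-of-range indexes are clamped. */
    static short[] afterIndex(short[] rawdata, int startIndex) {
        int from = Math.max(0, Math.min(startIndex, rawdata.length));
        return Arrays.copyOfRange(rawdata, from, rawdata.length);
    }
}
```

Clamping the index means a stale or oversized index yields an empty array instead of an exception.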

nshmyrev commented 7 years ago

I don't think you need to slice. You can just use the buffer as is; it contains something like the last several seconds of audio, and you can feed them into the recognizer.

BeanStalka commented 7 years ago

Nickolay, I ran a test and do not believe that the buffer is implemented as a circular buffer.

Is there a setting that I need to flip to get this behavior?

I made sure my utterance was timed with the end of the buffer. If it was circular, you would expect half the utterance to be at the end of the buffer and the other half to be at the front of the buffer as it wraps around.

This is not the case. I took the entire array of raw data, and saw the first half of my utterance at the end of the file, but nowhere did I see the second half of the utterance.

Please advise

Andrew Glatts
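
To make the two expectations in this test concrete, here is a toy model of both behaviors. Neither method is claimed to be what pocketsphinx actually does; `BufferModels` is a hypothetical class, and each sample is represented by its own index so the resulting layout is easy to read.

```java
import java.util.Arrays;

/** Hypothetical toy models of a fill-once buffer and a circular buffer. */
final class BufferModels {
    /** Linear model: keeps the first `capacity` samples and drops everything after. */
    static int[] fillLinear(int capacity, int totalSamples) {
        int kept = Math.min(capacity, totalSamples);
        int[] buffer = new int[kept];
        for (int i = 0; i < kept; i++) {
            buffer[i] = i;
        }
        return buffer;
    }

    /** Circular model: newer samples overwrite the oldest slots from the front. */
    static int[] fillCircular(int capacity, int totalSamples) {
        int[] buffer = new int[capacity];
        Arrays.fill(buffer, -1); // -1 marks a slot that was never written
        for (int i = 0; i < totalSamples; i++) {
            buffer[i % capacity] = i;
        }
        return buffer;
    }

    public static void main(String[] args) {
        System.out.println("linear:   " + Arrays.toString(fillLinear(8, 11)));
        System.out.println("circular: " + Arrays.toString(fillCircular(8, 11)));
    }
}
```

With a capacity of 8 and 11 samples written, the circular model ends with samples 8 to 10 at the front and 3 to 7 behind them, while the linear model holds only samples 0 to 7. The observation described above (first half of the utterance at the end, second half nowhere) matches the linear model.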
