Kitt-AI / snowboy

Future versions with the model training module will be maintained through a forked version here: https://github.com/seasalt-ai/snowboy

"Alexa style" continuous speech instruction #1

Open mph070770 opened 8 years ago

mph070770 commented 8 years ago

Hi - great software!

I have your demo working with Ubuntu. What I'd like to do is detect the keyword in continuous speech in a similar way to the Amazon echo. Is that possible? For example, this:

"Alexa, turn on the lights"

instead of

"Alexa" [ding] "turn on the lights"

Ideally, I'd also want to know where in the audio the keyword was spoken so that it can be removed from audio before I send it to an online engine (such as api.ai or AVS).

Any suggestions would be great.

Thanks

xuchen commented 8 years ago

The [ding] sound is actually a callback function you can define yourself. Here's an idea:

  1. keep an audio buffer and a global variable is_triggered = False
  2. when triggered, set is_triggered = True in your callback
  3. send any audio after this point in your buffer to AVS for speech recognition.

Does it make sense?

chenguoguo commented 8 years ago

What Xuchen said is correct. You may have to play with the audio buffer a little to make sure you send all the audio after hotword detection to the ASR.

Guoguo


chenguoguo commented 8 years ago

Looks like it has been resolved, so closing this.

mph070770 commented 8 years ago

Thanks for the feedback. Are you suggesting a new audio buffer or utilising the ring buffer?

chenguoguo commented 8 years ago

Re-opening this since there's ongoing discussion... Let me write in more detail.

  1. In order to remove the [ding] sound, you only have to modify the callback function as Xuchen said. You do not need another buffer. If your ASR server does online decoding, then you can start transmitting your audio data to the server right after the triggering of the hotword.
  2. You may need another buffer if:
     2.1. There is a delay in hotword detection. In this case, you need a buffer to keep some data from before the triggering of the hotword, so that you will have a "complete" sentence for your ASR.
     2.2. Your ASR server can only do offline decoding. In this case you need a buffer for the whole sentence after the triggering of the hotword. You will have to detect the end of the sentence (I can explain more on this if necessary), and then send the whole sentence to your ASR server (this may not be your case).
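For case 2.1, the extra buffer can be as simple as a rolling window of the most recent chunks. A sketch in Node.js; the chunk-based sizing and the `sendToASR` callback are assumptions, not part of Snowboy's API:

```javascript
// Rolling pre-trigger buffer: keep the N most recent audio chunks so a
// late-firing hotword detector still yields a "complete" sentence.
const PREROLL_CHUNKS = 5;      // how many chunks of lead-in to keep (tunable)
const preroll = [];            // rolling buffer of the most recent chunks

function onAudioChunk(chunk) {
    preroll.push(chunk);
    if (preroll.length > PREROLL_CHUNKS) {
        preroll.shift();       // drop the oldest chunk
    }
}

function onHotword(sendToASR) {
    // Start the utterance with the buffered pre-trigger audio; then keep
    // forwarding live chunks to sendToASR from this point on.
    for (const chunk of preroll) {
        sendToASR(chunk);
    }
}

// Simulate 8 chunks arriving, then a (late) hotword trigger.
for (let i = 1; i <= 8; i++) {
    onAudioChunk('chunk-' + i);
}
const sent = [];
onHotword((c) => sent.push(c));
console.log(sent);             // the 5 most recent chunks
```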

Does this solve your problem?

chenguoguo commented 8 years ago

Closing this as it has been integrated into AlexaPI. See:

https://youtu.be/wLbsAQDmN-c

https://github.com/sammachin/AlexaPi/pull/85

jwhite commented 7 years ago

I don't think this is closed. This issue is about continuous detection using a buffer. Alexa-Pi only uses the hotword-record method at this time, as far as I can tell.

chenguoguo commented 7 years ago

OK, re-opening it. What I suggested above should still stand.

dmc6297 commented 7 years ago

I did this by customizing the snowboy_index.js. In the processDetectionResult function I set a "command" flag once the hotword is detected and emit all chunks until silence is detected. Another script builds a buffer from all the chunks and sends them to Microsoft LUIS for recognition.

So you can say "Alexa turn off the lights" all in one phrase without pausing.

    _write(chunk, encoding, callback) {
        // Run hotword detection on the incoming audio chunk as usual
        const index = this.nativeInstance.RunDetection(chunk);
        this.processDetectionResult(index, chunk);

        // While the "command" flag is set, re-emit every chunk so a
        // listener can buffer the spoken command
        if (this.bufferingCommand === true) {
            this.emit('chunk', chunk, encoding);
        }
        return callback();
    }

evancohen commented 7 years ago

@dmc6297 you might want to check out Sonus. There's an implementation on the audio-buffer branch which uses a ringbuffer + stream transformation (basically what @chenguoguo described in this thread).

The only drawback with my ring buffer implementation is that it doesn't perform super well on low powered devices (Like the Pi Zero, where detection lag increases by about 1/3 of a second).

Stan92 commented 7 years ago

Hi,

I'm looking for something like this too, using Node.js, but less sophisticated :-)

@evancohen, I've seen your project; it seems it could probably satisfy my needs (except for MS Cognitive Services).

There are several steps that I can manage using two "audio buffers" (one for Snowboy, one for Bing), but I think I'm not on the right path.

This is the workflow I'd like to implement. I have several hotwords:

a) if it's "Time", "Light", ..., then I run my "local action"

b) if "Go Online" is detected, then I tell the user I'm listening

c.1) if the word/sentence doesn't exist within the Snowboy model and I'm in "listening mode", I would like to send the word/sentence online (using MS Cognitive Services)

c.2) if the word/sentence exists within the model and I'm in "listening mode", I don't want to send the data online

d) if it's "Bye", no word/sentence will be sent online until the user says "Go Online" again

e) when a silence of x seconds is detected, I need to go back "offline" (meaning no word/sentence will be sent online until the user says "Go Online")
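That workflow boils down to a small state machine around a single listening flag. A rough sketch; the hotword names and handler callbacks are hypothetical placeholders, not Snowboy's API:

```javascript
// Rough sketch of the mode switching described above.
let listening = false;                 // "offline" until "Go Online" is heard

function onHotword(word, handleLocal, onListening) {
    if (word === 'Go Online') {
        listening = true;              // b) enter listening mode
        onListening();                 // e.g. tell the user "I'm listening"
    } else if (word === 'Bye') {
        listening = false;             // d) back to offline mode
    } else {
        handleLocal(word);             // a) "Time", "Light", ... handled locally
    }
    // c.2) model hotwords always land here, so they are never sent online
}

function onUnknownSpeech(audio, sendOnline) {
    if (listening) {
        sendOnline(audio);             // c.1) not in the model -> send online
    }
    // when not listening, unknown speech is ignored
}

function onSilenceTimeout() {
    listening = false;                 // e) x seconds of silence -> offline
}
```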

Stan92 commented 7 years ago

@dmc6297 I tried your customized snowboy_index.js, but it doesn't work for me. When I save the chunks into a buffer and concatenate them into an array of bytes, the final WAV file is inaudible.

    detector.on('chunk', function (chunk, encoding) {
        if (chunk) {
            buffers.push(chunk);
            if ((new Date() - timeStart) / 1000 > timerInSecond) {
                detector.bufferingCommand = false;
                getText(buffers);
            }
        }
    });

The getText function transforms the buffers into an array of bytes with var bytes = Buffer.concat(buffers); and sends it to an API.

Could you please give me a hand? Thanks

dmc6297 commented 7 years ago

@Stan92 The data is raw PCM audio; you will need to prepend a WAV header to the buffer, or convert it to another format. This is how I made it work.

Start the command buffer

    detector.on('commandStart', function (hotwordChunk) {
        var samplesLength = 10000;

        // Standard 44-byte PCM WAV header
        var header = Buffer.alloc(44);

        // RIFF chunk descriptor
        header.write('RIFF', 0);
        // file length (placeholder; most decoders tolerate an inexact value)
        header.writeUInt32LE(32 + samplesLength * 2, 4);
        header.write('WAVE', 8);

        // format chunk identifier
        header.write('fmt ', 12);
        // format chunk length
        header.writeUInt32LE(16, 16);
        // sample format (1 = raw PCM)
        header.writeUInt16LE(1, 20);
        // channel count
        header.writeUInt16LE(detector.numChannels(), 22);
        // sample rate
        header.writeUInt32LE(detector.sampleRate(), 24);
        // byte rate (sample rate * block align)
        header.writeUInt32LE(32000, 28);
        // block align (channel count * bytes per sample)
        header.writeUInt16LE(2, 32);
        // bits per sample
        header.writeUInt16LE(16, 34);

        // data chunk identifier
        header.write('data', 36);
        // data chunk length (placeholder)
        header.writeUInt32LE(15728640, 40);

        audioCommandBuffer = header;

        // Comment this out to omit the hotword chunk of audio
        audioCommandBuffer = Buffer.concat([audioCommandBuffer, hotwordChunk]);
    });

Append to the buffer

    detector.on('chunk', function (chunk, encoding) {
        audioCommandBuffer = Buffer.concat([audioCommandBuffer, chunk]);
    });

And to output the buffer to a file

    detector.on('commandStop', function () {
        fs.writeFile('/home/pi/Speech/audio.wav', audioCommandBuffer, function (err) {
            if (err) throw err;
        });
    });

Stan92 commented 7 years ago

@dmc6297 ... I don't know how to thank you... :-) I'll give it a try ASAP. Thanks once again!

zikphil commented 7 years ago

Hey you guys, I think this thread is exactly what I'm trying to do, but in Python. On top of being able to say the full sentence without stopping, I'd also like the ability to keep a 3-second buffer from before HWD kicks in, so I can say things like "Goodnight Snowboy" or "What do you think, Snowboy" through the Google Speech API. Any suggestions on how to achieve that?

chenguoguo commented 7 years ago

As you said, you can maintain a buffer from before the hotword; when the hotword is detected, send the buffer to the Google Speech API and see if there's anything meaningful there.
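A sketch of such a pre-hotword buffer in Node.js (the thread's language; a Python port using collections.deque is mechanical). The 16 kHz, 16-bit, mono PCM format and the `sendToSpeechAPI` callback are assumptions:

```javascript
// Keep roughly the last 3 seconds of audio preceding the hotword.
const SECONDS = 3;
const BYTES_PER_SECOND = 16000 * 2;        // sample rate * bytes per sample
const MAX_BYTES = SECONDS * BYTES_PER_SECOND;

let preBuffer = Buffer.alloc(0);

function onChunk(chunk) {
    preBuffer = Buffer.concat([preBuffer, chunk]);
    if (preBuffer.length > MAX_BYTES) {
        // keep only the newest MAX_BYTES bytes
        preBuffer = preBuffer.slice(preBuffer.length - MAX_BYTES);
    }
}

function onHotword(sendToSpeechAPI) {
    // The buffer holds the ~3 s *before* the trigger, so phrases like
    // "Goodnight" in "Goodnight Snowboy" are included.
    sendToSpeechAPI(preBuffer);
}
```

Concatenating on every chunk is simple but copies memory; a ring buffer (as in Sonus' audio-buffer branch mentioned below) avoids that on low-powered devices.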

sintetico82 commented 7 years ago

Can someone write an example for Node.js?

evancohen commented 7 years ago

https://github.com/evancohen/sonus/tree/audio-buffer ^ This branch has an example that uses a ring buffer

uchagani commented 6 years ago

@zikphil Were you able to get this working? I am trying to do the same thing. Any help is appreciated. thanks.