espressif / esp-va-sdk

Espressif's Voice Assistant SDK: Alexa, Google Voice Assistant, Google DialogFlow

[aia] Start conversation with WW event instead of TAP_TO_TALK doesn't work #95

Open albanoandrea opened 4 years ago

albanoandrea commented 4 years ago

Hi, we are building our custom device using aia_beta branch.

When the WW is recognized, Amazon asks the device to send audio to the server starting from a point before the Alexa WW was detected, so that the server can verify the WW itself and decide whether the "Alexa" detection was good enough (and not, say, from a commercial on TV).

I've noticed that in the state machine in va_dsp there is a WW event that should do the job, but the phrase_length is missing.

So I store the WW length somewhere and, when requested, hand it to the state machine through a get_ww_length() function:

                    case WW: {
                        size_t phrase_length = get_ww_length();
                        if (phrase_length == 0) {
                            /*XXX: Should we close the stream here?*/
                            break;
                        }
                        if (va_dsp_data.va_dsp_recognize_cb(phrase_length, WAKEWORD) == 0) {
                            struct dsp_event_data new_event = {
                                .event = GET_AUDIO
                            };
                            xQueueSend(va_dsp_data.cmd_queue, &new_event, portMAX_DELAY);
                            va_dsp_data.dsp_state = STREAMING;
                        } else {
                            ESP_LOGE(VA_DSP_TAG, "Error starting a new dialog..stopping capture");
                            _va_dsp_stop_streaming();
                        }
                        break;
                    }
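For completeness, the storage behind `get_ww_length()` in the snippet above is our own addition (the names are ours, not part of va_dsp); a minimal sketch could look like:

```c
#include <stddef.h>

/* Hypothetical helper: cache the wake-word length reported by the
 * WuW engine when the detection fires, and hand it to the WW case
 * of the va_dsp state machine on request. */
static size_t ww_phrase_length;   /* in samples, as reported by the engine */

void set_ww_length(size_t length)
{
    ww_phrase_length = length;
}

size_t get_ww_length(void)
{
    size_t length = ww_phrase_length;
    ww_phrase_length = 0;   /* consume once per detection */
    return length;
}
```

With this, the detection callback calls `set_ww_length()` at recognition time, and the WW case reads it back exactly once, so a stale length is never reused for a later dialog.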

And here is the corresponding log:

I (290754) init: WuW ALEXA received
I (290754) va_dsp: Sending start for ww command
[speech_recognizer]: New recognize request: WW length 12160
[dialog]: Dialog new
[speech_recognizer]: On Focus: Foreground
[dialog]: Entering VA_LISTENING
I (290774) [sys_playback]: Acquire
I (290774) [app_va_cb]: Dialog state is: 4
[289 seconds]: [http_transport]: Event data: {"events": [ {"header" : { "name": "MicrophoneOpened", "messageId":"0a800450-3eeb-50f1-ccb6-c93412d6b16b"}, "payload" : {"profile":"NEAR_FIELD","initiator":{"type":"WAKEWORD","payload":{"wakeWord":"ALEXA", "wakeWordIndices":{"beginOffset":102720,"endOffset":106880}}}, "offset":94720}}]}
[290 seconds]: [http_transport]: Free Memory Internal: 82452, External: 2932632
[directive_proc]: Json data: {"directives":[{"header":{"name":"CloseMicrophone","messageId":"a2bf1854-26b1-4b54-9080-11d4c1e60e4d"}}]}
[dialog]: Listen end
I (291984) va_dsp: Sending stop command
E (291984) va_dsp: Event 3 unsupported in STOPPED state
[292 seconds]: [http_transport]: Free Memory Internal: 82452, External: 2938892
[directive_proc]: Json data: {"directives":[{"header":{"name":"SetAttentionState","messageId":"44d8f713-604a-42bb-af22-69854fcbafb6"},"payload":{"state":"IDLE"}}]}
[dialog]: Stream finished
I (294124) va_dsp: Sending stop command
E (294124) va_dsp: Event 5 unsupported in STOPPED state
[dialog]: Entering VA_IDLE
I (294134) [sys_playback]: Release
I (294144) [app_va_cb]: Dialog state is: 8

When we use the TAP_TO_TALK event instead, this is the log of a call:

I (220448) init: WuW ALEXA received
I (220448) va_dsp: Sending start for tap to talk command
[speech_recognizer]: New recognize request: WW length 0
[dialog]: Dialog new
[speech_recognizer]: On Focus: Foreground
[dialog]: Entering VA_LISTENING
I (220468) [sys_playback]: Acquire
I (220468) [app_va_cb]: Dialog state is: 4
[218 seconds]: [http_transport]: Event data: {"events": [ {"header" : { "name": "MicrophoneOpened", "messageId":"6ed4a625-b857-567e-9ce5-9cff10cfd092"}, "payload" : {"profile":"NEAR_FIELD","initiator":{"type":"TAP"}, "offset":381760}}]}
[221 seconds]: [http_transport]: Free Memory Internal: 82428, External: 2931120
[directive_proc]: Json data: {"directives":[{"header":{"name":"CloseMicrophone","messageId":"e71ec15a-76b2-4d03-8b42-943a5425cbc1"}}]}
[dialog]: Listen end
I (222928) va_dsp: Sending stop command
E (222928) va_dsp: Event 3 unsupported in STOPPED state
[221 seconds]: [http_transport]: Free Memory Internal: 82428, External: 2937384
[directive_proc]: Json data: {"directives":[{"header":{"name":"SetAttentionState","messageId":"bc43fa4f-020c-4ac3-871b-30dd01e93b1a"},"payload":{"state":"THINKING"}}]}
[dialog]: Response 200
I (223058) va_dsp: Sending stop command
E (223058) va_dsp: Event 5 unsupported in STOPPED state
[dialog]: Entering VA_THINKING
..
..
..

It seems to me that the server always rejects the WW request, while with tap-to-talk it thinks and then gives the answer.

Have you ever used this feature, and do you know whether it works? Is there any advice we should follow in order to use it? (E.g. we pass as phrase_length the duration in number of samples estimated by the WuW engine, but maybe we need to pass the size in bytes.)

Thanks, Andrea

avsheth commented 3 years ago

Hi @albanoandrea Sorry about the late response. The solution available on GitHub is for evaluation purposes and uses Espressif's custom WW, which is not certified by Amazon. In a commercial product, the WW length usually comes from an external, Amazon-certified DSP. Hence we are not sending the WW length here, as it would be rejected by the cloud anyway. The certified DSP firmware requires an NDA.

albanoandrea commented 3 years ago

Hi Amit, don't worry, we have addressed and fixed the issue with your team.