espressif / esp-va-sdk

Espressif's Voice Assistant SDK: Alexa, Google Voice Assistant, Google DialogFlow

about "play song" and "sing a song" #66

Open DuHeLong opened 4 years ago

DuHeLong commented 4 years ago

Hello, when using the Alexa SDK, I ran into two problems:

1: When I say "play song", I can only see the VA_listening, VA_thinking and VA_idle states; I cannot tell when music playback starts or ends. How can I get these states?

2: When I say "sing a song", playback is smooth. But when I say "play song", playback stutters. If I call va_dsp_mic_mute(1) at that point, playback becomes smooth. I noticed that "sing a song" arrives in MP3 format while "play song" arrives in AAC format. Can you tell me why this happens?

avsheth commented 4 years ago

Hi @DuHeLong For 1. Audio playback events are handled internally by the SDK and aren't exposed to the application yet. All dialog-related events have been brought out so that anyone with custom hardware can use them to drive their LEDs or speakers, because the Alexa specification mandates some UI for these events. The specification doesn't yet mandate anything for audio-playback events that would require app- or board-specific handling. Could you please tell us whether you have any specific requirements for these events?

For 2. AAC decoding is more CPU-intensive than MP3 decoding. Since wake-word detection also runs on the host, the decoder may not be getting enough CPU cycles to play the song smoothly.
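
To illustrate point 1, here is a minimal sketch of how an application might map dialog states to an LED pattern. It is only an illustration: the enum names, `led_pattern_for_state`, `on_dialog_state_changed` and `get_led_pattern` are hypothetical and are not the SDK's actual API; only the three states named in this thread (listening, thinking, idle) are modelled, and the LED is represented by a variable rather than real GPIO code.

```c
#include <stdio.h>

/* Hypothetical dialog states, modelled on the three named in this thread.
 * These are NOT the SDK's real identifiers. */
typedef enum {
    DIALOG_IDLE,
    DIALOG_LISTENING,
    DIALOG_THINKING
} dialog_state_t;

/* Hypothetical LED patterns for a board with a single status LED. */
typedef enum {
    LED_OFF,
    LED_SOLID,
    LED_BLINK
} led_pattern_t;

/* Pure mapping from dialog state to LED pattern. Unknown states fall
 * back to LED_OFF so that a new state never leaves the LED stuck on. */
led_pattern_t led_pattern_for_state(dialog_state_t state)
{
    switch (state) {
    case DIALOG_LISTENING:
        return LED_SOLID;
    case DIALOG_THINKING:
        return LED_BLINK;
    case DIALOG_IDLE:
    default:
        return LED_OFF;
    }
}

/* Stand-in for the real LED driver: we only remember the last pattern. */
static led_pattern_t current_pattern = LED_OFF;

/* Hypothetical callback the application would invoke (or register)
 * whenever the dialog state changes. */
void on_dialog_state_changed(dialog_state_t state)
{
    current_pattern = led_pattern_for_state(state);
    printf("dialog state %d -> LED pattern %d\n", (int)state, (int)current_pattern);
}

/* Accessor so other code can query what the LED is currently showing. */
led_pattern_t get_led_pattern(void)
{
    return current_pattern;
}
```

Calling `on_dialog_state_changed(DIALOG_THINKING)` would set the pattern to `LED_BLINK`. Playback start and end are not covered because, as noted above, those events are not exposed to the application.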

DuHeLong commented 4 years ago

Thanks. For 1: our application needs these playback event states. Can you expose them to us?

DuHeLong commented 4 years ago

For 2: Alexa has two major functions, dialogue and playing songs. In our tests, whenever we use the word "play", the audio we receive is in AAC format and playback is not smooth, which makes for a very bad user experience. When you test with the word "play", do you also get AAC, or some other format?

vikramdattu commented 4 years ago

Hi @DuHeLong the solution here is a prototype. The board handles all of the wake-word (WW) detection locally, and that consumes a considerable amount of CPU.

The stutters during AAC playback occur because AAC decoding needs far more CPU than MP3 decoding.

As far as the final product is concerned, you are expected to have separate hardware for WW detection. In that case, playback should be smooth in all cases.

DuHeLong commented 4 years ago

Hello, thank you for your support. To solve this problem, we want to use two ESP32s, with one of them dedicated to wake-word detection. However, we found that when running ESP_VA_SDK, the device must be connected to the network before it enters wake-up mode. Is there any way to enter wake-up mode without a network connection?

avsheth commented 4 years ago

Hi @DuHeLong Please note that if you're looking to create an Alexa built-in product using ESP32, you need to use one of the acoustically certified DSPs, as mentioned here. LyraT-based solutions are not certified and are only for prototyping or creating a PoC. If you need further assistance with this, you can reach me at my email ID.