esphome / firmware

Holds firmware configuration files for projects that the ESPHome team provides.
https://esphome.io/projects
Apache License 2.0

Continued-conversation for esp32-s3-box-3 #173

Open jaymunro opened 6 months ago

jaymunro commented 6 months ago

Adds the following features to the s3-box-3 firmware:

(Screenshot of the feature list, 2024-03-30, attached to the PR.)

A slightly old video of the system (recorded prior to the recent update adding the conversation display), showing the continued-conversation ability, is at: https://drive.google.com/file/d/1DjV5XPmsqwHq7iph_kFEb6XFILpzw4Pt/view?usp=sharing


jaymunro commented 6 months ago

I have done a fair amount of testing but will be looking for more people to try this draft before marking it as ready for review. I also hope to make a short video to demo the features and how well they work. Continued conversation works so well, it is almost like a natural conversation.

vtolstov commented 6 months ago

Is it possible to do something like this on the Atom Echo?

jaymunro commented 6 months ago

Is it possible to do something like this on the Atom Echo?

Could be possible, but it needs to be done by someone who has the hardware. Give it a go.

jaymunro commented 6 months ago

Please note this code is still in testing and still has a lot of debug logging present, which may drain the S3's resources a bit more than usual. As more people confirm it works for them, I can start removing some of that debug code to improve efficiency.

I still have one final feature request to complete: multi-language support for the on-screen text. This will be done with just a few substitutions, so it should not impact performance.

jaymunro commented 6 months ago

To use the new facility for translating the text prompts into your own language, add this to your device's YAML:

```yaml
substitutions:
  ...
  starting_up: "Starting up..."
  wake_word_prompt: "Say \\\"Hey Jarvis\\\""
  listening_prompt: "How can I help?"
  error_prompt: "Sorry, error"
```

Change the text to your desired wording, or remove the text between the quotes if you want nothing on that particular page. Note the method of escaping a double quote: `\\\"`. Single quotes can be added without escaping.
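As an illustration of the substitution mechanism, a French localization might look like this (the wording below is purely an example, not part of the firmware):

```yaml
substitutions:
  # Example translations only; any wording works here.
  starting_up: "Démarrage..."
  wake_word_prompt: "Dites \\\"Hey Jarvis\\\""
  listening_prompt: "Comment puis-je aider ?"
  error_prompt: "Désolé, erreur"
```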

jaymunro commented 6 months ago

Updated the main description and screenshot for this PR with all the changes added above.

jaymunro commented 6 months ago

Question for the maintainers: should this be merged into the folder "wake-word-voice-assistant", or is it better to create a third version of "esp32-s3-box-3.yaml" in another folder such as "wake-word-voice-assistant-continued-conv"? I was hesitant to add a third version, as too many may create confusion, not to mention more versions to maintain.

emanuelbaltaretu commented 6 months ago

Looks great, can't wait to get my hands on the hardware, and hopefully by then it's merged into master.

DrShivang commented 6 months ago

I am waiting for it to be merged; it looks promising. I will test and leave feedback upon completion. Thanks!

jaymunro commented 6 months ago

I'll have a chance to work on those new conflicts tomorrow. I looked them over and they look simple.

jaymunro commented 6 months ago

Added variable-width text outlines (sized to the text width) to the other prompts added in this PR, to match the conversation boxes added by @jlpouffier. Also added a user-configurable outline color for the new text via the substitution `text_outline_color`.

DrShivang commented 6 months ago

@jaymunro, it's working great. Just one small issue I'm facing: "In Home Assistant" wake word detection isn't working as intended. On-device detection does. Sharing the logs for the same.

```
[D][esp-idf:000]: I (109894) AUDIO_PIPELINE: Pipeline started
[W][component:232]: Component voice_assistant took a long time for an operation (236 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][esp_adf.microphone:273]: Microphone started
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[D][select:062]: 'Wake word engine location' - Setting
[D][select:115]: 'Wake word engine location' - Set selected option to: In Home Assistant
[D][select:015]: 'Wake word engine location': Sending state In Home Assistant (index 0)
[W][component:232]: Component script took a long time for an operation (237 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][voice_assistant:523]: Event Type: 11
[D][voice_assistant:677]: Starting STT by VAD
[D][voice_assistant:523]: Event Type: 12
[D][voice_assistant:681]: STT by VAD end
[D][voice_assistant:416]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[D][voice_assistant:422]: Desired state set to AWAITING_RESPONSE
[D][esp_adf.microphone:234]: Stopping microphone
[D][voice_assistant:416]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[D][esp-idf:000]: W (123945) AUDIO_ELEMENT: IN-[filter] AEL_IO_ABORT
[D][esp-idf:000]: E (123947) AUDIO_ELEMENT: [filter] Element already stopped
[D][esp-idf:000]: W (123979) AUDIO_PIPELINE: There are no listener registered
[D][esp-idf:000]: I (123981) AUDIO_PIPELINE: audio_pipeline_unlinked
[D][esp-idf:000]: W (123981) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp-idf:000]: I (123985) I2S: DMA queue destroyed
[D][esp-idf:000]: W (123985) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[D][esp-idf:000]: W (123987) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[W][component:232]: Component voice_assistant took a long time for an operation (239 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][esp_adf.microphone:285]: Microphone stopped
[D][voice_assistant:416]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[W][component:232]: Component script took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][select:062]: 'Wake word engine location' - Setting
[D][select:115]: 'Wake word engine location' - Set selected option to: On device
[D][select:015]: 'Wake word engine location': Sending state On device (index 1)
[D][voice_assistant:523]: Event Type: 4
[D][voice_assistant:551]: Speech recognised as: " . . ."
[D][text_sensor:064]: 'Assist query': Sending state ' . . .'
[W][component:232]: Component voice_assistant took a long time for an operation (240 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][voice_assistant:523]: Event Type: 5
[D][voice_assistant:556]: Intent started
[D][voice_assistant:523]: Event Type: 6
[D][voice_assistant:523]: Event Type: 7
[D][text_sensor:064]: 'Assist reply': Sending state 'Sorry, I couldn't understand that'
[D][voice_assistant:523]: Event Type: 8
[D][voice_assistant:599]: Response URL: "http://192.168.0.110/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-us_4d30e09a66_tts.piper.wav"
[D][voice_assistant:416]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[D][voice_assistant:422]: Desired state set to STREAMING_RESPONSE
[D][voice_assistant:523]: Event Type: 2
[D][voice_assistant:613]: Assist Pipeline ended
[D][esp-idf:000]: I (139872) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8
[D][esp-idf:000]: I (139876) I2S: I2S0, MCLK output by GPIO2
[D][esp-idf:000]: I (139880) AUDIO_PIPELINE: link el->rb, el:0x3d05d254, tag:raw, rb:0x3d05d3c4
[D][esp-idf:000]: I (139882) AUDIO_ELEMENT: [raw-0x3d05d254] Element task created
[D][esp-idf:000]: I (139885) AUDIO_ELEMENT: [i2s-0x3d05cfb0] Element task created
[D][esp-idf:000]: I (139888) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[D][esp-idf:000]: I (139890) I2S_STREAM: AUDIO_STREAM_WRITER
[D][esp-idf:000]: I (139891) AUDIO_PIPELINE: Pipeline started
[W][component:232]: Component voice_assistant took a long time for an operation (268 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][select:062]: 'Wake word engine location' - Setting
[D][select:115]: 'Wake word engine location' - Set selected option to: On device
[D][select:015]: 'Wake word engine location': Sending state On device (index 1)
[D][voice_assistant:416]: State changed from STARTING_MICROPHONE to STOP_MICROPHONE
[D][voice_assistant:422]: Desired state set to IDLE
[D][voice_assistant:416]: State changed from STOP_MICROPHONE to IDLE
[W][component:232]: Component script took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[W][component:232]: Component time took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[D][esp32.preferences:114]: Saving 1 preferences to flash...
[D][esp32.preferences:143]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[W][component:232]: Component time took a long time for an operation (235 ms).
[W][component:233]: Components should block for at most 30 ms.
[I][ota:117]: Boot seems successful, resetting boot loop counter.
[D][esp32.preferences:114]: Saving 1 preferences to flash...
[D][esp32.preferences:143]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[W][component:232]: Component time took a long time for an operation (236 ms).
[W][component:233]: Components should block for at most 30 ms.
```

jaymunro commented 5 months ago

Home Assistant Wake Word Detection isn't working as intended

Thanks @DrShivang. In what way is it not working as intended? Is it freezing, not responding to wake word, not taking query, not giving a response, other?

If you are talking about the "..." in the response/query, that is something @jlpouffier put in. I think he may have intended it as a filler while listening, but I found it wasn't working nicely with the HA sensor history, so I disabled it when Continued Conversation is turned on. But this is not related to the 'On device' / 'In Home Assistant' select, so maybe you're talking about something else?

DrShivang commented 5 months ago

Wake word detection isn't working when it's changed to "In Home Assistant" from "On device". Can anyone else confirm this issue? I tried changing the models, and even using the default ones, but to no avail; for me it works with on-device wake word detection only. With the default framework it works with both on-device and in-Home-Assistant detection.

Let me know if you need any further details.


a-d-r-i-a-n-d commented 5 months ago

Hey @jaymunro, this looks great, thanks for your work. I've got an esp32-s3-box and I can help with testing if you point me in the right direction.

jaymunro commented 5 months ago

esp32-s3-box

At the moment it is set up for the Box 3, but the functionality should be easily transferable to the Box once the merge is complete. If the merge is not going to happen, I'm not sure it's worth spending the time on. Personally I think it's fantastic, and frankly I should've entered it into that competition but didn't think of it. I've no idea why there is no activity on the merge, other than busy maintainers and a lack of time to visit this thread. @jesserockz?

jaymunro commented 5 months ago

Wake word detection isn't working when it's changed to "In Home Assistant" from "On Device". If anyone can confirm this issue.

I have been able to reproduce this by moving from 'On device' to 'In Home Assistant'. If the device wakes up with 'In HA' selected it works (e.g. turning 'Mute' on and off again).

I'll try and track down why and add a fix.

jaymunro commented 5 months ago

I think that update fixes the "Home Assistant wake word detection isn't working as intended" issue. Thanks so much @DrShivang for finding it.

DrShivang commented 5 months ago

@jaymunro, thanks for the update. I'll test and report back.

Also see this great achievement by @X-Ryl669 at https://github.com/esphome/issues/issues/5296. All the sensors are functioning well and tested.

I'll see if we can integrate the two together.

X-Ryl669 commented 5 months ago

Didn't know about this PR. I'm French (but I speak English well, I think) and for me the default "Ok Nabu" almost never triggers on HA. Maybe 1 out of 10 tries, which makes the system kind of useless for its purpose. By comparison, the wake word "Hi ESP" from Espressif's default firmware works 95% of the time.

What's strange is that Assist on HA's webpage works almost perfectly for STT recognition and TTS generation. So when the wake word is actually detected, the subsequent command works most of the time.

I've started collecting samples for training my own wake word using microwakeword, and for this I need to modify the ADF component (I've worked on this part to use the latest version; see the main esphome pull request and related issue #5296).

I was wondering about a few improvements to the current code; please comment on whether you agree (or not) with me:

Short tasks should be replaced by permanent tasks

Currently, the esp-adf component creates two subcomponents (speaker and microphone). Each of these subcomponents starts a task with an audio pipeline and stops that task after processing a batch of data. This causes a lot of task creation and deletion, and a lot of allocations (since each task creates numerous buffers via malloc). As a result, after some time the system will reboot, because the allocator fails due to memory fragmentation. Also, you'll always hear a small "pop" or "click" when the speaker task is recreated: the last samples in the audio buffer aren't always faded to 0, so you get a discontinuity.

I think the tasks should be started once and kept alive for the entire runtime of the system, with state tracking implemented (so that the microphone isn't streaming while the speaker is outputting sound). I think this is possible without too many changes.

Voice assistant should have a feature to record audio

In order to train a wake word, it's absolutely required to use the same environment that will be used for actual inference (same audio pipeline, same device). Using TTS to generate wake word samples doesn't work well, since the TTS quality will likely be higher than the actual recorded sound.

The current screen isn't very useful

I think the LVGL PR (#6363) should be merged in. This would allow a real interface displaying a kind of chat window (conversation history), and would also allow some output for the "what can I say" intent. Having to store huge PNGs in the firmware for the interface is clunky; in LVGL you can store an SVG, or simply a TTF font, for the current icons.

BigBobbas commented 5 months ago


You may be interested in https://github.com/gnumpi/esphome_audio. I believe the creator of micro_wake_word has also been talking with the dev to make improvements. This component provides an ADF pipeline, so media player can now also be used on the esp-idf framework.

DrShivang commented 5 months ago


Agreed, these would be great enhancements, especially training samples generated by a voice-recording feature.

docics commented 4 months ago

Can I use this on other ESP32 boards? I currently have three ESP32-WROOM boards in use, but I would like to implement continued conversation too.

jaymunro commented 4 months ago

Can I use this on other ESP32 boards?

It should work without too many changes, but it needs to be an ESP32-S3 to run this code.

william-aqn commented 3 months ago

Great addition! I'm really looking forward to it appearing in the main branch.