ESP32-S3-Box-3 voice assistant: sound is delayed when playing sound during wake word trigger

stu247 commented 9 months ago

The problem

Using the official esp32-s3-box-3.yaml, I have added playing a sound (using speaker.play) on trigger on_wake_word_detected and another sound on trigger on_stt_vad_end. The wake work sound is not played at the right point in time. The wake word sound is played after I give a command, not before. I've also tried using triggers on_listening and on_stt_vad_start, but these did not work either. I get the impression that the wake word sound is queued until after I give the command.

In order to get sounds to play, I had to comment out BLE in esp32-s3-box-3.yaml.

Which version of ESPHome has the issue?

2023.11.6

What type of installation are you using?

pip

Which version of Home Assistant has the issue?

2023.11.3

What platform are you using?

ESP32-IDF

Board

esp32s3box

Component causing the issue

voice assistant

Example YAML snippet

I've added the yaml files to https://github.com/stu247/voice-assistant  The top level file to run is orange-box-3.yaml

Anything in the logs that might be useful for us?

[12:37:21][D][voice_assistant:189]: VAD detected speech
[12:37:21][D][voice_assistant:422]: State changed from WAITING_FOR_VAD to START_PIPELINE
[12:37:21][D][voice_assistant:428]: Desired state set to STREAMING_MICROPHONE
[12:37:21][D][voice_assistant:206]: Requesting start...
[12:37:21][D][voice_assistant:422]: State changed from START_PIPELINE to STARTING_PIPELINE
[12:37:21][D][voice_assistant:443]: Client started, streaming microphone
[12:37:21][D][voice_assistant:422]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[12:37:21][D][voice_assistant:428]: Desired state set to STREAMING_MICROPHONE
[12:37:21][D][voice_assistant:529]: Event Type: 1
[12:37:21][D][voice_assistant:532]: Assist Pipeline running
[12:37:21][D][voice_assistant:529]: Event Type: 9
[12:37:22][D][voice_assistant:529]: Event Type: 10
[12:37:22][D][voice_assistant:538]: Wake word detected
[12:37:22][D][voice_assistant:529]: Event Type: 3
[12:37:22][D][voice_assistant:543]: STT started
[12:37:22][W][component:214]: Component voice_assistant took a long time for an operation (0.17 s).
[12:37:22][W][component:215]: Components should block for at most 20-30ms.
[12:37:22][D][voice_assistant:529]: Event Type: 11
[12:37:22][D][voice_assistant:680]: Starting STT by VAD
[12:37:24][D][voice_assistant:529]: Event Type: 12
[12:37:24][D][voice_assistant:684]: STT by VAD end
[12:37:24][D][voice_assistant:422]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[12:37:24][D][voice_assistant:428]: Desired state set to AWAITING_RESPONSE
[12:37:24][D][esp_adf.microphone:256]: Stopping microphone
[12:37:24][D][voice_assistant:422]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[12:37:24][D][esp-idf:000]: W (25647) AUDIO_ELEMENT: IN-[filter] AEL_IO_ABORT
[12:37:24]
[12:37:24][D][esp-idf:000]: E (25649) AUDIO_ELEMENT: [filter] Element already stopped
[12:37:24]
[12:37:24][D][esp-idf:000]: W (25681) AUDIO_PIPELINE: There are no listener registered
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25683) AUDIO_PIPELINE: audio_pipeline_unlinked
[12:37:24][D][esp-idf:000]: W (25683) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25687) I2S: DMA queue destroyed
[12:37:24]
[12:37:24][D][esp-idf:000]: W (25687) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[12:37:24]
[12:37:24][D][esp-idf:000]: W (25689) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[12:37:24]
[12:37:24][W][component:214]: Component voice_assistant took a long time for an operation (0.18 s).
[12:37:24][W][component:215]: Components should block for at most 20-30ms.
[12:37:24][D][esp_adf.microphone:306]: Microphone stopped
[12:37:24][D][voice_assistant:422]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[12:37:24][D][esp-idf:000]: I (25813) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25817) I2S: I2S0, MCLK output by GPIO2
[12:37:24]
[12:37:24][D][debug:393]: Free Heap Size: 29128 bytes
[12:37:24][D][esp-idf:000]: I (25821) AUDIO_PIPELINE: link el->rb, el:0x3d036af4, tag:raw, rb:0x3d036c64
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25823) AUDIO_ELEMENT: [raw-0x3d036af4] Element task created
[12:37:24][D][esp-idf:000]: I (25826) AUDIO_ELEMENT: [i2s-0x3d036850] Element task created
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25828) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:16567439 Bytes, Inter:24864 Bytes, Dram:24864 Bytes
[12:37:24]
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25830) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25831) I2S_STREAM: AUDIO_STREAM_WRITER
[12:37:24]
[12:37:24][D][esp-idf:000]: I (25833) AUDIO_PIPELINE: Pipeline started
[12:37:24]
[12:37:26][D][debug:393]: Free Heap Size: 11008 bytes
[12:37:26][D][voice_assistant:529]: Event Type: 4
[12:37:26][D][voice_assistant:557]: Speech recognised as: " What's the outside temperature?"
[12:37:26][D][voice_assistant:529]: Event Type: 5
[12:37:26][D][voice_assistant:562]: Intent started
[12:37:26][D][voice_assistant:529]: Event Type: 6
[12:37:26][D][voice_assistant:529]: Event Type: 7
[12:37:26][D][voice_assistant:585]: Response: "Outside temperature is 52 °F"
[12:37:26][D][voice_assistant:529]: Event Type: 8
[12:37:26][D][voice_assistant:605]: Response URL: "https://camp15074.duckdns.org:8123/api/tts_proxy/4721ef3227d503b9aae6fcd3a9f616bafff1288c_en-us_f0d2f7854c_tts.piper.raw"
[12:37:26][D][voice_assistant:422]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[12:37:26][D][voice_assistant:428]: Desired state set to STREAMING_RESPONSE
[12:37:26][D][voice_assistant:529]: Event Type: 2
[12:37:26][D][voice_assistant:619]: Assist Pipeline ended
[12:37:26][D][voice_assistant:529]: Event Type: 98
[12:37:26][D][voice_assistant:667]: TTS stream start
[12:37:26][W][component:214]: Component voice_assistant took a long time for an operation (0.17 s).
[12:37:26][W][component:215]: Components should block for at most 20-30ms.
[12:37:28][D][voice_assistant:529]: Event Type: 99
[12:37:28][D][voice_assistant:675]: TTS stream end
[12:37:28][D][voice_assistant:293]: End of audio stream received
[12:37:28][D][voice_assistant:422]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[12:37:28][D][voice_assistant:428]: Desired state set to RESPONSE_FINISHED
[12:37:28][D][voice_assistant:325]: Speaker has finished outputting all audio
[12:37:28][D][voice_assistant:422]: State changed from RESPONSE_FINISHED to IDLE
[12:37:28][D][voice_assistant:428]: Desired state set to IDLE
[12:37:28][W][component:214]: Component voice_assistant took a long time for an operation (0.18 s).
[12:37:28][W][component:215]: Components should block for at most 20-30ms.
[12:37:28][D][voice_assistant:422]: State changed from IDLE to START_MICROPHONE
[12:37:28][D][voice_assistant:428]: Desired state set to WAIT_FOR_VAD
[12:37:28][D][voice_assistant:159]: Starting Microphone
[12:37:28][D][voice_assistant:422]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[12:37:30][D][esp-idf:000]: W (31321) AUDIO_PIPELINE: There are no listener registered
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31323) AUDIO_PIPELINE: audio_pipeline_unlinked
[12:37:30]
[12:37:30][D][esp-idf:000]: W (31325) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINAT
[12:37:30][D][esp-idf:000]: I (31327) I2S: DMA queue destroyed
[12:37:30]
[12:37:30][D][esp-idf:000]: W (31329) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[12:37:30]
[12:37:30][D][esp-idf:000]: W (31333) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31339) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[12:37:30][D][esp-idf:000]: I (31343) I2S: I2S0, MCLK output by GPIO2
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31347) AUDIO_PIPELINE: link el->rb, el:0x3d036c54, tag:i2s, rb:0x3d037068
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31349) AUDIO_PIPELINE: link el->rb, el:0x3d036dc8, tag:filter, rb:0x3d0390a8
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31352) AUDIO_ELEMENT: [i2s-0x3d036c54] Element task created
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31354) AUDIO_THREAD: The filter task allocate stack on external memory
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31357) AUDIO_ELEMENT: [filter-0x3d036dc8] Element task created
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31359) AUDIO_ELEMENT: [raw-0x3d036ef8] Element task created
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31361) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:16571727 Bytes, Inter:36392 Bytes, Dram:36392 Byte
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31363) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31365) AUDIO_ELEMENT: [filter] AEL_MSG_CMD_RESUME,state:1
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31367) RSP_FILTER: sample rate of source data : 16000, channel of source data : 2, sample rate of destination data : 16000, channel of destination data : 1
[12:37:30]
[12:37:30][D][esp-idf:000]: I (31370) AUDIO_PIPELINE: Pipeline started
[12:37:30]
[12:37:30][D][esp_adf.microphone:294]: Microphone started
[12:37:30][D][voice_assistant:422]: State changed from STARTING_MICROPHONE to WAIT_FOR_VAD
[12:37:30][D][voice_assistant:176]: Waiting for speech...
[12:37:30][D][voice_assistant:422]: State changed from WAIT_FOR_VAD to WAITING_FOR_VAD

Additional information

No response

stu247 commented 7 months ago

I am still having the same issue with esphome-2023.12.8, HA 2024.1.1 and the latest official YAML. See https://github.com/stu247/voice-assistant

stu247 commented 7 months ago

I just tried using esphome-2023.12.8 to have the sounds play without commenting out ble in the official yaml file and they did play. In a older version of esphome, the sounds would not play with ble turned on. So that's cool. However, the wake word detection sound still plays at the wrong time.

BigBobbas commented 7 months ago

I just tried using esphome-2023.12.8 to have the sounds play without commenting out ble in the official yaml file and they did play. In a older version of esphome, the sounds would not play with ble turned on. So that's cool. However, the wake word detection sound still plays at the wrong time.

it is the same with all esp devices that the 'wakeword' detected sound plays after the command is issued and not wake word, the only time it does work in the right place is if the audio is outputted to a different media player within HA. I think the reason is because the mic and speaker cannot both be active at the same time.

stu247 commented 7 months ago

When speaker.play is called, I think it would be reasonable not to expect the mic to be active while playing the sound. A potential solution to this issue is for speaker.play to turn off the mic while it is playing the sound and then turn the mic back on afterwards (assuming the mic is active when speaker.play is called).

You mentioned "a different media player within HA". I just want to make it clear that this scenario does not involve HA trying to play a sound. The sound is defined inside the ESPHome Yaml file and played via the voice_assistant on_wake_word_detected trigger.

esphome / issues