esphome / issues

Issue Tracker for ESPHome
https://esphome.io/
290 stars 35 forks source link

Voice assistant: first response from CloudTTS is never played #4985

Open s00500 opened 11 months ago

s00500 commented 11 months ago

The problem

I am running assist on a RaspiAudio Muse Proto (the config I patched yesterday, with an added switch to debug mute_pinn state, but that is fine now!)

When I use a trigger that changes an entety everything works fine, but I have a custom intent for "Whats the time" It is ony an intent that throws back a templated sentence with the time. Then I use a static sentence everything works, but when I use a template things start to be weird:

Esphome stops and does not speak the audio even though it receives the link from the HA cloud tts and everything seems fine. Then it stops to listen for the wakeword... detection will only start again when I stop and start wakeword listening using the esphome switch for wakeword in HA ....

This is really a strange issue... anyone have any ideas what to test ?

Which version of ESPHome has the issue?

2023.10.0b1

What type of installation are you using?

Home Assistant Add-on

Which version of Home Assistant has the issue?

No response

What platform are you using?

ESP32-IDF

Board

RaspiAudio Muse Proto

Component causing the issue

voice_assistant

Example YAML snippet

# Homeassistant config
# Sentences:
intents:
  Time:
    data:
      - sentences:
          - "tell me the time"
          - "whats the time"
          - "What's the time"
          - "what is the time"
          - "how late is it"

# Intentscript:
Time:
  speech:
    text: 'The Time is now {{ now().hour }} {{ now().minute }}'

Anything in the logs that might be useful for us?

[19:42:47][D][switch:012]: 'Use wake word' Turning ON.
[19:42:47][D][switch:055]: 'Use wake word': Sending state ON
[19:42:47][D][voice_assistant:313]: State changed from 0 to 1
[19:42:47][D][voice_assistant:319]: Desired state set to 3
[19:42:47][D][light:036]: 'Muse2' Setting:
[19:42:47][D][light:047]:   State: ON
[19:42:47][D][light:051]:   Brightness: 40%
[19:42:47][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[19:42:47][D][light:085]:   Transition length: 1.0s
[19:42:47][D][voice_assistant:159]: Starting Microphone
[19:42:47][D][voice_assistant:313]: State changed from 1 to 2
[19:42:47][D][esp-idf:000]: I (768248) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4

[19:42:47][D][voice_assistant:313]: State changed from 2 to 3
[19:42:47][D][voice_assistant:176]: Waiting for speech...
[19:42:47][D][voice_assistant:313]: State changed from 3 to 4
[19:42:47][D][voice_assistant:189]: VAD detected speech
[19:42:47][D][voice_assistant:313]: State changed from 4 to 5
[19:42:47][D][voice_assistant:319]: Desired state set to 7
[19:42:47][D][voice_assistant:206]: Requesting start...
[19:42:47][D][voice_assistant:313]: State changed from 5 to 6
[19:42:47][D][voice_assistant:334]: Client started, streaming microphone
[19:42:47][D][voice_assistant:313]: State changed from 6 to 7
[19:42:47][D][voice_assistant:319]: Desired state set to 7
[19:42:47][D][voice_assistant:414]: Event Type: 1
[19:42:47][D][voice_assistant:417]: Assist Pipeline running
[19:42:47][D][voice_assistant:414]: Event Type: 9
[19:42:48][D][esp32.preferences:114]: Saving 1 preferences to flash...
[19:42:48][D][esp32.preferences:143]: Saving 1 preferences to flash: 1 cached, 0 written, 0 failed
[19:42:52][D][voice_assistant:414]: Event Type: 0
[19:42:52][D][voice_assistant:414]: Event Type: 2
[19:42:52][D][voice_assistant:496]: Assist Pipeline ended
[19:42:52][D][voice_assistant:313]: State changed from 7 to 3
[19:42:52][D][voice_assistant:319]: Desired state set to 4
[19:42:52][D][switch:012]: 'mute' Turning ON.
[19:42:52][D][light:036]: 'Muse2' Setting:
[19:42:52][D][light:051]:   Brightness: 40%
[19:42:52][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[19:42:52][D][light:085]:   Transition length: 1.0s
[19:42:52][D][voice_assistant:176]: Waiting for speech...
[19:42:52][D][voice_assistant:313]: State changed from 3 to 4
[19:42:52][D][voice_assistant:189]: VAD detected speech
[19:42:52][D][voice_assistant:313]: State changed from 4 to 5
[19:42:52][D][voice_assistant:319]: Desired state set to 7
[19:42:53][D][voice_assistant:206]: Requesting start...
[19:42:53][D][voice_assistant:313]: State changed from 5 to 6
[19:42:53][D][voice_assistant:334]: Client started, streaming microphone
[19:42:53][D][voice_assistant:313]: State changed from 6 to 7
[19:42:53][D][voice_assistant:319]: Desired state set to 7
[19:42:53][D][voice_assistant:414]: Event Type: 1
[19:42:53][D][voice_assistant:417]: Assist Pipeline running
[19:42:53][D][voice_assistant:414]: Event Type: 9
[19:42:57][D][voice_assistant:414]: Event Type: 0
[19:42:57][D][voice_assistant:414]: Event Type: 2
[19:42:57][D][voice_assistant:496]: Assist Pipeline ended
[19:42:57][D][voice_assistant:313]: State changed from 7 to 3
[19:42:57][D][voice_assistant:319]: Desired state set to 4
[19:42:57][D][switch:012]: 'mute' Turning ON.
[19:42:57][D][light:036]: 'Muse2' Setting:
[19:42:57][D][light:051]:   Brightness: 40%
[19:42:57][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[19:42:57][D][light:085]:   Transition length: 1.0s
[19:42:57][D][voice_assistant:176]: Waiting for speech...
[19:42:57][D][voice_assistant:313]: State changed from 3 to 4
[19:42:58][D][voice_assistant:189]: VAD detected speech
[19:42:58][D][voice_assistant:313]: State changed from 4 to 5
[19:42:58][D][voice_assistant:319]: Desired state set to 7
[19:42:58][D][voice_assistant:206]: Requesting start...
[19:42:58][D][voice_assistant:313]: State changed from 5 to 6
[19:42:58][D][voice_assistant:334]: Client started, streaming microphone
[19:42:58][D][voice_assistant:313]: State changed from 6 to 7
[19:42:58][D][voice_assistant:319]: Desired state set to 7
[19:42:58][D][voice_assistant:414]: Event Type: 1
[19:42:58][D][voice_assistant:417]: Assist Pipeline running
[19:42:58][D][voice_assistant:414]: Event Type: 9
[19:43:03][D][voice_assistant:414]: Event Type: 10
[19:43:03][D][voice_assistant:423]: Wake word detected
[19:43:03][D][voice_assistant:414]: Event Type: 3
[19:43:03][D][voice_assistant:428]: STT Started
[19:43:03][D][switch:012]: 'mute' Turning ON.
[19:43:03][D][light:036]: 'Muse2' Setting:
[19:43:03][D][light:051]:   Brightness: 40%
[19:43:03][D][light:059]:   Red: 0%, Green: 0%, Blue: 100%
[19:43:03][D][light:109]:   Effect: 'Pulse'
[19:43:05][D][voice_assistant:414]: Event Type: 4
[19:43:05][D][voice_assistant:313]: State changed from 7 to 8
[19:43:05][D][voice_assistant:319]: Desired state set to 10
[19:43:05][D][voice_assistant:443]: Speech recognised as: "What's the time?"
[19:43:05][D][voice_assistant:313]: State changed from 8 to 9
[19:43:05][D][esp-idf:000]: I (785963) I2S: DMA queue destroyed

[19:43:05][D][voice_assistant:414]: Event Type: 5
[19:43:05][D][voice_assistant:535]: Unhandled event type: 5
[19:43:05][D][voice_assistant:313]: State changed from 9 to 10
[19:43:05][D][voice_assistant:414]: Event Type: 6
[19:43:05][D][voice_assistant:414]: Event Type: 7
[19:43:05][D][voice_assistant:466]: Response: "The Time is now 19 43"
[19:43:05][D][switch:016]: 'mute' Turning OFF.
[19:43:05][D][switch:055]: 'mute': Sending state OFF
[19:43:05][D][light:036]: 'Muse2' Setting:
[19:43:05][D][light:051]:   Brightness: 40%
[19:43:05][D][light:059]:   Red: 0%, Green: 100%, Blue: 0%
[19:43:05][D][light:085]:   Transition length: 1.0s
[19:43:05][D][voice_assistant:414]: Event Type: 8
[19:43:05][D][voice_assistant:484]: Response URL: "https://XXX.ui.nabu.casa/api/tts_proxy/f5214f61bf61a70910114c80b1e791c658468fb0_en-au_72fdaba41d_cloud.raw"
[19:43:05][D][voice_assistant:313]: State changed from 10 to 11
[19:43:05][D][voice_assistant:319]: Desired state set to 11
[19:43:05][D][voice_assistant:414]: Event Type: 2
[19:43:05][D][voice_assistant:496]: Assist Pipeline ended
[19:43:05][D][esp-idf:000]: I (786483) I2S: DMA queue destroyed

[19:43:05][D][switch:012]: 'mute' Turning ON.
[19:43:05][D][switch:055]: 'mute': Sending state ON
[19:43:05][D][light:036]: 'Muse2' Setting:
[19:43:05][D][light:051]:   Brightness: 40%
[19:43:05][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[19:43:05][D][light:085]:   Transition length: 1.0s
[19:43:05][D][light:091]:   Effect: 'None'
[19:43:06][D][voice_assistant:313]: State changed from 11 to 0
[19:43:06][D][voice_assistant:319]: Desired state set to 0
[19:43:06][D][voice_assistant:313]: State changed from 0 to 1
[19:43:06][D][voice_assistant:319]: Desired state set to 3
[19:43:06][D][voice_assistant:159]: Starting Microphone
[19:43:06][D][voice_assistant:313]: State changed from 1 to 2
[19:43:06][D][esp-idf:000]: I (786615) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4

[19:43:06][D][voice_assistant:313]: State changed from 2 to 3
[19:43:06][D][voice_assistant:176]: Waiting for speech...
[19:43:06][D][voice_assistant:313]: State changed from 3 to 4
[19:43:06][D][voice_assistant:189]: VAD detected speech
[19:43:06][D][voice_assistant:313]: State changed from 4 to 5
[19:43:06][D][voice_assistant:319]: Desired state set to 7
[19:43:06][D][voice_assistant:206]: Requesting start...
[19:43:06][D][voice_assistant:313]: State changed from 5 to 6
[19:43:06][D][voice_assistant:334]: Client started, streaming microphone
[19:43:06][D][voice_assistant:313]: State changed from 6 to 7
[19:43:06][D][voice_assistant:319]: Desired state set to 7
[19:43:06][D][voice_assistant:414]: Event Type: 1
[19:43:06][D][voice_assistant:417]: Assist Pipeline running
[19:43:06][D][voice_assistant:414]: Event Type: 9

Additional information

No response

jesserockz commented 11 months ago

This seems really strange as ESPHome does not care about the sentence type etc. It is just sent audio via udp from HA and streams that to the speaker.

Have you verified the linked audio file works on your computer?

s00500 commented 11 months ago

since I get that as a .raw ... what do I do to test it on my computer ? When I use the same input to send stuff to my sonos it works... but then I assume it uses the wav link ?

s00500 commented 11 months ago

Here is a file that worked: 2e70d00c1e5e3ff0ee3a2bda989c7cbb635ad619_en-au_72fdaba41d_cloud.raw.working.zip

And here is one that failed: 4268b851211cd06cf30fc58a2069898d5d27fec6_en-au_72fdaba41d_cloud.raw.failing.zip

Judging by the size of the files both seem to be ok... could this have to do something with timing of the response ? I assume when I add time delays to the on_tts_start: and similar handlers that they do not block the main execution ? or would they ?

s00500 commented 11 months ago

Ok, I just upgraded to Version: 2023.10.1 and this fixes that it locks up wake word detection after a fail...

But still it would not play audio on the first time

That means since my template changes every minute I can always get it to trigger properly if I just quickly ask it again.

The question is why does it not work the very first time... I assume that the first time the audio file takes longer to generate and is not cached... and for some reason if screws up... esphome actually goes into speaker playing state anyways.... I tried to experiment with loglevel to see further what is going on...

This issue feels very reproducable though.... Just use muse proto config from repo, then use homeassistant cloud as tts response in your pipeline and create some template response that changes

s00500 commented 11 months ago

Here is a full pass, first failed, second worked, third also worked (was audible): All 3 texts are the same

[23:17:09][I][app:102]: ESPHome version 2023.10.1 compiled on Oct 19 2023, 23:11:42

[23:17:09][D][voice_assistant:468]: Event Type: 0 [23:17:09][D][voice_assistant:468]: Event Type: 2 [23:17:09][D][voice_assistant:550]: Assist Pipeline ended [23:17:09][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to IDLE [23:17:09][D][voice_assistant:372]: Desired state set to IDLE [23:17:09][D][main:288]: VA: ON END [23:17:09][D][main:297]: VA: speaker stopped [23:17:09][D][switch:012]: 'mute' Turning ON. [23:17:09][D][light:036]: 'Muse2' Setting: [23:17:09][D][light:051]: Brightness: 40% [23:17:09][D][light:059]: Red: 100%, Green: 0%, Blue: 100% [23:17:09][D][light:085]: Transition length: 1.0s [23:17:09][D][voice_assistant:366]: State changed from IDLE to START_PIPELINE [23:17:09][D][voice_assistant:372]: Desired state set to START_MICROPHONE [23:17:09][D][voice_assistant:206]: Requesting start... [23:17:09][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE [23:17:09][D][voice_assistant:387]: Client started, streaming microphone [23:17:09][D][voice_assistant:366]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE [23:17:09][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE [23:17:09][D][voice_assistant:468]: Event Type: 1 [23:17:09][D][voice_assistant:471]: Assist Pipeline running [23:17:09][D][voice_assistant:468]: Event Type: 9 [23:17:11][D][voice_assistant:468]: Event Type: 10 [23:17:11][D][voice_assistant:477]: Wake word detected [23:17:11][D][voice_assistant:468]: Event Type: 3 [23:17:11][D][voice_assistant:482]: STT Started [23:17:11][D][main:463]: VA: listening [23:17:11][D][switch:012]: 'mute' Turning ON. [23:17:11][D][light:036]: 'Muse2' Setting: [23:17:11][D][light:051]: Brightness: 40% [23:17:11][D][light:059]: Red: 0%, Green: 0%, Blue: 100%

[23:17:14][D][voice_assistant:468]: Event Type: 4 [23:17:14][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE [23:17:14][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE [23:17:14][D][voice_assistant:497]: Speech recognised as: "What's the time?" [23:17:14][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE [23:17:14][V][esp-idf:000]: I (315696) I2S: DMA queue destroyed

[23:17:14][D][voice_assistant:468]: Event Type: 5 [23:17:14][D][voice_assistant:599]: Unhandled event type: 5 [23:17:14][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE [23:17:14][D][voice_assistant:468]: Event Type: 6 [23:17:14][D][voice_assistant:468]: Event Type: 7

[23:17:14][D][main:277]: VA: TTSStart [23:17:14][D][switch:016]: 'mute' Turning OFF. [23:17:14][D][switch:055]: 'mute': Sending state OFF [23:17:14][D][voice_assistant:468]: Event Type: 8 [23:17:14][D][voice_assistant:538]: Response URL: "https://XXX.ui.nabu.casa/api/tts_proxy/a26788952fdb71fd098d8c7e52253145ec678685_en-au_72fdaba41d_cloud.raw" [23:17:14][D][voice_assistant:366]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE [23:17:14][D][voice_assistant:372]: Desired state set to STREAMING_RESPONSE [23:17:14][D][main:283]: VA: TTS END [23:17:14][D][i2s_audio.speaker:161]: Starting I2S Audio Speaker [23:17:14][D][voice_assistant:468]: Event Type: 2 [23:17:14][D][voice_assistant:550]: Assist Pipeline ended [23:17:14][D][main:288]: VA: ON END [23:17:14][D][i2s_audio.speaker:164]: Started I2S Audio Speaker [23:17:14][D][i2s_audio.speaker:167]: Stopping I2S Audio Speaker [23:17:14][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker [23:17:14][D][main:297]: VA: speaker stopped [23:17:14][D][switch:012]: 'mute' Turning ON. [23:17:14][D][switch:055]: 'mute': Sending state ON [23:17:14][D][light:036]: 'Muse2' Setting: [23:17:14][D][light:051]: Brightness: 40% [23:17:14][D][light:059]: Red: 100%, Green: 0%, Blue: 100% [23:17:14][D][light:085]: Transition length: 1.0s

[23:17:15][D][i2s_audio.speaker:161]: Starting I2S Audio Speaker [23:17:15][V][esp-idf:000]: I (316687) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8

[23:17:15][D][i2s_audio.speaker:164]: Started I2S Audio Speaker [23:17:18][V][esp-idf:000]: I (320080) I2S: DMA queue destroyed

[23:17:18][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker [23:17:20][D][voice_assistant:366]: State changed from STREAMING_RESPONSE to IDLE [23:17:20][D][voice_assistant:372]: Desired state set to IDLE [23:17:20][D][voice_assistant:366]: State changed from IDLE to START_PIPELINE [23:17:20][D][voice_assistant:372]: Desired state set to START_MICROPHONE [23:17:20][D][voice_assistant:124]: microphone not running [23:17:20][D][voice_assistant:206]: Requesting start... [23:17:20][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE [23:17:20][D][voice_assistant:387]: Client started, streaming microphone [23:17:20][D][voice_assistant:366]: State changed from STARTING_PIPELINE to START_MICROPHONE [23:17:20][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE [23:17:20][D][voice_assistant:159]: Starting Microphone [23:17:20][D][voice_assistant:366]: State changed from START_MICROPHONE to STARTING_MICROPHONE [23:17:20][V][esp-idf:000]: I (322148) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4

[23:17:20][D][voice_assistant:468]: Event Type: 1 [23:17:20][D][voice_assistant:471]: Assist Pipeline running [23:17:20][D][voice_assistant:366]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE [23:17:20][D][voice_assistant:468]: Event Type: 9 [23:17:23][D][voice_assistant:468]: Event Type: 10 [23:17:23][D][voice_assistant:477]: Wake word detected [23:17:23][D][voice_assistant:468]: Event Type: 3 [23:17:23][D][voice_assistant:482]: STT Started [23:17:23][D][main:463]: VA: listening [23:17:23][D][switch:012]: 'mute' Turning ON. [23:17:23][D][light:036]: 'Muse2' Setting: [23:17:23][D][light:051]: Brightness: 40% [23:17:23][D][light:059]: Red: 0%, Green: 0%, Blue: 100%

[23:17:26][D][voice_assistant:468]: Event Type: 4 [23:17:26][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE [23:17:26][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE [23:17:26][D][voice_assistant:497]: Speech recognised as: "What's the time?" [23:17:26][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE [23:17:26][V][esp-idf:000]: I (327789) I2S: DMA queue destroyed

[23:17:26][D][voice_assistant:468]: Event Type: 5 [23:17:26][D][voice_assistant:599]: Unhandled event type: 5 [23:17:26][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE [23:17:26][D][voice_assistant:468]: Event Type: 6 [23:17:26][D][voice_assistant:468]: Event Type: 7

[23:17:26][D][main:277]: VA: TTSStart [23:17:26][D][switch:016]: 'mute' Turning OFF. [23:17:26][D][switch:055]: 'mute': Sending state OFF [23:17:26][D][voice_assistant:468]: Event Type: 8 [23:17:26][V][esp-idf:000]: I (328170) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8

[23:17:26][D][i2s_audio.speaker:164]: Started I2S Audio Speaker [23:17:30][D][i2s_audio.speaker:167]: Stopping I2S Audio Speaker [23:17:30][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker [23:17:30][D][main:297]: VA: speaker stopped [23:17:30][D][switch:012]: 'mute' Turning ON. [23:17:30][D][switch:055]: 'mute': Sending state ON [23:17:30][D][light:036]: 'Muse2' Setting: [23:17:30][D][light:051]: Brightness: 40% [23:17:30][D][light:059]: Red: 100%, Green: 0%, Blue: 100% [23:17:30][D][light:085]: Transition length: 1.0s

[23:17:32][D][voice_assistant:366]: State changed from STREAMING_RESPONSE to IDLE [23:17:32][D][voice_assistant:372]: Desired state set to IDLE [23:17:32][D][voice_assistant:366]: State changed from IDLE to START_PIPELINE [23:17:32][D][voice_assistant:372]: Desired state set to START_MICROPHONE [23:17:32][D][voice_assistant:124]: microphone not running [23:17:32][D][voice_assistant:206]: Requesting start... [23:17:32][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE [23:17:32][D][voice_assistant:124]: microphone not running [23:17:32][D][voice_assistant:387]: Client started, streaming microphone [23:17:32][D][voice_assistant:366]: State changed from STARTING_PIPELINE to START_MICROPHONE [23:17:32][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE [23:17:32][D][voice_assistant:159]: Starting Microphone [23:17:32][D][voice_assistant:366]: State changed from START_MICROPHONE to STARTING_MICROPHONE [23:17:32][V][esp-idf:000]: I (333630) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4

[23:17:32][D][voice_assistant:468]: Event Type: 1 [23:17:32][D][voice_assistant:471]: Assist Pipeline running [23:17:32][D][voice_assistant:366]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE [23:17:32][D][voice_assistant:468]: Event Type: 9 [23:17:36][D][voice_assistant:468]: Event Type: 10 [23:17:36][D][voice_assistant:477]: Wake word detected [23:17:36][D][voice_assistant:468]: Event Type: 3 [23:17:36][D][voice_assistant:482]: STT Started [23:17:36][D][main:463]: VA: listening [23:17:36][D][switch:012]: 'mute' Turning ON. [23:17:36][D][light:036]: 'Muse2' Setting: [23:17:36][D][light:051]: Brightness: 40% [23:17:36][D][light:059]: Red: 0%, Green: 0%, Blue: 100%

[23:17:39][D][voice_assistant:468]: Event Type: 4 [23:17:39][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE [23:17:39][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE [23:17:39][D][voice_assistant:497]: Speech recognised as: "What's the time?" [23:17:39][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE [23:17:39][V][esp-idf:000]: I (340570) I2S: DMA queue destroyed

[23:17:39][D][voice_assistant:468]: Event Type: 5 [23:17:39][D][voice_assistant:599]: Unhandled event type: 5 [23:17:39][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE [23:17:39][D][voice_assistant:468]: Event Type: 6 [23:17:39][D][voice_assistant:468]: Event Type: 7

[23:17:39][D][main:277]: VA: TTSStart [23:17:39][D][switch:016]: 'mute' Turning OFF. [23:17:39][D][switch:055]: 'mute': Sending state OFF [23:17:39][D][voice_assistant:468]: Event Type: 8 [23:17:39][D][voice_assistant:538]: Response URL: "https://XXX.ui.nabu.casa/api/tts_proxy/a26788952fdb71fd098d8c7e52253145ec678685_en-au_72fdaba41d_cloud.raw" [23:17:39][D][voice_assistant:366]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE [23:17:39][D][voice_assistant:372]: Desired state set to STREAMING_RESPONSE [23:17:39][D][main:283]: VA: TTS END [23:17:39][D][i2s_audio.speaker:161]: Starting I2S Audio Speaker [23:17:39][D][voice_assistant:468]: Event Type: 2 [23:17:39][D][voice_assistant:550]: Assist Pipeline ended [23:17:39][D][main:288]: VA: ON END [23:17:39][D][i2s_audio.speaker:164]: Started I2S Audio Speaker [23:17:43][D][i2s_audio.speaker:167]: Stopping I2S Audio Speaker [23:17:43][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker [23:17:43][D][main:297]: VA: speaker stopped [23:17:43][D][switch:012]: 'mute' Turning ON. [23:17:43][D][switch:055]: 'mute': Sending state ON [23:17:43][D][light:036]: 'Muse2' Setting: [23:17:43][D][light:051]: Brightness: 40% [23:17:43][D][light:059]: Red: 100%, Green: 0%, Blue: 100% [23:17:43][D][light:085]: Transition length: 1.0s

Interestingly the responseURL was not logged on the second run...

s00500 commented 11 months ago

So another interesting note here is that this also happens in a similar fashion when using the media player.

check this logs:

[00:00:34][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[00:00:34][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE
[00:00:34][D][voice_assistant:497]: Speech recognised as: "What's the time?"
[00:00:34][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[00:00:34][D][voice_assistant:468]: Event Type: 5
[00:00:34][D][voice_assistant:599]: Unhandled event type: 5
[00:00:34][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[00:00:35][D][voice_assistant:468]: Event Type: 6
[00:00:35][D][voice_assistant:468]: Event Type: 7
[00:00:35][D][voice_assistant:520]: Response: "The Time is now 0 0"
[00:00:35][D][light:036]: 'Muse2' Setting:
[00:00:35][D][light:051]:   Brightness: 100%
WARNING muse2.local: Connection error occurred: [Errno 54] Connection reset by peer
INFO Processing unexpected disconnect from ESPHome API for muse2.local
WARNING Disconnected from API
INFO Successfully connected to muse2.local
jesserockz commented 11 months ago

I believe the issue here is that the tts process takes too long on generating a new sentence and ESPHome only has a 2-3 second timer before it gives up on waiting for the audio stream to start.

When using simple commands like turn on light etc, the tts audio is cached in the filesystem and that is sent instantly

s00500 commented 11 months ago

that sounds like a good theory. If you can point me to where logic around that is placed I could test a little more on it when I have time, I have so far mainly looked at the state machine in the voice_assistant component

The thing that does nto align 100% with wour theory is that I ALSO have some issues when using the media_player not with the voiceassistant integration but manually via a lambda in the on_tts_end of voice assistant... at this point the response url is definitly created and sent... but it seems to me like the first initial request still fails somehow.

I almost all my tests ESPHome did activate the speaker for a while. but it did not make any sound...

my debugging has not been too precise yet =P But this issue must be very replaceable, does this not happen on eg the M5 Stack ?

s00500 commented 11 months ago

Ok, did play with this further. When using the media_player it actually works completely fine, no matter if you use the media_player key on the voice_assistent config or a custom lambda in the on_tts_end event

The only thing you can NOT have is any defined speaker. if a speaker AND a media player are defined then it desconnects everytime it is asked to play by the voice assistant.

So that means currently it works with the mediaplayer but NOT with the spealer integration. there it only woks after the esp has already played the file once strangely