s00500 opened 11 months ago
This seems really strange, as ESPHome does not care about the sentence type etc. It is just sent audio via UDP from HA and streams that to the speaker.
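A toy sketch of that streaming model, as I understand it (everything here is invented for illustration: the port, the chunk size, the idle timeout, and the `play` stub are not ESPHome's real values). The point is that nothing about the TTS sentence is parsed on the device; whatever PCM arrives is forwarded as-is:

```python
import socket

def stream_udp_to_speaker(port: int, play, idle_timeout: float = 1.0) -> int:
    """Receive raw audio datagrams and hand each one straight to `play`.

    Toy model only: port, chunk size, and timeout are made up. No parsing of
    the TTS response happens here; incoming PCM is forwarded unchanged.
    Returns the number of chunks forwarded.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    sock.settimeout(idle_timeout)
    chunks = 0
    try:
        while True:
            data, _addr = sock.recvfrom(1024)  # raw PCM bytes, no framing
            play(data)                          # i.e. write to the I2S speaker
            chunks += 1
    except socket.timeout:
        return chunks                           # stream went quiet -> stop
    finally:
        sock.close()
```

So if the device ends up silent, the interesting question is whether any datagrams arrived at all, not what the sentence contained.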
Have you verified the linked audio file works on your computer?
Since I get that as a .raw file, what do I do to test it on my computer? When I use the same input to send audio to my Sonos it works... but I assume that uses the WAV link?
Here is a file that worked: 2e70d00c1e5e3ff0ee3a2bda989c7cbb635ad619_en-au_72fdaba41d_cloud.raw.working.zip
And here is one that failed: 4268b851211cd06cf30fc58a2069898d5d27fec6_en-au_72fdaba41d_cloud.raw.failing.zip
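To check one of those .raw files on a computer: the download is headerless PCM, so most players won't open it directly. A minimal sketch that wraps it in a WAV container, assuming the usual Assist output format of 16-bit signed little-endian mono at 16 kHz (that format is an assumption on my part; adjust the parameters if playback sounds wrong, and the filenames below are just placeholders):

```python
import wave

def raw_to_wav(raw_path: str, wav_path: str,
               rate: int = 16000, channels: int = 1, sampwidth: int = 2) -> None:
    """Wrap headerless PCM in a WAV container so a normal player can open it.

    The defaults (16 kHz, mono, 16-bit) are assumed, not confirmed; tweak
    them if the result sounds too fast, too slow, or like static.
    """
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sampwidth)
        w.setframerate(rate)
        w.writeframes(pcm)

# Placeholder names, not the actual attachments above:
# raw_to_wav("response.raw", "response.wav")
```

If both the working and the failing file convert and play fine, that would point away from the audio data itself and toward timing.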
Judging by the size of the files, both seem to be OK... could this have something to do with the timing of the response? I assume that when I add time delays to the on_tts_start: and similar handlers, they do not block the main execution? Or would they?
Ok, I just upgraded to Version: 2023.10.1 and this fixes that it locks up wake word detection after a fail...
But still it would not play audio on the first time
That means since my template changes every minute I can always get it to trigger properly if I just quickly ask it again.
The question is why it does not work the very first time... I assume that the first time, the audio file takes longer to generate and is not cached, and for some reason it screws up... ESPHome actually goes into the speaker-playing state anyway. I tried experimenting with the log level to see further what is going on...
This issue feels very reproducible though... Just use the Muse Proto config from the repo, use Home Assistant Cloud as the TTS in your pipeline, and create some template response that changes.
Here is a full pass: the first attempt failed, the second worked, the third also worked (was audible). All three texts are the same.
[23:17:09][I][app:102]: ESPHome version 2023.10.1 compiled on Oct 19 2023, 23:11:42
[23:17:09][D][voice_assistant:468]: Event Type: 0
[23:17:09][D][voice_assistant:468]: Event Type: 2
[23:17:09][D][voice_assistant:550]: Assist Pipeline ended
[23:17:09][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to IDLE
[23:17:09][D][voice_assistant:372]: Desired state set to IDLE
[23:17:09][D][main:288]: VA: ON END
[23:17:09][D][main:297]: VA: speaker stopped
[23:17:09][D][switch:012]: 'mute' Turning ON.
[23:17:09][D][light:036]: 'Muse2' Setting:
[23:17:09][D][light:051]: Brightness: 40%
[23:17:09][D][light:059]: Red: 100%, Green: 0%, Blue: 100%
[23:17:09][D][light:085]: Transition length: 1.0s
[23:17:09][D][voice_assistant:366]: State changed from IDLE to START_PIPELINE
[23:17:09][D][voice_assistant:372]: Desired state set to START_MICROPHONE
[23:17:09][D][voice_assistant:206]: Requesting start...
[23:17:09][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE
[23:17:09][D][voice_assistant:387]: Client started, streaming microphone
[23:17:09][D][voice_assistant:366]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[23:17:09][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[23:17:09][D][voice_assistant:468]: Event Type: 1
[23:17:09][D][voice_assistant:471]: Assist Pipeline running
[23:17:09][D][voice_assistant:468]: Event Type: 9
[23:17:11][D][voice_assistant:468]: Event Type: 10
[23:17:11][D][voice_assistant:477]: Wake word detected
[23:17:11][D][voice_assistant:468]: Event Type: 3
[23:17:11][D][voice_assistant:482]: STT Started
[23:17:11][D][main:463]: VA: listening
[23:17:11][D][switch:012]: 'mute' Turning ON.
[23:17:11][D][light:036]: 'Muse2' Setting:
[23:17:11][D][light:051]: Brightness: 40%
[23:17:11][D][light:059]: Red: 0%, Green: 0%, Blue: 100%
[23:17:14][D][voice_assistant:468]: Event Type: 4
[23:17:14][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[23:17:14][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE
[23:17:14][D][voice_assistant:497]: Speech recognised as: "What's the time?"
[23:17:14][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[23:17:14][V][esp-idf:000]: I (315696) I2S: DMA queue destroyed
[23:17:14][D][voice_assistant:468]: Event Type: 5
[23:17:14][D][voice_assistant:599]: Unhandled event type: 5
[23:17:14][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[23:17:14][D][voice_assistant:468]: Event Type: 6
[23:17:14][D][voice_assistant:468]: Event Type: 7
[23:17:14][D][main:277]: VA: TTSStart
[23:17:14][D][switch:016]: 'mute' Turning OFF.
[23:17:14][D][switch:055]: 'mute': Sending state OFF
[23:17:14][D][voice_assistant:468]: Event Type: 8
[23:17:14][D][voice_assistant:538]: Response URL: "https://XXX.ui.nabu.casa/api/tts_proxy/a26788952fdb71fd098d8c7e52253145ec678685_en-au_72fdaba41d_cloud.raw"
[23:17:14][D][voice_assistant:366]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[23:17:14][D][voice_assistant:372]: Desired state set to STREAMING_RESPONSE
[23:17:14][D][main:283]: VA: TTS END
[23:17:14][D][i2s_audio.speaker:161]: Starting I2S Audio Speaker
[23:17:14][D][voice_assistant:468]: Event Type: 2
[23:17:14][D][voice_assistant:550]: Assist Pipeline ended
[23:17:14][D][main:288]: VA: ON END
[23:17:14][D][i2s_audio.speaker:164]: Started I2S Audio Speaker
[23:17:14][D][i2s_audio.speaker:167]: Stopping I2S Audio Speaker
[23:17:14][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker
[23:17:14][D][main:297]: VA: speaker stopped
[23:17:14][D][switch:012]: 'mute' Turning ON.
[23:17:14][D][switch:055]: 'mute': Sending state ON
[23:17:14][D][light:036]: 'Muse2' Setting:
[23:17:14][D][light:051]: Brightness: 40%
[23:17:14][D][light:059]: Red: 100%, Green: 0%, Blue: 100%
[23:17:14][D][light:085]: Transition length: 1.0s
[23:17:15][D][i2s_audio.speaker:161]: Starting I2S Audio Speaker
[23:17:15][V][esp-idf:000]: I (316687) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[23:17:15][D][i2s_audio.speaker:164]: Started I2S Audio Speaker
[23:17:18][V][esp-idf:000]: I (320080) I2S: DMA queue destroyed
[23:17:18][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker
[23:17:20][D][voice_assistant:366]: State changed from STREAMING_RESPONSE to IDLE
[23:17:20][D][voice_assistant:372]: Desired state set to IDLE
[23:17:20][D][voice_assistant:366]: State changed from IDLE to START_PIPELINE
[23:17:20][D][voice_assistant:372]: Desired state set to START_MICROPHONE
[23:17:20][D][voice_assistant:124]: microphone not running
[23:17:20][D][voice_assistant:206]: Requesting start...
[23:17:20][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE
[23:17:20][D][voice_assistant:387]: Client started, streaming microphone
[23:17:20][D][voice_assistant:366]: State changed from STARTING_PIPELINE to START_MICROPHONE
[23:17:20][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[23:17:20][D][voice_assistant:159]: Starting Microphone
[23:17:20][D][voice_assistant:366]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[23:17:20][V][esp-idf:000]: I (322148) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4
[23:17:20][D][voice_assistant:468]: Event Type: 1
[23:17:20][D][voice_assistant:471]: Assist Pipeline running
[23:17:20][D][voice_assistant:366]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[23:17:20][D][voice_assistant:468]: Event Type: 9
[23:17:23][D][voice_assistant:468]: Event Type: 10
[23:17:23][D][voice_assistant:477]: Wake word detected
[23:17:23][D][voice_assistant:468]: Event Type: 3
[23:17:23][D][voice_assistant:482]: STT Started
[23:17:23][D][main:463]: VA: listening
[23:17:23][D][switch:012]: 'mute' Turning ON.
[23:17:23][D][light:036]: 'Muse2' Setting:
[23:17:23][D][light:051]: Brightness: 40%
[23:17:23][D][light:059]: Red: 0%, Green: 0%, Blue: 100%
[23:17:26][D][voice_assistant:468]: Event Type: 4
[23:17:26][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[23:17:26][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE
[23:17:26][D][voice_assistant:497]: Speech recognised as: "What's the time?"
[23:17:26][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[23:17:26][V][esp-idf:000]: I (327789) I2S: DMA queue destroyed
[23:17:26][D][voice_assistant:468]: Event Type: 5
[23:17:26][D][voice_assistant:599]: Unhandled event type: 5
[23:17:26][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[23:17:26][D][voice_assistant:468]: Event Type: 6
[23:17:26][D][voice_assistant:468]: Event Type: 7
[23:17:26][D][main:277]: VA: TTSStart
[23:17:26][D][switch:016]: 'mute' Turning OFF.
[23:17:26][D][switch:055]: 'mute': Sending state OFF
[23:17:26][D][voice_assistant:468]: Event Type: 8
[23:17:26][V][esp-idf:000]: I (328170) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[23:17:26][D][i2s_audio.speaker:164]: Started I2S Audio Speaker
[23:17:30][D][i2s_audio.speaker:167]: Stopping I2S Audio Speaker
[23:17:30][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker
[23:17:30][D][main:297]: VA: speaker stopped
[23:17:30][D][switch:012]: 'mute' Turning ON.
[23:17:30][D][switch:055]: 'mute': Sending state ON
[23:17:30][D][light:036]: 'Muse2' Setting:
[23:17:30][D][light:051]: Brightness: 40%
[23:17:30][D][light:059]: Red: 100%, Green: 0%, Blue: 100%
[23:17:30][D][light:085]: Transition length: 1.0s
[23:17:32][D][voice_assistant:366]: State changed from STREAMING_RESPONSE to IDLE
[23:17:32][D][voice_assistant:372]: Desired state set to IDLE
[23:17:32][D][voice_assistant:366]: State changed from IDLE to START_PIPELINE
[23:17:32][D][voice_assistant:372]: Desired state set to START_MICROPHONE
[23:17:32][D][voice_assistant:124]: microphone not running
[23:17:32][D][voice_assistant:206]: Requesting start...
[23:17:32][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE
[23:17:32][D][voice_assistant:124]: microphone not running
[23:17:32][D][voice_assistant:387]: Client started, streaming microphone
[23:17:32][D][voice_assistant:366]: State changed from STARTING_PIPELINE to START_MICROPHONE
[23:17:32][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[23:17:32][D][voice_assistant:159]: Starting Microphone
[23:17:32][D][voice_assistant:366]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[23:17:32][V][esp-idf:000]: I (333630) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4
[23:17:32][D][voice_assistant:468]: Event Type: 1
[23:17:32][D][voice_assistant:471]: Assist Pipeline running
[23:17:32][D][voice_assistant:366]: State changed from STARTING_MICROPHONE to STREAMING_MICROPHONE
[23:17:32][D][voice_assistant:468]: Event Type: 9
[23:17:36][D][voice_assistant:468]: Event Type: 10
[23:17:36][D][voice_assistant:477]: Wake word detected
[23:17:36][D][voice_assistant:468]: Event Type: 3
[23:17:36][D][voice_assistant:482]: STT Started
[23:17:36][D][main:463]: VA: listening
[23:17:36][D][switch:012]: 'mute' Turning ON.
[23:17:36][D][light:036]: 'Muse2' Setting:
[23:17:36][D][light:051]: Brightness: 40%
[23:17:36][D][light:059]: Red: 0%, Green: 0%, Blue: 100%
[23:17:39][D][voice_assistant:468]: Event Type: 4
[23:17:39][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[23:17:39][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE
[23:17:39][D][voice_assistant:497]: Speech recognised as: "What's the time?"
[23:17:39][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[23:17:39][V][esp-idf:000]: I (340570) I2S: DMA queue destroyed
[23:17:39][D][voice_assistant:468]: Event Type: 5
[23:17:39][D][voice_assistant:599]: Unhandled event type: 5
[23:17:39][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[23:17:39][D][voice_assistant:468]: Event Type: 6
[23:17:39][D][voice_assistant:468]: Event Type: 7
[23:17:39][D][main:277]: VA: TTSStart
[23:17:39][D][switch:016]: 'mute' Turning OFF.
[23:17:39][D][switch:055]: 'mute': Sending state OFF
[23:17:39][D][voice_assistant:468]: Event Type: 8
[23:17:39][D][voice_assistant:538]: Response URL: "https://XXX.ui.nabu.casa/api/tts_proxy/a26788952fdb71fd098d8c7e52253145ec678685_en-au_72fdaba41d_cloud.raw"
[23:17:39][D][voice_assistant:366]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[23:17:39][D][voice_assistant:372]: Desired state set to STREAMING_RESPONSE
[23:17:39][D][main:283]: VA: TTS END
[23:17:39][D][i2s_audio.speaker:161]: Starting I2S Audio Speaker
[23:17:39][D][voice_assistant:468]: Event Type: 2
[23:17:39][D][voice_assistant:550]: Assist Pipeline ended
[23:17:39][D][main:288]: VA: ON END
[23:17:39][D][i2s_audio.speaker:164]: Started I2S Audio Speaker
[23:17:43][D][i2s_audio.speaker:167]: Stopping I2S Audio Speaker
[23:17:43][D][i2s_audio.speaker:178]: Stopped I2S Audio Speaker
[23:17:43][D][main:297]: VA: speaker stopped
[23:17:43][D][switch:012]: 'mute' Turning ON.
[23:17:43][D][switch:055]: 'mute': Sending state ON
[23:17:43][D][light:036]: 'Muse2' Setting:
[23:17:43][D][light:051]: Brightness: 40%
[23:17:43][D][light:059]: Red: 100%, Green: 0%, Blue: 100%
[23:17:43][D][light:085]: Transition length: 1.0s
Interestingly, the Response URL was not logged on the second run...
Another interesting note: this also happens in a similar fashion when using the media player.
Check these logs:
[00:00:34][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[00:00:34][D][voice_assistant:372]: Desired state set to AWAITING_RESPONSE
[00:00:34][D][voice_assistant:497]: Speech recognised as: "What's the time?"
[00:00:34][D][voice_assistant:366]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[00:00:34][D][voice_assistant:468]: Event Type: 5
[00:00:34][D][voice_assistant:599]: Unhandled event type: 5
[00:00:34][D][voice_assistant:366]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[00:00:35][D][voice_assistant:468]: Event Type: 6
[00:00:35][D][voice_assistant:468]: Event Type: 7
[00:00:35][D][voice_assistant:520]: Response: "The Time is now 0 0"
[00:00:35][D][light:036]: 'Muse2' Setting:
[00:00:35][D][light:051]: Brightness: 100%
WARNING muse2.local: Connection error occurred: [Errno 54] Connection reset by peer
INFO Processing unexpected disconnect from ESPHome API for muse2.local
WARNING Disconnected from API
INFO Successfully connected to muse2.local
I believe the issue here is that the TTS process takes too long to generate a new sentence, and ESPHome only has a 2-3 second timer before it gives up on waiting for the audio stream to start.
When using simple commands like "turn on the light", the TTS audio is cached in the filesystem and is sent instantly.
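If that theory is right, the behaviour falls out of a simple race. A toy model (all numbers here are invented for illustration, not ESPHome's real timeouts or latencies):

```python
import queue
import threading

SPEAKER_WAIT = 0.2  # invented "give up" window on the device side, in seconds

def tts_latency(cached: bool) -> float:
    # Invented numbers: generating a fresh sentence is slow, a cache hit is fast.
    return 0.05 if cached else 0.5

def attempt(cached: bool) -> bool:
    """One voice interaction: True if audio would play, False if the device
    gives up waiting before the first audio chunk arrives."""
    q: queue.Queue = queue.Queue()
    # The "server" delivers the first audio chunk after the TTS latency.
    threading.Timer(tts_latency(cached), q.put, args=(b"audio",)).start()
    try:
        q.get(timeout=SPEAKER_WAIT)  # device waits for the first chunk
        return True                  # chunk arrived in time -> audible
    except queue.Empty:
        return False                 # chunk came too late -> silent playback
```

In this model `attempt(cached=False)` fails (first ask, template just changed, cache miss) while `attempt(cached=True)` succeeds (immediate retry hits the cache), which matches what I see: silence on the first try, audio on the quick repeat.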
That sounds like a good theory. If you can point me to where that logic lives, I could test it a little more when I have time; so far I have mainly looked at the state machine in the voice_assistant component.
The thing that does not align 100% with your theory is that I ALSO have some issues when using the media_player, not with the voice assistant integration but manually via a lambda in the on_tts_end of the voice assistant... at that point the response URL is definitely created and sent... but it seems to me like the first request still fails somehow.
In almost all my tests ESPHome did activate the speaker for a while, but it did not make any sound...
My debugging has not been too precise yet =P But this issue must be very reproducible. Does this not happen on e.g. the M5Stack?
OK, I played with this further. When using the media_player it actually works completely fine, no matter whether you use the media_player key in the voice_assistant config or a custom lambda in the on_tts_end event.
The only thing you can NOT have is any defined speaker: if a speaker AND a media player are both defined, it disconnects every time the voice assistant asks it to play.
So that means it currently works with the media player but NOT with the speaker integration; there, strangely, it only works after the ESP has already played the file once.
The problem
I am running Assist on a RaspiAudio Muse Proto (the config I patched yesterday, with an added switch to debug the mute pin state, but that is fine now!)
When I use a trigger that changes an entity, everything works fine. But I have a custom intent for "What's the time"; it is only an intent that throws back a templated sentence with the time. When I use a static sentence everything works, but when I use a template things start to get weird:
ESPHome stops and does not speak the audio, even though it receives the link from the HA Cloud TTS and everything seems fine. Then it stops listening for the wake word... detection only starts again when I stop and start wake word listening using the ESPHome wake word switch in HA...
This is a really strange issue... anyone have any ideas what to test?
Which version of ESPHome has the issue?
2023.10.0b1
What type of installation are you using?
Home Assistant Add-on
Which version of Home Assistant has the issue?
No response
What platform are you using?
ESP32-IDF
Board
RaspiAudio Muse Proto
Component causing the issue
voice_assistant
Example YAML snippet
Anything in the logs that might be useful for us?
Additional information
No response