Support for ESP32-S3-BOX peripherals + voice_assistant

rpatel3001 commented 1 year ago

Describe the problem you have/What new integration you would like

Main features: support for peripherals on the ESP32-S3-BOX dev kit:

[x] ILI9342C LCD driver [already done]
- [x] Allow X mirroring, [PR merged]
- [x] support for BOX-Lite LCD driver [PR merged]
[x] TT21100 touch screen [PR merged]
- [x] Report state at poweron [added to PR]
[ ] ES7210 ADC I2C [working(?) work in progress]
- [ ] get user guide
- [ ] Get it to be more stable
[ ] ES8311 Codec I2C [mostly working work in progress]
[x] add MCLK pin to i2s_audio [PR merged]
[ ] ~ICM-42607-P IMU~ (I'm going to ignore this, I don't think most people have any use for it)

To get voice_assistant working:

[x] voice_assistant is not compatible with docker bridge networking, must use macvlan/ipvlan/host mode
[ ] get media_player to work with .raw streams (speaker component works fine)

Architectural changes to support wakeword and esp-idf framework (probably out of scope here and will be transferred to a new issue or 3 once the S3-BOX works for on-demand voice commands):

[x] Debug bootloops when using esp-idf framework
- No longer happens
[x] Add esp-adf, skainet, etc
- esp-adf added, wake word done remotely
[ ] Update i2s_audio_media_player
- [ ] Refactor current library to only handle audio streams; move i2s setup into esphome proper
- [ ] Possibly switch to an audio library that doesn't require arduino framework?

Please describe your use case for this integration and alternatives you've tried:

Use the peripherals on the board. Working on-demand voice_assistant.

Additional context

This device has recently had a bit of attention due to posts about Willow on hackernews and elsewhere. Willow is fantastic but I'd like to be able to use the full extent of existing esphome components, and I bet others would also. Adding hardware peripherals is the smallest part of this; wake word detection is the major feature missing to make esphome a viable alternative (out of scope for this feature request though).

Reference links:

https://github.com/espressif/esp-box
https://github.com/toverainc/willow
https://github.com/hugobloem/esp-ha-speech
https://github.com/espressif/esp-dev-kits/issues/24#issuecomment-781314125
https://components.espressif.com/components/espressif/es8311
https://components.espressif.com/components/espressif/es7210
https://github.com/espressif/esp-bsp/
https://github.com/espressif/esp-adf/

sehraf commented 1 year ago

I'm a bit confused. Is it possible to get the speaker to make noise beyond clicking yet?

You can stream audio to the speaker using ffmpeg https://github.com/sehraf/esphome-components#stream-audio-to-speaker-component

rpatel3001 commented 1 year ago

@richlawson Clicking around in the radio browser, I was able to play streams that are both http and https (my TTS is coming from https).

[02:24:21][D][media_player:066]:   Media URL: https://icecast.walmradio.com:8443/classic
[02:24:21][16765867][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:24:21][16766373][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 233024
[02:24:21][16766373][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[02:24:21][16766688][V][ssl_client.cpp:149] start_ssl_client(): Seeding the random number generator
[02:24:21][16766690][V][ssl_client.cpp:158] start_ssl_client(): Setting up the SSL/TLS structure...
[02:24:22][16766693][D][ssl_client.cpp:179] start_ssl_client(): WARNING: Skipping SSL Verification. INSECURE!
[02:24:22][16766702][V][ssl_client.cpp:257] start_ssl_client(): Setting hostname for TLS session...
[02:24:22][16766710][V][ssl_client.cpp:272] start_ssl_client(): Performing the SSL/TLS handshake...
[02:24:22][16767592][V][ssl_client.cpp:293] start_ssl_client(): Verifying peer X.509 certificate...
[02:24:22][16767593][V][ssl_client.cpp:301] start_ssl_client(): Certificate verified.
[02:24:22][16767596][V][ssl_client.cpp:316] start_ssl_client(): Free internal heap after TLS 189656
[02:24:22][16767604][V][ssl_client.cpp:369] send_ssl_data(): Writing HTTP request with 158 bytes...
[02:24:38][D][media_player:059]: 'Media Player' - Setting
[02:24:38][D][media_player:069]:   Volume: 1.00
[02:24:44][D][media_player:059]: 'Media Player' - Setting
[02:24:44][D][media_player:066]:   Media URL: http://www.101smoothjazz.com/101-smoothjazz.m3u
[02:24:44][16788892][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:24:47][16791711][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:24:54][D][media_player:059]: 'Media Player' - Setting
[02:24:54][D][media_player:066]:   Media URL: http://st01.dlf.de/dlf/01/128/mp3/stream.mp3
[02:24:54][16799205][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:24:55][16800058][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:25:03][D][media_player:059]: 'Media Player' - Setting
[02:25:03][D][media_player:066]:   Media URL: http://dancewave.online/dance.mp3.pls
[02:25:03][16807965][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:25:04][16809105][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:25:05][16810305][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:25:11][D][media_player:059]: 'Media Player' - Setting
[02:25:11][D][media_player:066]:   Media URL: https://icecast.walmradio.com:8443/classic
[02:25:11][16815941][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[02:25:11][16816124][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 231768
[02:25:11][16816125][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[02:25:11][16816357][V][ssl_client.cpp:149] start_ssl_client(): Seeding the random number generator
[02:25:11][16816359][V][ssl_client.cpp:158] start_ssl_client(): Setting up the SSL/TLS structure...
[02:25:11][16816362][D][ssl_client.cpp:179] start_ssl_client(): WARNING: Skipping SSL Verification. INSECURE!
[02:25:11][16816370][V][ssl_client.cpp:257] start_ssl_client(): Setting hostname for TLS session...
[02:25:11][16816378][V][ssl_client.cpp:272] start_ssl_client(): Performing the SSL/TLS handshake...
[02:25:12][16817055][V][ssl_client.cpp:293] start_ssl_client(): Verifying peer X.509 certificate...
[02:25:12][16817056][V][ssl_client.cpp:301] start_ssl_client(): Certificate verified.
[02:25:12][16817059][V][ssl_client.cpp:316] start_ssl_client(): Free internal heap after TLS 188392
[02:25:12][16817067][V][ssl_client.cpp:369] send_ssl_data(): Writing HTTP request with 158 bytes...
[02:25:42][D][media_player:059]: 'Media Player' - Setting
[02:25:42][D][media_player:066]:   Media URL: http://nashe1.hostingradio.ru/rock-128.mp3
[02:25:42][16846727][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.

What esphome and HA versions are you on? I'm on Home Assistant 2023.5.3 and esphome 2023.6.2

richlawson commented 1 year ago

@rpatel3001, I'm on the latest: HA 2023.6.3, HASS OS 10.3, ESPHome v2023.6.2.

Radio browser works when I play it on my browser (over HTTP), even for streams that I tried and failed with those errors on the Box.

Similarly, TTS works on a tablet I have set up using the mobile app, and TTS also works over a browser.

EDIT: I re-ran the install, and I noticed an error when it was setting up the audio, but it looks like it's for the mic. I must have missed them last night:

[08:01:12][C][i2s_audio:024]: Setting up I2S Audio...
[08:01:12][C][i2s_audio.microphone:016]: Setting up I2S Audio Microphone...
[08:01:12][V][esp32.preferences:059]: nvs_get_blob('372285942'): ESP_ERR_NVS_NOT_FOUND - the key might not be set yet
[08:01:12][V][wifi_esp32:039]: Enabling STA.

rpatel3001 commented 1 year ago

I suspect that error is to do with wifi, not the mic. Are you certain the volume on the media player is not set very low?

snechiporenko commented 1 year ago

I checked. Everything works, but the sound is not stable.

richlawson commented 1 year ago

I suspect that error is to do with wifi, not the mic. Are you certain the volume on the media player is not set very low?

I did check the volume.

I also saw that the Media Player was resetting back to idle within 1-2 seconds, even for a 4+ second TTS wav file. So unfortunately it's not a volume issue.

Here are logs of me successfully turning off the LCD backlight, then trying to use piper TTS, and then trying to use the Radio Browser. After trying to use the Radio Browser, it crashed this time:

```
[17:00:31][D][light:036]: 'LCD Backlight' Setting:
[17:00:31][D][light:047]: State: OFF
[17:00:31][D][light:085]: Transition length: 1.0s
[17:00:55][D][media_player:059]: 'Media Player' - Setting
[17:00:55][D][media_player:066]: Media URL: http://home-assistant.local:8123/api/tts_proxy/32b7cbdc35a6c367b425528d61d48e8570a81c95_en-us_22597d2fbc_tts.piper.wav
[17:00:55][101456][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[17:00:57][103586][E][WiFiClient.cpp:268] connect(): socket error on fd 52, errno: 104, "Connection reset by peer"
[17:00:57][103759][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[17:00:57][103808][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[17:01:39][D][media_player:059]: 'Media Player' - Setting
[17:01:39][D][media_player:066]: Media URL: https://icecast.walmradio.com:8443/classic
[17:01:39][146085][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[17:01:40][146592][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 233648
[17:01:40][146592][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[17:01:44]E (158734) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
[17:01:44]E (158734) task_wdt: - loopTask (CPU 1)
[17:01:44]E (158734) task_wdt: Tasks currently running:
[17:01:44]E (158734) task_wdt: CPU 0: IDLE
[17:01:44]E (158734) task_wdt: CPU 1: IDLE
[17:01:44]E (158734) task_wdt: Aborting.
```

```
[17:01:44]ESP-ROM:esp32s3-20210327
[17:01:44]Build:Mar 27 2021
[17:01:44]rst:0xc (RTC_SW_CPU_RST),boot:0xa (SPI_FAST_FLASH_BOOT)
[17:01:44]Saved PC:0x40377564
WARNING Decoded 0x40377564: esp_restart_noos at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/port/soc/esp32s3/system_internal.c:143 (discriminator 1)
[17:01:44]SPIWP:0xee
[17:01:44]mode:DIO, clock div:1
[17:01:44]load:0x3fce3808,len:0x43c
[17:01:44]load:0x403c9700,len:0xbec
[17:01:44]load:0x403cc700,len:0x2a3c
[17:01:44]entry 0x403c98d8
[17:01:44][ 243][I][esp32-hal-psram.c:96] psramInit(): PSRAM enabled
[17:01:44][I][logger:262]: Log initialized
[17:01:44][C][ota:469]: There have been 2 suspected unsuccessful boot attempts.
[17:01:44][D][esp32.preferences:114]: Saving 1 preferences to flash...
[17:01:44][V][esp32.preferences:126]: sync: key: 233825507, len: 4
[17:01:44][D][esp32.preferences:143]: Saving 1 preferences to flash: 0 cached, 1 written, 0 failed
[17:01:44][I][app:029]: Running through setup()...
[17:01:44][V][app:030]: Sorting components by setup priority...
[17:01:44][C][spi:023]: Setting up SPI bus...
[17:01:44][I][i2c.arduino:183]: Performing I2C bus recovery
[17:01:44][ 263][I][esp32-hal-i2c.c:75] i2cInit(): Initialising I2C Master: sda=8 scl=18 freq=100000
[17:01:44][V][i2c.arduino:048]: Scanning i2c bus for active devices...
[17:01:45][C][switch.gpio:011]: Setting up GPIO Switch 'Mute'...
[17:01:45][D][switch:016]: 'Mute' Turning OFF.
[17:01:45][D][switch:055]: 'Mute': Sending state OFF
[17:01:45][D][switch:016]: 'Mute' Turning OFF.
[17:01:45][D][binary_sensor:034]: 'Muted': Sending initial state OFF
[17:01:45][D][binary_sensor:034]: 'Settings': Sending initial state OFF
[17:01:45][C][light:035]: Setting up light 'RGB LED'...
[17:01:45][D][light:036]: 'RGB LED' Setting:
[17:01:45][D][light:041]: Color mode: RGB
[17:01:45][D][light:085]: Transition length: 1.0s
[17:01:45][C][light:035]: Setting up light 'LCD Backlight'...
[17:01:45][D][light:036]: 'LCD Backlight' Setting:
[17:01:45][D][light:041]: Color mode:
[17:01:45][D][light:047]: State: ON
[17:01:45][D][light:085]: Transition length: 1.0s
[17:01:45][C][i2s_audio:024]: Setting up I2S Audio...
[17:01:45][C][i2s_audio.microphone:016]: Setting up I2S Audio Microphone...
[17:01:45][V][esp32.preferences:059]: nvs_get_blob('372285942'): ESP_ERR_NVS_NOT_FOUND - the key might not be set yet
[17:01:45][V][wifi_esp32:039]: Enabling STA.
[17:01:45][ 710][D][WiFiGeneric.cpp:929] _eventCallback(): Arduino Event: 0 - WIFI_READY
[17:01:45][V][wifi_esp32:454]: Event: WiFi ready
[17:01:45][ 760][V][WiFiGeneric.cpp:338] _arduino_event_cb(): STA Started
[17:01:45][ 760][D][WiFiGeneric.cpp:929] _eventCallback(): Arduino Event: 2 - STA_START
[17:01:45][V][wifi_esp32:469]: Event: WiFi STA start
[17:01:51][ 6477][V][WiFiGeneric.cpp:381] _arduino_event_cb(): SCAN Done: ID: 128, Status: 0, Results: 3
[17:01:51][ 6477][D][WiFiGeneric.cpp:929] _eventCallback(): Arduino Event: 1 - SCAN_DONE
[17:01:51][V][wifi_esp32:463]: Event: WiFi Scan Done status=0 number=3 scan_id=128
```

rpatel3001 commented 1 year ago

I'm honestly not sure where to go from there - clearly something is wrong but there's a handful of folks here who it's working for. It may be the newer HA version changed something, I'll update at some point and try it out.

cptskippy commented 1 year ago

I'm honestly not sure where to go from there - clearly something is wrong but there's a handful of folks here who it's working for. It may be the newer HA version changed something, I'll update at some point and try it out.

@rpatel3001, I got my ESP32-S3-BOX working the other day using a YAML file based on yours (thank you) and didn't encounter any audio issues when I played back TTS or an MP3. Shortly after that I started hacking a custom graphics library to play around with drawing on the LCD.

Today I noticed that audio playback was choppy and would fall out of sync with the same audio playing on another device. I tried restarting the device, playing different formats, and different bitrates without any improvement. I reverted back to a YAML without my custom graphics and the issues immediately went away. It's definitely software related and not a hardware issue. I think maybe there's a buffer overflow or something writing data into the DAC's output, which would explain the audio falling out of sync.

I've done some testing and stripped down my custom code to the bare minimum and reduced the YAML as well. The code below will reliably reproduce the audio playback issues:

YAML:

substitutions:
  device_name: voice-assistant-1
  device_verbose_name: "Voice Assistant 1"

  wifi_ssid: !secret wifi_ssid
  wifi_password: !secret wifi_password
  wifi_hotspot_password: !secret wifi_hotspot_password

  api_encryption_key: !secret api_encryption_key
  ota_password: !secret ota_password

packages:
  device: !include templates/esp32.s3.template.yaml
  base: !include templates/wifi.template.yaml

esphome:
  includes:
    - display_buffer_wrapper.h

external_components:
  - source: github://pr#4793
    components: [ tt21100 ]
  - source: github://pr#4861
    components: [ es8311 ]

time:
  - platform: sntp
    id: time_sntp

font:
  - file: "gfonts://Roboto@500"
    id: font_large
    size: 70
    glyphs: "0123456789:APM."
  - file: "gfonts://Roboto@500"
    id: font_medium
    size: 30

image:
  - file: mdi:volume-off
    id: mute_icon
    resize: 40x40

spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

i2c:
  scl: GPIO18
  sda: GPIO8
  scan: true

es8311:
  address: 0x18

touchscreen:
  - platform: tt21100
    address: 0x24
    interrupt_pin: GPIO3

i2s_audio:
  i2s_lrclk_pin: GPIO47
  i2s_bclk_pin: GPIO17
  i2s_mclk_pin: GPIO2

microphone:
  - platform: i2s_audio
    id: ext_mic
    adc_type: external
    pdm: false
    i2s_din_pin: GPIO16
    bits_per_sample: 16bit

output:
  - platform: ledc
    id: lcd_backlight
    pin: GPIO45

switch:
  - platform: gpio
    id: ns4150_ctrl
    name: Mute
    pin: GPIO46
    inverted: true

light:
  - platform: monochromatic
    name: "LCD Backlight"
    output: lcd_backlight
    restore_mode: ALWAYS_ON

display:
  - platform: ili9xxx
    id: lcd
    model: s3box
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48
    # Width = 320, Height = 240
    lambda: |-
      auto bg = Color(250, 250, 250);
      auto text = Color(66, 66, 66);

      it.fill(Color::BLACK);

      auto dbw = DisplayBufferWrapper(it);
      dbw.filled_pill(0, 0, 320, 240, 10, bg);

      it.strftime(160, 65, id(font_large), text, TextAlign::CENTER, "%I:%M%p", id(time_sntp).now());
      it.strftime(160, 115, id(font_medium), text, TextAlign::CENTER, "%a, %b %e", id(time_sntp).now());
      if (id(muted).state) {
        it.image(320, 0, id(mute_icon), ImageAlign::TOP_RIGHT, text);
      }

binary_sensor:
  - platform: gpio
    id: settings
    name: "Settings"
    pin:
      number: GPIO0
      mode: INPUT_PULLUP
      inverted: true
    on_press:
      - voice_assistant.start:
    on_release:
      - voice_assistant.stop:

  - platform: gpio
    id: muted
    name: "Muted"
    pin:
      number: GPIO1
      inverted: true

media_player:
  - platform: i2s_audio
    id: ext_speaker
    name: Media Player
    dac_type: external
    i2s_dout_pin: GPIO15
    mute_pin:
      number: GPIO46
      inverted: true

voice_assistant:
  microphone: ext_mic
  media_player: ext_speaker

Custom C++ Code:

#include "esphome.h"

class DisplayBufferWrapper {
  private:
    DisplayBuffer& it;

  public:
    DisplayBufferWrapper(DisplayBuffer& displayBuffer) : it(displayBuffer) {}

    void filled_pill(int x1, int y1, int width, int height, int corner_radius, Color color = COLOR_ON) {
      it.filled_rectangle(x1+corner_radius, y1, width-corner_radius*2, height, color);
      it.filled_rectangle(x1, y1+corner_radius, width, height-corner_radius*2, color);

      // Top left
      it.filled_circle(x1+corner_radius, y1+corner_radius, corner_radius, color);
      // Top right
      it.filled_circle(x1+width-corner_radius-1, y1+corner_radius, corner_radius, color);
      // Bottom left
      it.filled_circle(x1+corner_radius, y1+height-corner_radius-1, corner_radius, color);
      // Bottom right
      it.filled_circle(x1+width-corner_radius-1, y1+height-corner_radius-1, corner_radius, color);
    }
};

Simply commenting out these two lines makes the issue go away:

      auto dbw = DisplayBufferWrapper(it);
      dbw.filled_pill(0, 0, 320, 240, 10, bg);

I hope this helps.

rpatel3001 commented 1 year ago

I think maybe it's more likely that drawing and decoding audio are both CPU intensive tasks and the esp just can't keep up. I don't know the details of the display library, but does it block while drawing? If you draw smaller/simpler/fewer shapes does it help the audio issues? I'll play with it this weekend.

cptskippy commented 1 year ago

In the example above drawing a box, two text objects, and an image doesn't cause an issue. My custom library just draws two additional boxes and four circles. So nothing terribly complex but it just might be enough.

I'll play around with it and see if that's a possibility.

cptskippy commented 1 year ago

I think maybe it's more likely that drawing and decoding audio are both CPU intensive tasks and the esp just can't keep up.

@rpatel3001 that's exactly what it is.

I moved everything into the Lambda and the issue persisted. I dropped the display refresh interval down to every 30 seconds and the choppy audio only occurred when the screen was redrawn.

rpatel3001 commented 1 year ago

Interestingly, I'm only getting the choppiness with the radio. It makes sense that the wav format HA uses for TTS (or maybe piper specifically) is easier to decode than mp3 from the web stream.

As a guess I think it's mostly the circles - doing the trig is probably pretty inefficient - but after removing those but leaving the rest I can still hear a very very slight, barely audible choppiness. Double strftime calls or the mute icon maybe? Doing anything fun with the display will need to be disabled while playing audio (maybe except for raw/wav files if that's possible to determine) or done more efficiently.

rpatel3001 commented 1 year ago

Copied from the HA issue thread:

Samples are definitely getting all the way to this->socket_->sendto() in esphome\components\voice_assistant\voice_assistant.cpp on the esphome side

But samples are not being received in the datagram_received() callback in core/homeassistant/components/esphome/voice_assistant.py

Not sure how to move forward from here besides learning how to use wireshark and installing it in the home assistant container.

devinhedge commented 1 year ago

Hiya. I’ve been monitoring this thread for a while.

I’m wanting to contribute but haven’t the time as of late.

Another approach might be to create a dummy target receiver (Mock API) that mimics the HA API and dumps the data stream to a flat file.

Additionally, it’s good to create a simulator that you can use to make sure the Mock is working correctly.

At work, we do this with hardware/firmware/software testing of APIs and IoT devices.

EDIT: This may help you isolate where the data is stopping, if there is a bug in the ESPHome or HA API.

I'm curious if the best target should be the HA API or the part of Rhasspy that receives satellite streams.

rpatel3001 commented 1 year ago

I did something similar already (https://github.com/esphome/feature-requests/issues/2239#issuecomment-1606316908), not to the point of a whole api but streaming samples. The socket write was in a slightly different location, but I've confirmed that the samples are making it to the right place. The socket library is different, so I think it's isolated to the one socket.sendto call. I did a 10 minute test with Wireshark this morning, and unless I'm missing a config option to enable udp, I didn't see anything. 3 or 4 TCP packets when voice assistant is activated, then 3 or 4 more when it's stopped. The Wireshark command was something like tshark -i vethxxxx -f "src host <esp IP>" run from the machine hosting the Docker container.

snechiporenko commented 1 year ago

I think maybe it's more likely that drawing and decoding audio are both CPU intensive tasks and the esp just can't keep up.

@rpatel3001 that's exactly what it is.

I moved everything into the Lambda and the issue persisted. I dropped the display refresh interval down to every 30 seconds and the choppy audio only occurred when the screen was redrawn.

If you enable VERBOSE level for logger, you can see something like this:

[23:26:25][V][component:204]: Component st7789v.display took a long time for an operation (0.24 s).
[23:26:25][V][component:205]: Components should block for at most 20-30ms.

So we must stop updating the display while music (voice) is playing, or speed up the display component :-)

Ref: https://github.com/esphome/esphome/pull/4956
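
A minimal sketch of the "stop updating the display while audio is playing" idea, assuming the ids lcd and ext_speaker from the YAML earlier in this thread and the 30 second refresh mentioned above. It turns off the automatic redraw and only triggers one from an interval when the media player is not playing. This is untested on the S3-BOX, and the rest of the display entry (model, pins, lambda) is assumed to stay unchanged:

```yaml
display:
  - platform: ili9xxx
    id: lcd
    update_interval: never   # no automatic redraws; the interval below triggers them
    # model, pins and lambda stay as in the config earlier in the thread

interval:
  - interval: 30s
    then:
      - if:
          condition:
            not:
              media_player.is_playing: ext_speaker
          then:
            - component.update: lcd   # redraw only while no audio is playing
```

This only avoids the collision between drawing and decoding. It does not make the drawing itself any cheaper, so the clock on screen will be stale during long playback.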

hamishfagg commented 1 year ago

FWIW there's a ES7210 arduino example here, in the repo for the t-embed device: https://github.com/Xinyuan-LilyGO/T-Embed/tree/main/example/mic

gsgxnet commented 1 year ago

@rpatel3001 that's exactly what it is. I moved everything into the Lambda and the issue persisted. I dropped the display refresh interval down to every 30 seconds and the choppy audio only occurred when the screen was redrawn.

If you enable VERBOSE level for logger, you can see something like this:

[23:26:25][V][component:204]: Component st7789v.display took a long time for an operation (0.24 s).
[23:26:25][V][component:205]: Components should block for at most 20-30ms.

So, we must stop update display when music (voice) is playing. Or speedup display component :-)

Ref: esphome/esphome#4956

The ESP32-S3 is the strongest ESP32 system Espressif is offering, isn't it? At least they state so: esp32-s3
Like many ESP32 SoCs, it comes with 2 cores. I guess ESPHome cannot assign the drawing component to one core and the sound playback to the other? I do not understand too many details of the ESP32 architecture. Are both cores available for firmware code, or is one core reserved for WiFi, BT etc. only?

Espressif states too:

AI Acceleration Support ESP32-S3 has additional support for vector instructions in the MCU, which provides acceleration for neural network computing and signal processing workloads. Developers can take advantage of these vector instructions through ESP-DSP and ESP-NN libraries to optimize their applications. ESP-WHO and ESP-Skainet SDKs will also support this acceleration.

For me to understand your efforts better, are you trying to get that "AI" going as well?

rpatel3001 commented 1 year ago

For me to understand your efforts better, are you trying to get that "AI" going as well?

Eventually, I would like to get that working (esp-adf, esp-skainet, etc) but that'll be a separate feature request to add wake word and echo cancellation with multiple mics and feedback. This issue is just for basic functionality of the hardware and getting voice_assistant working on this device.

snechiporenko commented 1 year ago

As far as I understand, the Arduino framework does not utilize both cores of the processor simultaneously. That means there is a single infinite loop where all the subroutines from peripheral devices are processed sequentially. The FreeRTOS system is only supported in the official esp-idf, and its support is quite limited.

gsgxnet commented 1 year ago

Also I forked your box.yaml gist to add the RGB LED that comes with the kit,

As stated elsewhere, the box.yaml, which is found in several slightly different variants here and there, needs to be slightly modified. The framework needed at this moment is:

esp32:
  board: esp32s3box
  framework:
    type: arduino
    version: latest

otherwise I get an endless boot loop when I restart the S3 box another time after the flash boot. In detail:

- The box restarts fine and works as expected after the firmware flash.
- Powering off and on again results in an endless boot loop when flashed with type: arduino without defining a version. In that case the version falls back to the default, which is recommended rather than latest.

rpatel3001 commented 1 year ago

That has not been my experience; I'm not specifying latest and my setup seems to be stable.

rpatel3001 commented 1 year ago

I have finally succeeded in doing voice commands through the s3-box!

I had to:

add the UDP port in the ports: section of the docker compose file for the home assistant container
restart the container
hardcode the UDP port in home assistant (https://github.com/home-assistant/core/blob/4ff158a105e815c2323d02cf163bc7d193f319d8/homeassistant/components/esphome/voice_assistant.py#L35) (has to be repeated whenever the container is recreated)

I suppose that means this might "just work" already with host/macvlan networking or with HA OS/HA core.
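
For reference, the first step might look like the docker compose fragment below. This is a sketch only: the service name and port 6055 are placeholders rather than values from this thread, and the port has to match whatever is hardcoded in voice_assistant.py.

```yaml
services:
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    ports:
      - "8123:8123"       # web UI
      - "6055:6055/udp"   # placeholder: must equal the UDP port hardcoded in HA
    # Alternative that should avoid the port mapping entirely
    # (drop the ports: section when using it):
    # network_mode: host
```

The /udp suffix matters here, since compose publishes ports as TCP unless told otherwise.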

Apparently media_player doesn't support piper's .raw responses, it just prints a bunch of [I][Audio.cpp:3427] playAudioData(): err bytesDecoded -1, but TTS audio responses work perfectly with the speaker component.
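
A sketch of swapping the media player for the speaker component, assuming the same pins as the box.yaml configs in this thread and an ESPHome version whose voice_assistant accepts a speaker option. The mode value is an assumption about the BOX's single speaker:

```yaml
speaker:
  - platform: i2s_audio
    id: ext_speaker
    dac_type: external
    i2s_dout_pin: GPIO15
    mode: mono            # assumption: single speaker on the BOX

voice_assistant:
  microphone: ext_mic
  speaker: ext_speaker    # replaces media_player: ext_speaker
```

The media_player entry that used the same id would have to be removed or renamed for this to validate. The amplifier pin (GPIO46, the mute_pin in the media_player config) would then probably need to be driven separately, for example by a gpio switch like the one in the earlier YAML.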

Also the decoding fails randomly and then doesn't come back until a power cycle, probably because the ADC init is still not quite right.

justinhunt1223 commented 1 year ago

I'm having issues using the config mostly provided by @rpatel3001. This same issue results in any other config as well. The assistant does not start. I've flashed it with the latest and recommended frameworks. Is there any way to get more info on why it is failing to start? Very verbose logs are telling me nothing. There are no other errors in the logs anywhere and all i2c addresses match that I can tell.

Log entries (very verbose):

[D][voice_assistant:132]: Requesting start...
[W][voice_assistant:134]: Could not request start.

YAML:

```yaml
substitutions:
  device_name: "esp-kitchen"
  mqtt_name: esp_kitchen

esphome:
  name: $device_name
  friendly_name: $device_name

esp32:
  board: esp32s3box
  framework:
    type: arduino

logger:
  level: VERY_VERBOSE

api:
  encryption:
    key: ""

ota:

mqtt:
  broker: !secret mqtt_ip
  username: !secret mqtt_username
  password: !secret mqtt_password
  discovery: true
  birth_message:
    topic: esphome/$mqtt_name/status
    payload: online
  will_message:
    topic: esphome/$mqtt_name/status
    payload: offline

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    ssid: $device_name
    password: $device_name

captive_portal:

external_components:
  - source: github://pr#4793
    components: [ tt21100 ]
  - source: github://pr#4861
    components: [ es8311 ]
  - source: github://rpatel3001/esphome@es7210
    components: [ es7210 ]

time:
  - platform: homeassistant
    id: time_ha

output:
  - platform: ledc
    id: rgb_red
    pin: GPIO39
  - platform: ledc
    id: rgb_green
    pin: GPIO40
  - platform: ledc
    id: rgb_blue
    pin: GPIO41
  - platform: ledc
    pin: GPIO45
    id: lcd_backlight

light:
  - platform: rgb
    name: RGB LED
    red: rgb_red
    green: rgb_green
    blue: rgb_blue
  - platform: monochromatic
    output: lcd_backlight
    name: "LCD Backlight"
    restore_mode: ALWAYS_ON

spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

display:
  - platform: ili9xxx
    model: s3box
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48
    id: lcd
    auto_clear_enabled: false
    # Width = 320, Height = 240
    lambda: |-
      auto bg = Color(250, 250, 250);
      auto text = Color(66, 66, 66);
      it.fill(bg);
      auto red = Color(255, 0, 0);
      auto green = Color(0, 255, 0);
      auto blue = Color(0, 0, 255);
      it.filled_rectangle(10, 170, 60, 60, red);
      it.filled_rectangle(130, 170, 60, 60, green);
      it.filled_rectangle(250, 170, 60, 60, blue);
      it.strftime(160, 65, id(font_large), text, TextAlign::CENTER, "%H:%M", id(time_ha).now());
      it.strftime(160, 115, id(font_medium), text, TextAlign::CENTER, "%a, %b %e", id(time_ha).now());
      if (id(muted).state) {
        it.image(280, 0, id(mic_mute_icon), ImageAlign::TOP_RIGHT, text);
      }
      if (id(ext_speaker).is_muted()) {
        it.image(320, 0, id(mute_icon), ImageAlign::TOP_RIGHT, text);
      }
      if (id(voice_asst)->is_running()) {
        it.image(0, 0, id(voice_icon), ImageAlign::TOP_LEFT, text);
      }

font:
  - file: "gfonts://Roboto@500"
    id: font_large
    size: 70
    glyphs: "0123456789:APM."
  - file: "gfonts://Roboto@500"
    id: font_medium
    size: 30

image:
  - file: mdi:volume-off
    id: mute_icon
    resize: 40x40
  - file: mdi:microphone-off
    id: mic_mute_icon
    resize: 40x40
  - file: mdi:account-voice
    id: voice_icon
    resize: 40x40

i2c:
  scl: GPIO18
  sda: GPIO8
  scan: true

touchscreen:
  - platform: tt21100
    address: 0x24
    interrupt_pin: GPIO3
    # Don't use as the reset pin is shared with the display, so the ili9xxx will perform the reset
    #reset_pin: GPIO48

binary_sensor:
  - platform: gpio
    pin:
      number: GPIO0
      mode: INPUT_PULLUP
    id: settings
    name: "Settings"
    on_press:
      - if:
          condition: voice_assistant.is_running
          then:
            - voice_assistant.stop:
          else:
            - voice_assistant.start_continuous:
  - platform: gpio
    pin:
      number: GPIO1
      inverted: true
    id: muted
    name: "Muted"
  - platform: tt21100
    name: "Home"
    index: 0
  - platform: touchscreen
    name: "Red"
    x_min: 10
    x_max: 70
    y_min: 170
    y_max: 230
  - platform: touchscreen
    name: "Green"
    x_min: 130
    x_max: 190
    y_min: 170
    y_max: 230
  - platform: touchscreen
    name: "Blue"
    x_min: 250
    x_max: 310
    y_min: 170
    y_max: 230

i2s_audio:
  i2s_lrclk_pin: GPIO47
  i2s_bclk_pin: GPIO17
  i2s_mclk_pin: GPIO2

es8311:
  address: 0x18

media_player:
  - platform: i2s_audio
    name: Media Player
    id: ext_speaker
    dac_type: external
    i2s_dout_pin: GPIO15
    mute_pin:
      number: GPIO46
      inverted: true

es7210:
  address: 0x40

microphone:
  - platform: i2s_audio
    id: ext_mic
    adc_type: external
    pdm: false
    i2s_din_pin: GPIO16
    bits_per_sample: 16bit

voice_assistant:
  id: voice_asst
  microphone: ext_mic
  media_player: ext_speaker
```

rpatel3001 commented 1 year ago

I think that means you aren't connected to the home assistant API. Is the device added to your HA instance?

justinhunt1223 commented 1 year ago

I think that means you aren't connected to the home assistant API. Is the device added to your HA instance?

It is in my home assistant instance and updates via wifi work. It populates as an MQTT device and everything else seems to work just fine.

KTibow commented 1 year ago

> an MQTT device

AFAIK Assist doesn't work over MQTT. You need to use the actual ESPHome component.

justinhunt1223 commented 1 year ago

> > an MQTT device
>
> AFAIK Assist doesn't work over MQTT. You need to use the actual ESPHome component.

I am using the ESPHome add-on.

KTibow commented 1 year ago

The ESPHome addon doesn't do anything other than let you compile a config. It doesn't integrate the devices into HA.
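For reference, a minimal sketch of the device-side config this implies. It is not a complete config: the secret name is hypothetical, and `ext_mic` is the microphone ID from the config posted earlier in the thread.

```yaml
# Minimal sketch, not a complete config: the native API is what the ESPHome
# integration in Home Assistant connects to, and voice_assistant uses that
# connection rather than MQTT.
api:
  encryption:
    key: !secret api_encryption_key  # hypothetical secret name

voice_assistant:
  microphone: ext_mic  # mic ID taken from the config posted above
```

After flashing, the device still has to be added in Home Assistant through the ESPHome integration; the add-on only builds and uploads the firmware.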

justinhunt1223 commented 1 year ago

> The ESPHome addon doesn't do anything other than let you compile a config. It doesn't integrate the devices into HA.

Ahh okay, that was the issue. Thank you!

Hellis81 commented 1 year ago

I just got my ESP Box, how do I get it into flash mode? I can't find it in ESP-Home or ESP-Home flasher.

I tried to hold the boot button when connecting but it won't work, do I need some special drivers for this?

KTibow commented 1 year ago

Generally, the process is:

- Plug the board into the machine running the ESPHome dashboard, via a USB cable that carries data
- Press upload, choose the option to plug into the computer running the dashboard, then choose the device

Hellis81 commented 1 year ago

> Generally, the process is like
>
> - Plug the thing into the device with the ESPHome dashboard, via a USB cable with data
> - Press upload, choose plug into device with dashboard, choose the device

Major facepalm, the cable I have been using for an hour or so does not have data. I (wrongly) assumed they had stopped making cables without data.

snechiporenko commented 1 year ago

Work in progress? https://github.com/esphome/esphome/pull/5230

rpatel3001 commented 1 year ago

Nice, ADF is a great stepping stone to wake word and the advanced audio stuff Willow is doing (combining multiple mics, DAC cancelation). Will monitor that PR.

qJake commented 1 year ago

I just got my ESP32-S3-BOX in the mail. I'm using the box.yaml from @rpatel3001 (thanks!).

I've added the ESPHome device into Home Assistant, and I can see in HA when I press the mic mute or "Assistant" top-left button. But I can't seem to get any audio output or get it to register my voice into STT even if I manually press the assistant button.

The screen is also blank (white), but that's a separate issue. I mostly bought this for the mic+speakers.

What step did I miss to get this to behave like a voice assistant in HA? I do have Piper and Whisper set up and if I try it with my browser, it works just fine. But I can't seem to get it to work with the ESP32 device.

KTibow commented 1 year ago

> What step did I miss to get this to behave like a voice assistant in HA?

read latest comments, rpatel3001 is still working on getting the audio sent to home assistant

rpatel3001 commented 1 year ago

Not working on it, my particular issue was resolved. It seems that you might be having a similar issue, @qJake. Your home assistant instance needs to be accessible on all ports because it uses a random port number to transfer audio samples.
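If Home Assistant runs as a Docker container, one way to make it reachable on all ports is host networking; the issue description notes that voice_assistant is not compatible with Docker bridge networking. A minimal Docker Compose sketch, assuming a Home Assistant Container install (the service name and host path are illustrative):

```yaml
services:
  homeassistant:                    # illustrative service name
    image: ghcr.io/home-assistant/home-assistant:stable
    network_mode: host              # bridge mode only forwards explicitly published ports
    volumes:
      - ./config:/config            # illustrative host path
    restart: unless-stopped
```

macvlan and ipvlan are the other network modes listed as working in the issue description.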

KTibow commented 1 year ago

huh it was resolved? what dependencies do you need to upgrade to resolve it?

rpatel3001 commented 1 year ago

https://github.com/esphome/feature-requests/issues/2239#issuecomment-1617250066

qJake commented 1 year ago

@rpatel3001 I use HA OS, so I have Home Assistant running in a VM on my hypervisor. I can send audio to the ESP32 box (cloud TTS works, as you pointed out, Piper does not work yet, plus it's about 5x as slow anyway).

If my HA instance is accessible over the internet, is there a port I need to forward / hard-code somewhere to get it to work?

jesserockz commented 1 year ago

Just to chime in here. I have been working on getting the s3-box working as a Voice Assistant for Home Assistant. The ongoing work is here esphome/esphome#5229 and esphome/esphome#5230. There is an example YAML file here: https://github.com/esphome/firmware/blob/main/voice-assistant/esp32-s3-box.yaml

llamaonaskateboard commented 1 year ago

@jesserockz Great work, seems to work well in VAD mode (I'm assuming remote wake word needs a dev build of HA?).

I tried to get the tt21100 touchscreen working at the same time but it seems there's a conflict between the i2c component and esp-adf where the microphone or speaker don't provide/produce any audio:

[11:48:48][C][i2c.idf:017]: Setting up I2C bus...
[11:48:48][I][i2c.idf:233]: Performing I2C bus recovery
[11:48:48][V][esp-idf:000]: I (26) gpio: GPIO[18]| InputEn: 1| OutputEn: 1| OpenDrain: 1| Pullup: 1| Pulldown: 0| Intr:0
[11:48:48][V][esp-idf:000]: I (26) gpio: GPIO[8]| InputEn: 1| OutputEn: 1| OpenDrain: 1| Pullup: 1| Pulldown: 0| Intr:0
[11:48:48][V][i2c.idf:056]: Scanning i2c bus for active devices...
[11:48:48][V][esp-idf:000]: I (56) gpio: GPIO[4]| InputEn: 0| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
[11:48:48][V][esp-idf:000]: I (56) gpio: GPIO[48]| InputEn: 0| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
[11:48:48][V][esp-idf:000]: I (57) gpio: GPIO[5]| InputEn: 0| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
[11:48:48][V][esp-idf:000]: I (383) gpio: GPIO[1]| InputEn: 1| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0
[11:48:48][I][esp_adf:015]: Start codec chip
[11:48:48][V][esp-idf:000]: E (384) i2c: i2c driver install error
[11:48:48][V][esp-idf:000]: E (385) I2C_BUS: components/esp_peripherals/driver/i2c_bus/i2c_bus.c:89 (i2c_bus_write_bytes):Handle error

fredyolha commented 1 year ago

@llamaonaskateboard I managed to get the wake word working simply by installing the openwakeword add-on (https://github.com/rhasspy/hassio-addons), without any issues. If I omit the display and touch, it works perfectly; otherwise I hit the same issue as you.

ChristophCaina commented 11 months ago

hm... so I have to decide now between using the Wakeword - or the Display... That's unfortunate... And... it seems that if Bluetooth-Proxy is enabled, the Wakeword detection / PushToTalk does not work?

netweaver1970 commented 11 months ago

> hm... so I have to decide now between using the Wakeword - or the Display... That's unfortunate... And... it seems that if Bluetooth-Proxy is enabled, the Wakeword detection / PushToTalk does not work?

I seem to have the same problem (BLE proxy + VAD/wakeword coexistence) on an M5Stack Echo. I wanted to make a super-duper sensor controller out of it, and I have it running fine as a BLE proxy + mmWave sensor node. But when I additionally add the VAD/wakeword stuff (and avoid the switch behaviour clash), the sensor either doesn't do anything or goes haywire, needing a physical reboot. Sad ... :(

qJake commented 11 months ago

Adding in my experience here as well...

I'm using (nearly verbatim) the example provided by ESPHome to get the ESP32-S3-Box working as a voice assistant.

Here's what works as of Nov 2023:

- Wake word, including custom wake word trained on the wake word Colab
- Speech to text recognition with good speed and accuracy
- Text to speech using any TTS provider (Piper, HA Cloud, etc)
- Buttons and LED status lights

Here's what I'm struggling to get working:

- TTS audio output is broken using the speaker component - my console frequently gets flooded with: [W][voice_assistant:293]: Speaker buffer full.
- I can't use the ESP32 device as both a speaker and a media_player - sometimes I want to play TTS audio from Home Assistant outside the context of voice recognition (e.g. an automation announcement)
- Display does not work / can't be used at the same time
- Wake word is spotty because it seems like it shuts off the microphone after ~5s of no audio detected, and then activates again once it hears anything - which means if you go from a completely silent room to just saying the wake word, it doesn't work the first time.
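To illustrate the speaker vs. media_player point: voice_assistant accepts either a `speaker` or a `media_player` as its audio output. The sketch below is only an assumption-laden illustration using the `i2s_audio` platform, with the IDs and pin from the config posted earlier in this thread; the ESPHome example firmware mentioned above is set up differently, so treat it as a picture of the two alternatives rather than a drop-in fix.

```yaml
# Sketch only: voice_assistant takes either a speaker or a media_player
# for its audio output. IDs and the pin follow the config posted earlier
# in this thread; adjust for your own setup.

# Option A: raw speaker output
speaker:
  - platform: i2s_audio
    id: ext_speaker
    dac_type: external
    i2s_dout_pin: GPIO15

voice_assistant:
  microphone: ext_mic
  speaker: ext_speaker

# Option B: media_player output (replaces the speaker block above)
# media_player:
#   - platform: i2s_audio
#     name: Media Player
#     id: ext_speaker
#     dac_type: external
#     i2s_dout_pin: GPIO15
#
# voice_assistant:
#   microphone: ext_mic
#   media_player: ext_speaker
```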

anth-dinosaur commented 11 months ago

> Adding in my experience here as well...
>
> I'm using (nearly verbatim) the example provided by ESPHome to get the ESP32-S3-Box working as a voice assistant.
>
> Here's what works as of Nov 2023:
>
> - Wake word, including custom wake word trained on the wake word Colab
> - Speech to text recognition with good speed and accuracy
> - Text to speech using any TTS provider (Piper, HA Cloud, etc)
> - Buttons and LED status lights
>
> Here's what I'm struggling to get working:
>
> - TTS audio output is broken using the speaker component - my console frequently gets flooded with: [W][voice_assistant:293]: Speaker buffer full.
> - I can't use the ESP32 device as both a speaker and a media_player - sometimes I want to play TTS audio from Home Assistant outside the context of voice recognition (e.g. an automation announcement)
> - Display does not work / can't be used at the same time
> - Wake word is spotty because it seems like it shuts off the microphone after ~5s of no audio detected, and then activates again once it hears anything - which means if you go from a completely silent room to just saying the wake word, it doesn't work the first time.

I have all of the same results as you. I get [voice_assistant:293]: Speaker buffer full. after 1-2 responses back from the assistant. I also notice that responses stop playing a little early, about 1-2 seconds before the end of the response. Also, more critically, on power cycle most of the configuration is lost:

- The hostname is back to ESPHome Web XXXXX
- The api encryption key is gone so HA can't talk to it
- No switches/entities/etc are configured, and it is running the web_server component even though that is not in the below config
- It seems like a full wipe back to when I first "prepared" it on ESPHome Web, except that it has remembered its network settings (which were not configured with ESPHome Web, only upon flashing the config)

Used standard config as linked by HA docs (+ my wifi info): https://github.com/esphome/firmware/blob/main/voice-assistant/esp32-s3-box.yaml

Would be curious if others have the same issue?

shyney7 commented 10 months ago

Once all features are implemented will this also work with the newest model: ESP32-S3-Box3? https://github.com/espressif/esp-box/blob/master/docs/hardware_overview/esp32_s3_box_3/hardware_overview_for_box_3.md

sammcj commented 10 months ago

I have the new ESP32 S3 Box 3, currently have Willow installed but would much rather use ESPHome.

I'd be happy to try a build on it and provide feedback if it helps.
