esphome / feature-requests

ESPHome Feature Request Tracker
https://esphome.io/

Support for ESP32-S3-BOX peripherals + voice_assistant #2239

Open rpatel3001 opened 1 year ago

rpatel3001 commented 1 year ago

Describe the problem you have/What new integration you would like

Main features: support for peripherals on the ESP32-S3-BOX dev kit:

To get voice_assistant working:

Architectural changes to support wakeword and esp-idf framework (probably out of scope here and will be transferred to a new issue or 3 once the S3-BOX works for on-demand voice commands):

Please describe your use case for this integration and alternatives you've tried:

Use the peripherals on the board. Working on-demand voice_assistant.

Additional context

This device has recently had a bit of attention due to posts about Willow on hackernews and elsewhere. Willow is fantastic but I'd like to be able to use the full extent of existing esphome components, and I bet others would also. Adding hardware peripherals is the smallest part of this; wake word detection is the major feature missing to make esphome a viable alternative (out of scope for this feature request though).

Reference links: https://github.com/espressif/esp-box https://github.com/toverainc/willow https://github.com/hugobloem/esp-ha-speech https://github.com/espressif/esp-dev-kits/issues/24#issuecomment-781314125 https://components.espressif.com/components/espressif/es8311 https://components.espressif.com/components/espressif/es7210 https://github.com/espressif/esp-bsp/ https://github.com/espressif/esp-adf/

kroimon commented 1 year ago

Thanks for this overview!

I have created a component for the touchscreen already in esphome/esphome#4793 which is working fine on my Box and is ready for review.

I am currently working on the I2C control component for the ES8311 (no PR yet, trying to figure out the best solution for MCLK).

kroimon commented 1 year ago

Also, the ILI9342C driver requires some additions to allow enabling x-mirroring for the ESP32-S3-BOX. I have started implementing that, a PR will also follow. For now, you can check out my sample config linked in https://github.com/esphome/esphome/pull/4793#issuecomment-1539239430 which I will update periodically.

rpatel3001 commented 1 year ago

Awesome, nice progress. There is a very rough implementation for the ES8388 here, not sure how helpful it is as the register map is quite different and MCLK is currently hard-coded.

Also, it's worth looking into how willow and the default firmware handle MCLK. I think the codec has a mode which can derive its LRCLK and BCLK from its MCLK/SCLK and distribute them to the ADC on the board. That may be required if there is a requirement that MCLK is synchronous to LRCLK and BCLK? I'm not too familiar with I2S or the ESP32/esphome implementation of it.

Also, is it worth trying to get I2S support for the esp-idf framework as well, to make hacking in wake word stuff with esp-adf/esp-sr later easier? I haven't looked into this much but maybe not, since I think I saw an arduino framework wrapper for esp-adf somewhere.

kroimon commented 1 year ago

Yeah the ES8311 can theoretically work without a dedicated MCLK by generating it internally from the SCLK, but as the ESP32-S3-BOX has an MCLK wired to GPIO2 anyway, we should figure out the best way to implement that in esphome, I guess. Maybe @jesserockz already has plans for that?

My ES8311 branch is at https://github.com/kroimon/esphome/tree/es8311 if you're interested, but it's still a WIP and a few days away from a proper PR.

Also, getting the whole I2S stuff working on esp-idf would be great, because we could probably integrate libraries such as WakeNet much easier. However, I could not even get a very simple esphome config to run on my S3, because it kept resetting due to some watchdog. I did not debug this any further because using esp-idf wasn't of much use without the I2S components anyway.

rpatel3001 commented 1 year ago

Adding MCLK to i2s_audio seems to me like the most straightforward path for that, is there any case where two devices might share an LRCLK and BCLK but have different MCLKs? I simply added MCLK as an optional param for i2s_audio: https://github.com/esphome/esphome/compare/dev...rpatel3001:esphome:add_i2s_mclk

rpatel3001 commented 1 year ago

Also I forked your box.yaml gist to add the RGB LED that comes with the kit, invert the sense of the settings button, and add my MCLK change and your ES8311 components

rpatel3001 commented 1 year ago

i've successfully gotten home assistant to stream TTS and radio audio to the ESP-BOX using the config in my gist. the volume is quite low, though I expect your work on the codec interface will help with that.

kroimon commented 1 year ago

I probably won't have time to look into it before Friday afternoon, but that already sounds awesome!

rpatel3001 commented 1 year ago

started some ADC code at https://github.com/rpatel3001/esphome/tree/es7210

this I2S stuff makes very little sense to me right now; the frequencies I measure on the pins are not at all what the i2s components appear to configure. It's difficult to debug the ADC without access to the raw audio; trying to send it to the home assistant pipeline with whisper actually causes an error in whisper, so it's clearly doing something wrong.

also the ADC datasheet is terrible, I could only find a register map on some sketchy chinese site by googling and it's version 2.0, compared to the most recent version 23 (without registers).

rpatel3001 commented 1 year ago

dumping some thoughts here stream of consciousness style: I think ideally i2s_audio would have options for mclk frequency and sample rate, and those would be pulled into i2s_audio_media_player and i2s_audio_microphone to set up the i2s peripheral in the same way the pin numbers are currently pulled in. the DAC and ADC I2C components would need options as well to set up the chips correctly based on the clock settings.
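For checking pin measurements against the configuration, here is a rough sketch of the nominal I2S clock relationships. It assumes standard I2S framing (two slots per frame, even for mono data) and an MCLK that is a fixed multiple of the sample rate; 256 is a common default, but the real multiple depends on the driver and codec settings. The names `I2SClocks` and `expected_clocks` are made up for illustration and are not part of esphome or esp-idf.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class I2SClocks:
    lrclk_hz: int  # word-select (frame) clock, equal to the sample rate
    bclk_hz: int   # bit clock
    mclk_hz: int   # master clock fed to the codec


def expected_clocks(sample_rate_hz: int, bits_per_slot: int = 16,
                    slots_per_frame: int = 2,
                    mclk_multiple: int = 256) -> I2SClocks:
    """Return nominal clock frequencies for one sample rate.

    Assumption: BCLK = sample rate * bits per slot * slots per frame,
    and MCLK = sample rate * mclk_multiple.
    """
    bclk = sample_rate_hz * bits_per_slot * slots_per_frame
    mclk = sample_rate_hz * mclk_multiple
    return I2SClocks(lrclk_hz=sample_rate_hz, bclk_hz=bclk, mclk_hz=mclk)


# 16 kHz, 16-bit audio, as used by the microphone/speaker components.
print(expected_clocks(16000))
```

If the scope shows frequencies that are off from these by a factor of two, the slot width or channel count is a likely first thing to check.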

it's unclear to me why i2s_audio_microphone and i2s_audio_speaker are calling esp-idf i2s functions but i2s_audio_media_player is not. The media player library is handling it internally? how do these two components work together?

ssieb commented 1 year ago

The media player library is a little difficult that way. It's kind of a black box right now. We tell it what to play and it just does it. And yes, if devices need an MCLK signal, then that should be added to the i2s audio component as an optional parameter. Someone asked about that a while back, but the easier solution was to change the device setting to not require it. But he was doing the wiring, so that was easy to do.

jesserockz commented 1 year ago

it's unclear to me why i2s_audio_microphone and i2s_audio_speaker are calling esp-idf i2s functions but i2s_audio_media_player is not

This is because the Audio library handles the streaming, decoding and playing to i2s. It's not the best solution, but it was the easiest at the time given the timeframe I had. The weird thing is the library actually supports calling a function to give the i2s data to and not send it out, but it still requires to set up the i2s peripheral itself :facepalm:

kroimon commented 1 year ago

I mean, we could probably make changes to the Audio library, as it is already a modified fork. The question is how close to upstream you want it to be. I think the main benefit of using the library in the first place is its audio format decoders. The I2S stuff could be implemented in native esphome code to be able to better integrate different external codec chips.

kroimon commented 1 year ago

I spent some more time learning the inner workings of I2S and how the different components use it right now. The following is a list of findings and 'challenges' I ran into:

The main issue we have is that there is currently no central instance that controls the parameters of the I2S bus. The i2s_audio platform merely acts as a container for the pin configuration, but the actual calls to i2s_driver_install() and i2s_set_pin() are done in i2s_audio_microphone.cpp, i2s_audio_speaker.cpp and i2s_audio_media_player.cpp. In addition to that, the external ESP32-audioI2S component calls i2s_set_sample_rates depending on the media being played.

This makes it very hard to implement external ADCs and DACs whose configuration depends on the current clock speeds and sampling rates. Those audio codec components need a central instance to register for configuration change events so the new settings can be forwarded to the external controllers.
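To make the "central instance" idea concrete, here is a minimal sketch of a bus object that owns the clock settings and notifies registered codecs when they change. It is only an illustration of the pattern in Python, not esphome code; `BusConfig`, `I2SBus`, `add_listener` and `reconfigure` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class BusConfig:
    sample_rate_hz: int
    bits_per_sample: int
    mclk_hz: int


class I2SBus:
    """Hypothetical single owner of the I2S clock settings."""

    def __init__(self, config: BusConfig) -> None:
        self._config = config
        self._listeners: List[Callable[[BusConfig], None]] = []

    def add_listener(self, callback: Callable[[BusConfig], None]) -> None:
        # Register a codec and hand it the current settings right away.
        self._listeners.append(callback)
        callback(self._config)

    def reconfigure(self, config: BusConfig) -> None:
        # Only notify when something actually changed.
        if config == self._config:
            return
        self._config = config
        for callback in self._listeners:
            callback(config)


codec_log = []
bus = I2SBus(BusConfig(16000, 16, 4096000))
bus.add_listener(lambda cfg: codec_log.append(("es8311", cfg.sample_rate_hz)))
bus.reconfigure(BusConfig(44100, 16, 11289600))  # e.g. media player changes rate
print(codec_log)
```

In a real component the callback would write the codec's clock registers over I2C instead of appending to a list.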

With the current architecture, there is also no way for full-duplex operation of the same I2S port. The Mutex in the i2s_audio component only allows exclusive access to an I2S port. However, the ESP32-S3-BOX and ESP32-S3-Korvo-2 boards share the same I2S port (MCLK, SCLK, LRCK pins) for both audio input and output.

In general, full-duplex operation can only work if both input and output use the same clock parameters. The microphone and speaker components currently use fixed 16000 Hz sampling rates at 16 bits per sample. The media player switches the sampling rates based on the currently played files/streams. So I don't really see a way to use a media player together with a microphone and/or speaker component for a voice assistant right now, at least not at the same time. It might be possible to implement a priority-based switching logic that allows them to coexist.
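A priority-based switching logic could look roughly like the sketch below. It grants the port when it is idle or when the requester uses the same clock parameters as the current users (the full-duplex case), lets a higher-priority requester with different clocks preempt everyone else, and refuses otherwise. All names (`ClockParams`, `PortArbiter`) and the priority numbers are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ClockParams:
    sample_rate_hz: int
    bits_per_sample: int


class PortArbiter:
    """Hypothetical arbiter for one shared I2S port (not an esphome class)."""

    def __init__(self) -> None:
        # component name -> (priority, clock parameters it runs with)
        self._active: Dict[str, Tuple[int, ClockParams]] = {}

    def current_params(self) -> Optional[ClockParams]:
        # All active users share one set of parameters by construction.
        for _, params in self._active.values():
            return params
        return None

    def users(self) -> List[str]:
        return sorted(self._active)

    def request(self, name: str, params: ClockParams, priority: int) -> bool:
        current = self.current_params()
        if current is None or current == params:
            # Idle port, or identical clocks: sharing is possible.
            self._active[name] = (priority, params)
            return True
        if all(priority > p for p, _ in self._active.values()):
            # Different clocks but higher priority: preempt the others.
            self._active = {name: (priority, params)}
            return True
        return False

    def release(self, name: str) -> None:
        self._active.pop(name, None)


arbiter = PortArbiter()
voice = ClockParams(16000, 16)
music = ClockParams(44100, 16)
print(arbiter.request("microphone", voice, priority=10))   # granted
print(arbiter.request("speaker", voice, priority=10))      # granted, same clocks
print(arbiter.request("media_player", music, priority=5))  # refused
```

Giving the voice components the higher priority here is a design guess: it would pause music for a voice command rather than the other way round.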

ESP-IDF 5.0 introduced the concept of 'channels' in the new i2s driver which would make full-duplex operation a somewhat easier task. (For reference, the latest currently available version of arduino-esp32 2.0.9 is based on ESP-IDF 4.4.4).

In summary, I think we need a major refactoring of the i2s_audio platform and its microphone, speaker and media_player components:

nagyrobi commented 1 year ago

See how many ideas are out there for media player in ESPHome: https://github.com/esphome/feature-requests/labels/integration%3A%20media_player There's no other topic so hot imho...

kroimon commented 1 year ago

@rpatel3001 I found the full datasheet for the ES7210 here (Backup). Unfortunately I was still unable to locate the corresponding user guide, but this should be enough information to get it working, together with the existing implementations in esp-bsp and esp-adf. I feel like the esp-adf implementation is even more helpful as it shows all the bits and pieces required for mic selection.

I continued a bit on your work over in my branch, mostly formatting and cleanup for now.

guillempages commented 1 year ago

I made the "mistake" of trying to save some bucks and bought the ESP32-S3-Box-Lite instead of the full one. That one does not have touchscreen, but three additional buttons, and it has (apparently) an ST7789v display instead of an ILI9342C one.

For some (to me yet unexplained) reason, I can show things on the display by using the ILI9342C configuration from @kroimon (https://gist.github.com/kroimon/f6692879f9c00702990801ae9dfa433b); it just doesn't need the mirroring, but the colors are somehow offset (e.g. Red is (255, 255, 0), Green is (255, 0, 255) and Blue is (0, 255, 255); while White and Black would be the expected colors). I haven't managed to show anything useful using the standard st7789 component. Does anyone have an idea why this would be?

Is it worth it, to track the S3-Box-Lite support here as well, or would it be better to create a separate Feature Request? (Since most of the components would be the same anyway).

mattkasa commented 1 year ago

Seems like the peripherals of the ESP32-S3-Korvo-1 are really similar to ESP32-S3-BOX as well.

One main difference is the ES7210 is on a different I2S bus from the ES8311.

I have an ESP32-S3-Korvo-1 running this config and LED ring and buttons are working, audio not working at all yet so I'm not sure I have the two I2S buses configured correctly or maybe two i2s_audio buses aren't supported yet.

Waiting on an ESP32-S3-BOX to be able to do more testing, but the Korvo is currently in stock on Amazon for 50USD if anyone else is curious about it.

rpatel3001 commented 1 year ago

@guillempages I can add the Lite's display to the top post, but can't promise anyone will work on it as I don't have a Lite to play with. You'll probably get more visibility/help by creating a bug report for the st7789 component.

@mattkasa I think two I2S buses ought to work, but not totally sure. Does the codec work by itself if you comment out the ADC config? The current tip of the ES8311 PR sets the volume to 0, try an earlier commit or my es8311 branch for now.

mattkasa commented 1 year ago

@rpatel3001 I'm testing like this, but I have no idea how speaker.play: is supposed to look:

    on_press:
      - output.turn_on: pa_ctrl
      - speaker.play:
          id: external_speaker
          data: [64, 64, 0, 0, 128, 128, 0, 0, 64, 64, 0, 0, 128, 128, 0, 0, 64, 64, 0, 0, 128, 128, 0, 0, 64, 64, 0, 0, 128, 128, 0, 0, 64, 64, 0, 0, 128, 128, 0, 0]
      - output.turn_off: pa_ctrl

Not getting any audible sound, but logs look like:

[02:23:19][C][es8311:167]: ES8311 Audio Codec:
[02:23:19][C][es8311:168]:   Use MCLK: YES
[02:23:49][D][sensor:094]: 'button_adc': Sending state 1.63600 V with 2 decimals of accuracy
[02:23:49][D][binary_sensor:036]: 'Korvo 1 Play': Sending state ON
[02:23:49][D][esp-idf:000]: I (38072) I2S: DMA Malloc info, datalen=blocksize=4092, dma_buf_count=8

[02:23:49][D][esp-idf:000]: I (38074) I2S: I2S0, MCLK output by GPIO42

[02:23:50][D][esp-idf:000]: I (38239) I2S: DMA queue destroyed

So I wonder if it's just my speaker.play data :thinking:
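One quick sanity check on the data vector is how long it actually plays. The sketch below assumes the speaker treats the bytes as 16-bit mono PCM at 16 kHz (the fixed format mentioned elsewhere in this thread); `pcm_duration_ms` is a made-up helper name.

```python
def pcm_duration_ms(num_bytes: int, sample_rate_hz: int = 16000,
                    bytes_per_sample: int = 2, channels: int = 1) -> float:
    """Playback time in milliseconds for a raw PCM buffer."""
    frames = num_bytes / (bytes_per_sample * channels)
    return 1000.0 * frames / sample_rate_hz


# Same 40 bytes as in the speaker.play action above.
data = [64, 64, 0, 0, 128, 128, 0, 0] * 5
print(pcm_duration_ms(len(data)))  # 1.25
```

Under that assumption the vector lasts only about a millisecond, so hearing nothing (or at most a click) would be expected even if the bus were working.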

rpatel3001 commented 1 year ago

hm, I can't say about speaker.play, I've been using home assistant to send audio to the media_player component. Do you at least get clicks when the PA is muted/unmuted? Maybe try the media player component also, the I2S code is different.

mattkasa commented 1 year ago

Ah yeah, I'm using the esp-idf framework, so media_player isn't supported, my thinking has been to use esp-idf to make it easier to build a component that uses esp-sr for wakeword since it seems like that's probably where all of this is headed :)

edit: I tried building with arduino to test with media_player and it panics and boot loops, there is something the bootloader doesn't like, I'll keep looking at it to see if I can get it running with arduino.

rpatel3001 commented 1 year ago

I did some testing with i2s_audio_speaker and it seems to be partially working (on Arduino). With a much longer data vector (8k samples = half a second, a full second crashed the board when played) I mostly just hear clicks but occasionally the tone plays for a fraction of the duration. Interestingly the tone is twice the frequency it should be, which is maybe a clue about what's wrong.
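For reproducing this kind of test, here is a sketch that builds a half-second tone as 16-bit little-endian PCM and shows one possible cause of a doubled pitch: if a mono buffer is consumed as interleaved stereo, each channel only gets every other sample, which doubles the frequency and halves the duration. That explanation is a guess and is not confirmed anywhere in this thread; the byte order and the helper names are assumptions too.

```python
import math
import struct


def sine_pcm16(freq_hz: float, duration_s: float,
               sample_rate_hz: int = 16000, amplitude: float = 0.5) -> bytes:
    """Mono sine tone as little-endian signed 16-bit PCM bytes."""
    n = int(duration_s * sample_rate_hz)
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz))
        for i in range(n)
    ]
    return struct.pack("<%dh" % n, *samples)


def estimate_freq_hz(samples, sample_rate_hz: int = 16000) -> float:
    """Rough pitch estimate from the zero-crossing rate (2 per cycle)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings * sample_rate_hz / len(samples) / 2


pcm = sine_pcm16(440, 0.5)                       # 8000 samples, 16000 bytes
mono = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
one_channel = mono[0::2]                         # what "left" sees if read as stereo

print(len(pcm), estimate_freq_hz(mono), estimate_freq_hz(one_channel))
```

The byte string can be turned into a YAML list with `list(pcm)` for a speaker.play action, keeping in mind that a full second reportedly crashed the board.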

I also tried compiling a barebones config for esp-idf but it bootloops. Fixed the bootloop with

  platformio_options:
    board_build.flash_mode: dio

but then it just hangs after booting. Haven't found a fix for that, it does this even with the most recent esp-idf version/platform_version.

mattkasa commented 1 year ago

@rpatel3001 for esp-idf try:

esp32:
  board: esp32s3box
  framework:
    type: esp-idf
  variant: ESP32S3

I was able to get arduino working on the Korvo with this:

esp32:
  board: esp32-s3-devkitc-1
  variant: esp32s3
  framework:
    type: arduino

And media_player tries to work, but no sound, not even clicks, so I don't have something right with the I2S bus.

[05:36:40][D][media_player:059]: 'Korvo 1 Media Player' - Setting
[05:36:40][D][media_player:066]:   Media URL: https://homeassistant.local/api/tts_proxy/726c76553e1a3fdea29134f36e6af2ea05ec5cce_en-us_a877e2b3bf_tts.piper.wav

rpatel3001 commented 1 year ago

Adding the variant and/or changing the board didn't change anything unfortunately.

hamishfagg commented 1 year ago

The LilyGo T-Embed has the ES7210 as well, so this will be great for making tiny assistants :)

guillempages commented 1 year ago

@rpatel3001 I managed to get the colors working (more-or-less) on the ESPBox Lite, by hacking the code in the ILI9xxx display to force BGR byte order and invert display. If I get some time I'll try making a PR on ESPHome so that this can be configured in the yaml file, and then the ESPBox Lite display could be set to done :-)

guillempages commented 1 year ago

@rpatel3001 I've created two PRs to be able to use the displays out of the box: https://github.com/esphome/esphome/pull/4941 (for the Box-Lite) https://github.com/esphome/esphome/pull/4942 (for the Box)

Since I do not have a "full" Box; could you try using the ili9xxx display from the 4942 PR and see if the mirroring and colors work without the workaround?

rpatel3001 commented 1 year ago

sweet, I tried it out and it works. checklist updated.

rpatel3001 commented 1 year ago

I've been away from this for a while and will be for another week or so but I modified the i2s microphone to stream samples to matlab. The data clearly has some relationship to the actual audio in the environment, tested by FFTing/playing/plotting the streamed samples with silence and with test tones playing, but the sample rate doesn't match, it's quite noisy, and for many captures every other sample is +/- 32767 or has some DC offset, so there're several issues. I can drop the patch and matlab script here if anyone else is interested, but I won't be working on it for a little bit.

KTibow commented 1 year ago

Any updates on the speaker? Is this config correct? I don't hear any sound, not even clicks. Framework is arduino.

i2s_audio:
  i2s_lrclk_pin: GPIO47
  i2s_bclk_pin: GPIO17
  i2s_mclk_pin: GPIO2
media_player:
  - platform: i2s_audio
    name: Speaker
    dac_type: external
    i2s_dout_pin: GPIO15

rpatel3001 commented 1 year ago

You're missing

    mute_pin:
      number: GPIO46
      inverted: true
    mode: mono

KTibow commented 1 year ago

Now I get static when I play something. I don't hear any clicks, and when I'm not playing something there's no sound.

rpatel3001 commented 1 year ago

Do you also have:

external_components:
  - source: github://pr#4861
    components: [ es8311 ]

es8311:
  address: 0x18

KTibow commented 1 year ago

Added that, and now it can successfully buzz and click.

richlawson commented 1 year ago

After reviewing a number of threads and PRs, I have the following things working on my box:

I'm running into issues with the microphone and speakers, though. At this point, I'm not really sure how to test the mic. I tried to set it up with the voice assistant, but I don't have a good way to activate that right now.

I used some of the information above to try to get the speaker working. Based on some of the tasks left to provide support, I have a feeling that I'm probably trying to do too much at once in terms of having the mic, speaker, and media_player going all at once, given what's supported at this point.

When I open up the speaker and use a TTS like pico/piper to try to get it to produce sound, I do hear a popping noise.

Here's my current config:

esphome:
  name: box
  friendly_name: Box

esp32:
  board: esp32s3box
  framework:
    type: arduino

external_components:
  - source: github://pr#4793
    components: [ tt21100 ]
  - source: github://pr#4861
    components: [ es8311 ]

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "Box Fallback Hotspot"
    password: "<removed>"

# Enable Home Assistant API
api:
  encryption:
    key: "<removed>"

ota:
  password: "<removed>"

# Enable logging
logger:

time:
  - platform: sntp
    id: time_sntp

#time:
#  - platform: homeassistant
#    id: time_ha

output:
  - platform: ledc
    pin: GPIO45
    id: lcd_backlight
  - platform: gpio
    pin: GPIO46
    id: ns4150_ctrl

light:
  - platform: monochromatic
    output: lcd_backlight
    name: "LCD Backlight"
    restore_mode: ALWAYS_ON

spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

display:
  - platform: ili9xxx
    model: S3BOX
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48
    id: lcd
    # Width = 320, Height = 240
    lambda: |-
      it.fill(Color::WHITE);
      auto red = Color(255, 0, 0);
      auto green = Color(0, 255, 0);
      auto blue = Color(0, 0, 255);
      it.filled_rectangle(10, 170, 60, 60, red);
      it.filled_rectangle(130, 170, 60, 60, green);
      it.filled_rectangle(250, 170, 60, 60, blue);
      it.strftime(160, 85, id(font_time), Color::BLACK, TextAlign::CENTER, "%H:%M", id(time_sntp).now());
      if (id(muted).state) {
        it.print(310, 10, id(font_small), red, TextAlign::TOP_RIGHT, "Muted");
      }

font:
  - file: "gfonts://Roboto@900"
    id: font_time
    size: 80
    glyphs: "0123456789:"
  - file: "gfonts://Roboto"
    id: font_small
    size: 20

i2c:
  scl: GPIO18
  sda: GPIO8
  scan: true

touchscreen:
  - platform: tt21100
    address: 0x24
    interrupt_pin: GPIO3
    # Don't use as the reset pin is shared with the display, so the ili9xxx will perform the reset
    #reset_pin: GPIO48

binary_sensor:
  - platform: gpio
    pin:
      number: GPIO0
      mode: INPUT_PULLUP
    id: settings
    name: "Settings"
  - platform: gpio
    pin:
      number: GPIO1
      inverted: true
    id: muted
    name: "Muted"
  - platform: tt21100
    name: "Home"
    index: 0
  - platform: touchscreen
    name: "Red"
    x_min: 10
    x_max: 70
    y_min: 170
    y_max: 230
  - platform: touchscreen
    name: "Green"
    x_min: 130
    x_max: 190
    y_min: 170
    y_max: 230
  - platform: touchscreen
    name: "Blue"
    x_min: 250
    x_max: 310
    y_min: 170
    y_max: 230

i2s_audio:
  i2s_lrclk_pin: GPIO47
  i2s_bclk_pin: GPIO17
  i2s_mclk_pin: GPIO2

es8311:
  address: 0x18

bluetooth_proxy:

voice_assistant:
  microphone: mic
  speaker: audio

button:
  - platform: restart
    name: "Restart Device"

text_sensor:
  - platform: wifi_info
    ip_address:
      name: IP Address
    ssid:
      name: Connected SSID
    bssid:
      name: Connected BSSID
    mac_address:
      name: Mac Wifi Address
    scan_results:
      name: Latest Scan Results

sensor:
  - platform: wifi_signal
    name: "WiFi Signal Sensor"
    update_interval: 60s

  - platform: wifi_signal
    name: "WiFi Signal dB"
    id: wifi_signal_db
    update_interval: 60s
    entity_category: "diagnostic"

microphone:
  - platform: i2s_audio
    id: mic
    adc_type: external
    pdm: false
    i2s_din_pin: GPIO16

speaker:
  - platform: i2s_audio
    id: audio
    dac_type: external
    i2s_dout_pin: GPIO15
    mode: mono

media_player:
  - platform: i2s_audio
    name: Speaker
    dac_type: external
    i2s_dout_pin: GPIO15
    mute_pin:
      number: GPIO46
      inverted: true
    mode: mono

# i2c device at address 0x18 - ES8311 Audio Codec
# i2c device at address 0x24 - TT21100 Touchscreen
# i2c device at address 0x40 - ES7210 Mic ADC
# i2c device at address 0x68 - ICM-42607-P IMU 

captive_portal:

Does anyone have any ideas?

rpatel3001 commented 1 year ago

Speaker and Media Player might be mutually exclusive. I have on_press actions for the settings button to start and stop the mic, you can also start and stop the voice assistant that way. Try media player without speaker and see if you can play piper or web radio from home assistant.

rpatel3001 commented 1 year ago

I think maybe the ADC has been working ok this whole time? Using this config to stream samples to matlab, the audio comes through fine. Whisper is still failing though, with the same sort of error:

ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-50' coro=<AsyncEventHandler.run() done, defined at /usr/local/lib/python3.9/dist-packages/wyoming/server.py:26> exception=ValueError("can't extend empty axis 0 using modes other than 'constant' or 'empty'")>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/wyoming/server.py", line 32, in run
    if not (await self.handle_event(event)):
  File "/usr/local/lib/python3.9/dist-packages/wyoming_faster_whisper/handler.py", line 61, in handle_event
    segments, _info = self.model.transcribe(
  File "/usr/local/lib/python3.9/dist-packages/wyoming_faster_whisper/faster_whisper/transcribe.py", line 124, in transcribe
    features = self.feature_extractor(audio)
  File "/usr/local/lib/python3.9/dist-packages/wyoming_faster_whisper/faster_whisper/feature_extractor.py", line 152, in __call__
    frames = self.fram_wave(waveform)
  File "/usr/local/lib/python3.9/dist-packages/wyoming_faster_whisper/faster_whisper/feature_extractor.py", line 98, in fram_wave
    frame = np.pad(frame, pad_width=padd_width, mode="reflect")
  File "<__array_function__ internals>", line 200, in pad
  File "/usr/local/lib/python3.9/dist-packages/numpy/lib/arraypad.py", line 815, in pad
    raise ValueError(
ValueError: can't extend empty axis 0 using modes other than 'constant' or 'empty'

Seems like the audio format is not quite what's expected by whisper.

Matlab script:

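clear;
t = tcpserver(6666);            % wait for the TCP sample stream on port 6666
z = zeros(32000,1);             % rolling plot buffer (2 s at 16 kHz)
while 1
    % each sample appears to be 4 hex digits plus a comma (5 bytes);
    % process the stream in chunks of 1600 samples
    if t.NumBytesAvailable > 5*1600
        x = char(read(t, 5*1600));
        y = textscan(x,'%xs16','Delimiter',',');  % parse as signed 16-bit hex
        a = double(y{1})/2^15;                    % scale to [-1, 1)
        z = [z(1601:end); a];                     % slide the window by one chunk
        plot(z);
        drawnow;
        sound(a,16000);                           % play the chunk at 16 kHz
    end
end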
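
For anyone without matlab, the parsing step can be sketched in Python. The wire format here is inferred from the script above (4 hex digits per sample in two's complement, comma separated) and not checked against the mictest branch, so treat it as an assumption; `parse_hex_samples` is a made-up name.

```python
from typing import List


def parse_hex_samples(chunk: bytes) -> List[float]:
    """Decode comma-separated 16-bit hex words into floats in [-1.0, 1.0)."""
    samples = []
    for field in chunk.decode("ascii").split(","):
        if not field:
            continue  # skip the empty field after a trailing comma
        value = int(field, 16)
        if value >= 0x8000:
            value -= 0x10000  # two's complement: map to a negative value
        samples.append(value / 2 ** 15)
    return samples


print(parse_hex_samples(b"0000,7fff,8000,ffff,"))
```

The resulting floats can be written to a WAV file or plotted, which should also show the alternating full-scale samples or DC offset if they are present in the stream.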

richlawson commented 1 year ago

Thanks for the suggestions. This has been my first attempt at customizing anything with ESPHome.

I didn't realize that voice_assistant could use speaker or media_player earlier, which is why I was concerned about removing the speaker earlier today. Reviewing the docs helped with that: https://esphome.io/components/voice_assistant.html

So I removed speaker and updated voice_assistant to this:

voice_assistant:
  microphone: mic
  media_player: audio

Then I added id: audio to my media_player config.

Next, I updated my settings button like you recommended:

  - platform: gpio
    pin:
      number: GPIO0
      mode: INPUT_PULLUP
    id: settings
    name: "Settings"
    on_press:
      - if:
          condition: voice_assistant.is_running
          then:
            - voice_assistant.stop:
          else:
            - voice_assistant.start_continuous:

Everything compiled without issues, and the settings button did activate the voice_assist pipeline, but it just timed out each time for a few tries; [E][voice_assistant:231]: Error: pipeline-timeout - Pipeline timeout

I also tried piper with the media_player, and that gave me the same popping noise.

For the mic, I remembered what you said about turning the microphone on/off with your settings button, so I tried this out based on https://esphome.io/components/microphone/index.html:

  - platform: gpio
    pin:
      number: GPIO0
      mode: INPUT_PULLUP
    id: settings
    name: "Settings"
    on_press:
      - if:
          condition: voice_assistant.is_running
          then:
            - microphone.stop_capture:
            - voice_assistant.stop:
          else:
            - microphone.capture:
            - voice_assistant.start_continuous:

That didn't help, though, and I started getting things in the logs that I hadn't seen before, like ERROR Serial port closed! and this:

[18:19:43][03mD[iaysno06:'eig: SndigsaeOF[m
[18:19:43]\0330;6[]bnr_esr06:'etns:Sedn tt 0
[18:19:43][D[ie_assat14] Sinln tp.[m
[18:19:46][03mD[iar_eso:3] 'etn' enigsateOF[m
[18:19:46][0;6[]baysno:3] Stig' edn tt N[m
[18:19:46]D[oc_sitn:3] eusigsat.\0330
[18:19:46]\0330;6D[oc_sitn:1] trtin..[m
[18:20:02]\03313mE[oc_sitn:3] ro:ppelietmot-Ppln ieu\0330[D][ocassat14:Sgan tp.\0330[03mD[iaysno:3] Stig' dn tt F\0330\0330;6[]bnr_esr06:'etns:SnigsaeO\033m
[18:20:02]]vieassat12:Rqetn tr..0\0330;3mD[oc_sitn:1] tri..[m
[18:20:02]]vieassat14AsitPpln ung[m

I'm going to remove the microphone capture/stop capture as a next step, since that didn't seem to help.

Any other ideas on the mic and speaker?

richlawson commented 1 year ago

@rpatel3001, I just saw your post. I'm taking a look at that and your config.

For me, I was thinking about adding this to microphone to at least see if anything is coming through via the log:

on_data:
      - logger.log:
          format: "Received %d bytes"
          args: ['x.size()']

For the speaker, I hadn't heard of web radio, but I found a radio browser integration that I just added.

rpatel3001 commented 1 year ago

So the speaker component can't play anything except what the voice_assistant sends back, web radio and TTS will only work with media_player.

I don't know if voice_assistant will work with media_player, my understanding was that it needs the speaker component.

Also, you'll see in my config that I have the button activating the voice assistant only while pressed, and on_release ends the capture. Probably will avoid timing out that way.

The current state of this is that everything individually seems to be working, but something about the microphone and es7210 is not configured to pass data in the way whisper expects for STT, so voice_assistant doesn't actually do anything yet.

richlawson commented 1 year ago

Thanks again!

So the speaker component can't play anything except what the voice_assistant sends back, web radio and TTS will only work with media_player.

Got it. I'm just not sure why I can't get the tts to work now that I'm using media_player. I'll try that radio integration I found.

I don't know if voice_assistant will work with media_player, my understanding was that it needs the speaker component.

It looks like it should based on the docs: media_player (Optional, ID): The media_player to use to output the response. Cannot be used with speaker above.

Also, you'll see in my config that I have the button activating the voice assistant only while pressed, and on_release ends the capture. Probably will avoid timing out that way.

Yeah, that's probably a good idea.

I also noticed that you're doing a few other things in your config that are different:

  - source: github://rpatel3001/esphome@es7210
    components: [ es7210 ]
  - source: github://rpatel3001/esphome@mictest
    components: [ i2s_audio ]
...
es7210:
  address: 0x40

I haven't taken a look at those on your GitHub yet, though.

rpatel3001 commented 1 year ago

the es7210 component is needed to set up the ADC chip over I2C, same as the es8311 component does for the DAC.

The media_player option for voice_assistant must be new, that's very cool. My mictest branch of i2s_audio just adds streaming samples over TCP, you don't need that unless you want to analyze raw ADC samples, with the above matlab script or another tool.

KTibow commented 1 year ago

I'm a bit confused. Is it possible to get the speaker to make noise beyond clicking yet?

rpatel3001 commented 1 year ago

@KTibow yes you should be able to play media from home assistant if you use the media_player component.

KTibow commented 1 year ago

I can play media, but it just makes popping sounds.

Config
external_components:
  - source: "github://pr#4793"
    components: [ tt21100 ]
  - source: "github://pr#4861"
    components: [ es8311 ]
i2c:
  scl: GPIO18
  sda: GPIO8
i2s_audio:
  i2s_lrclk_pin: GPIO47
  i2s_bclk_pin: GPIO17
  i2s_mclk_pin: GPIO2
es8311:
  address: 0x18
media_player:
  - platform: i2s_audio
    name: Speaker
    dac_type: external
    i2s_dout_pin: GPIO15
    mute_pin:
      number: GPIO46
      inverted: true

rpatel3001 commented 1 year ago

I believe that config should do just fine. My config is linked here and I can play web radio and TTS.

KTibow commented 1 year ago

After changing some stuff to more closely match your config, the speaker works!

rpatel3001 commented 1 year ago

Back to getting voice commands working, I inserted some prints into the whisper container's python and it seems like it's receiving audio-start and audio-stop events from wyoming but no audio-chunk events, so no audio samples making it from esphome to whisper. This seems most likely to be a problem in home assistant or the voice_assistant component in esphome.

edit: async_process_audio_stream in homeassistant/components/wyoming/stt.py is receiving a stream object that has no contents, but the metadata seems correct. Tracking down where this comes from and if this is an issue with HA or esphome.

richlawson commented 1 year ago

I'm not sure what to do about the speaker. I keep getting the popping noise but nothing else. I completely replaced my configuration with this, other than putting in my own ap/ota/api passwords/key: https://gist.github.com/rpatel3001/ffd160577b96585fda144b786d789f46

That includes removing mode: mono.

In the logs, I'm seeing this:

[00:44:26][D][media_player:066]:   Media URL: http://home-assistant.local:8123/api/tts_proxy/32b7cbdc35a6c367b425528d61d48e8570a81c95_en-us_22597d2fbc_tts.piper.wav
[00:44:26][727290][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[00:44:27][728202][E][WiFiClient.cpp:268] connect(): socket error on fd 52, errno: 104, "Connection reset by peer"
[00:44:27][728375][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[00:44:27][728424][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.

It seems like it might be related to this: https://github.com/esphome/issues/issues/4088

The radio isn't working for me, either:

[00:46:00][821533][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[00:46:01][821796][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[00:46:01][822125][V][ssl_client.cpp:62] start_ssl_client(): Free internal heap before TLS 233112
[00:46:01][822125][V][ssl_client.cpp:68] start_ssl_client(): Starting socket
[00:46:01][822132][V][ssl_client.cpp:149] start_ssl_client(): Seeding the random number generator
[00:46:01][822137][V][ssl_client.cpp:158] start_ssl_client(): Setting up the SSL/TLS structure...
[00:46:01][822143][D][ssl_client.cpp:179] start_ssl_client(): WARNING: Skipping SSL Verification. INSECURE!
[00:46:01][822152][V][ssl_client.cpp:257] start_ssl_client(): Setting hostname for TLS session...
[00:46:01][822159][V][ssl_client.cpp:272] start_ssl_client(): Performing the SSL/TLS handshake...
[00:46:01][822532][V][ssl_client.cpp:293] start_ssl_client(): Verifying peer X.509 certificate...
[00:46:01][822533][V][ssl_client.cpp:301] start_ssl_client(): Certificate verified.
[00:46:01][822536][V][ssl_client.cpp:316] start_ssl_client(): Free internal heap after TLS 195720
[00:46:01][822543][V][ssl_client.cpp:369] send_ssl_data(): Writing HTTP request with 175 bytes...
[00:46:01][822570][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.
[00:46:01][822637][V][ssl_client.cpp:324] stop_ssl_socket(): Cleaning SSL connection.

This might be related to that: https://github.com/esphome/issues/issues/4369

Does ESPHome require HTTPS for media/tts?