Wake word still does not respond after a while

itnassol commented 3 months ago

The problem

At first the response is perfect: I can say the wake word and it responds very quickly. However, if I don't use it for a while (from my testing, about 15 minutes), it's as if it has gone to sleep. I then have to tap the speaker to "wake it up", and it's then good for another 15 minutes.

Which version of ESPHome has the issue?

2024.6.2

What type of installation are you using?

Home Assistant Add-on

Which version of Home Assistant has the issue?

2024.6.4

What platform are you using?

ESP32

Board

onju-voice

Component causing the issue

No response

Example YAML snippet

substitutions:
  name: "dr-ada"
  friendly_name: "DR Ada"
  wifi_ap_password: "password"

esphome:
  name: ${name}
  friendly_name: ${friendly_name}
  name_add_mac_suffix: false
  min_version: 2023.10.1
  on_boot:
    then:
      - light.turn_on:
          id: top_led
          effect: slow_pulse
          red: 100%
          green: 60%
          blue: 0%
      - wait_until:
          condition:
            wifi.connected:
      - light.turn_on:
          id: top_led
          effect: pulse
          red: 0%
          green: 100%
          blue: 0%
      - wait_until:
          condition:
            api.connected:
      - light.turn_on:
          id: top_led
          effect: none
          red: 0%
          green: 100%
          blue: 0%
      - delay: 1s
      - script.execute: reset_led

esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: arduino

logger:
api:
  encryption:
    key: "xxxxxxxx"
  services:
    - service: start_va
      then:
        - voice_assistant.start
    - service: stop_va
      then:
        - voice_assistant.stop

ota:
  platform: esphome

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    password: "${wifi_ap_password}"
  output_power: 8.5dB

captive_portal:

globals:
  - id: thresh_percent
    type: float
    initial_value: "0.03"
    restore_value: false
  - id: touch_calibration_values_left
    type: uint32_t[5]
    restore_value: false
  - id: touch_calibration_values_center
    type: uint32_t[5]
    restore_value: false
  - id: touch_calibration_values_right
    type: uint32_t[5]
    restore_value: false

interval:
  - interval: 1s
    then:
      - script.execute:
          id: calibrate_touch
          button: 0
      - script.execute:
          id: calibrate_touch
          button: 1
      - script.execute:
          id: calibrate_touch
          button: 2

i2s_audio:
  - i2s_lrclk_pin: GPIO13
    i2s_bclk_pin: GPIO18

media_player:
  - platform: i2s_audio
    name: None
    id: onju_out
    dac_type: external
    i2s_dout_pin: GPIO12
    mode: mono
    mute_pin:
      number: GPIO21
      inverted: True

######
# speaker:
#   - platform: i2s_audio
#     id: onju_out
#     dac_type: external
#     i2s_dout_pin: GPIO12
#     mode: stereo
######

microphone:
  - platform: i2s_audio
    id: onju_microphone
    i2s_din_pin: GPIO17
    adc_type: external
    pdm: false

voice_assistant:
  id: va
  microphone: onju_microphone
  media_player: onju_out
######
  # speaker: onju_out
######
  use_wake_word: true
  on_start:
    - light.turn_on:
        id: top_led
        blue: 100%
        red: 0%
        green: 0%
        effect: none
  on_listening:
    - light.turn_on:
        id: top_led
        blue: 100%
        red: 0%
        green: 0%
        brightness: 100%
        effect: pulse
  on_tts_end:
    - media_player.play_media: !lambda return x;
    - light.turn_on:
        id: top_led
        blue: 0%
        red: 20%
        green: 100%
        effect: pulse
  on_end:
    - delay: 100ms
    - wait_until:
        not:
          media_player.is_playing: onju_out
    - script.execute: reset_led
  on_client_connected:
    - if:
        condition:
          and:
            - switch.is_on: use_wake_word
            - binary_sensor.is_off: mute_switch
        then:
          - voice_assistant.start_continuous:
  on_client_disconnected:
    - if:
        condition:
          and:
            - switch.is_on: use_wake_word
            - binary_sensor.is_off: mute_switch
        then:
          - voice_assistant.stop:
  on_error:
    - light.turn_on:
        id: top_led
        blue: 0%
        red: 100%
        green: 0%
    - delay: 1s
    - script.execute: reset_led

number:
  - platform: template
    name: "Touch threshold percentage"
    id: touch_threshold_percentage
    update_interval: never
    entity_category: config
    initial_value: 1.25
    min_value: -1
    max_value: 5
    step: 0.25
    optimistic: true
    on_value:
      then:
        - lambda: !lambda |-
            id(thresh_percent) = 0.01 * x;

esp32_touch:
  setup_mode: false
  sleep_duration: 2ms 
  measurement_duration: 800us
  low_voltage_reference: 0.8V
  high_voltage_reference: 2.4V

  filter_mode: IIR_16
  debounce_count: 2
  noise_threshold: 0
  jitter_step: 0
  smooth_mode: IIR_2

  denoise_grade: BIT8
  denoise_cap_level: L0

binary_sensor:
  - platform: esp32_touch
    id: volume_down
    pin: GPIO4
    threshold: 539000 # 533156-551132
    on_press: 
      then:
        - light.turn_on: left_led
        - script.execute:
            id: set_volume
            volume: -0.05
        - delay: 1s
        - while:
            condition:
              binary_sensor.is_on: volume_down
            then:
              - script.execute:
                  id: set_volume
                  volume: -0.05
              - delay: 150ms
    on_release: 
      then:
        - light.turn_off: left_led

  - platform: esp32_touch
    id: volume_up
    pin: GPIO2
    threshold: 580000 # 575735-593064
    on_press: 
      then:
        - light.turn_on: right_led
        - script.execute:
            id: set_volume
            volume: 0.05
        - delay: 1s
        - while:
            condition:
              binary_sensor.is_on: volume_up
            then:
              - script.execute:
                  id: set_volume
                  volume: 0.05
              - delay: 150ms
    on_release: 
      then:
        - light.turn_off: right_led

  - platform: esp32_touch
    id: action
    pin: GPIO3
    threshold: 751000 # 745618-767100
    on_click:
      - if:
          condition:
            or:
              - switch.is_off: use_wake_word
              - binary_sensor.is_on: mute_switch
          then:
            - if:
                condition: voice_assistant.is_running
                then:
                  - voice_assistant.stop:
                  - script.execute: reset_led
                else:
                  - voice_assistant.start:
          else:
            - voice_assistant.stop
            - delay: 1s
            - script.execute: reset_led
            - script.wait: reset_led
            - voice_assistant.start_continuous:

  - platform: gpio
    id: mute_switch
    pin:
      number: GPIO38
      mode: INPUT_PULLUP
    name: Disable wake word
    on_press:
      - script.execute: turn_on_wake_word
    on_release:
      - script.execute: turn_off_wake_word

  - platform: status
    id: api_connection
    filters:
      - delayed_on: 1s
    on_press:
      - if:
          condition:
            and:
              - switch.is_on: use_wake_word
              - binary_sensor.is_off: mute_switch
          then:
            - voice_assistant.start_continuous:
    on_release:
      - if:
          condition:
            and:
              - switch.is_on: use_wake_word
              - binary_sensor.is_off: mute_switch
          then:
            - voice_assistant.stop:

light:
  - platform: esp32_rmt_led_strip
    id: leds
    pin: GPIO11
    chipset: SK6812
    num_leds: 6
    rgb_order: grb
    rmt_channel: 0
    default_transition_length: 0s
    gamma_correct: 2.8
  - platform: partition
    id: left_led
    segments:
      - id: leds
        from: 0
        to: 0
  - platform: partition
    id: top_led
    segments:
      - id: leds
        from: 1
        to: 4
    effects:
      - pulse:
          name: pulse
          transition_length: 250ms
          update_interval: 250ms
      - pulse:
          name: slow_pulse
          transition_length: 1s
          update_interval: 2s
      - addressable_lambda: 
          name: show_volume
          update_interval: 50ms
          lambda: |-
            int int_volume = int(id(onju_out).volume * 100.0f * it.size());
            int full_leds = int_volume / 100;
            int last_brightness = int_volume % 100;
            int i = 0;
            for(; i < full_leds; i++) {
              it[i] = Color::WHITE;
            }
            if(i < 4) {
              it[i++] = Color(0,0,0).fade_to_white(last_brightness*256/100);
            }
            for(; i < it.size(); i++) {
              it[i] = Color::BLACK;
            }
  - platform: partition
    id: right_led
    segments:
      - id: leds
        from: 5
        to: 5

script:
  - id: reset_led
    then:
      - if:
          condition:
            and:
              - switch.is_on: use_wake_word
              - binary_sensor.is_off: mute_switch
          then:
            - light.turn_on:
                id: top_led
                blue: 100%
                red: 100%
                green: 0%
                brightness: 100%
                effect: none
          else:
            - light.turn_off: top_led

  - id: set_volume
    mode: restart
    parameters:
      volume: float
    then:
      - media_player.volume_set:
          id: onju_out
          volume: !lambda return clamp(id(onju_out).volume+volume, 0.0f, 1.0f);
      - light.turn_on:
          id: top_led
          effect: show_volume
      - delay: 1s
      - script.execute: reset_led

  - id: turn_on_wake_word
    then:
      - if:
          condition:
            and:
              - binary_sensor.is_off: mute_switch
              - switch.is_on: use_wake_word
          then:
            - lambda: id(va).set_use_wake_word(true);
            - if:
                condition:
                  not:
                    - voice_assistant.is_running
                then:
                  - voice_assistant.start_continuous
            - script.execute: reset_led

  - id: turn_off_wake_word
    then:
      - voice_assistant.stop
      - lambda: id(va).set_use_wake_word(false);
      - script.execute: reset_led

  - id: calibrate_touch
    parameters:
      button: int
    then:
      - lambda: |-
          static byte thresh_indices[3] = {0, 0, 0};
          static uint32_t sums[3] = {0, 0, 0};
          static byte qsizes[3] = {0, 0, 0};
          static int consecutive_anomalies_per_button[3] = {0, 0, 0};

          uint32_t newval;
          uint32_t* calibration_values;
          switch(button) {
            case 0:
              newval = id(volume_down).get_value();
              calibration_values = id(touch_calibration_values_left);
              break;
            case 1:
              newval = id(action).get_value();
              calibration_values = id(touch_calibration_values_center);
              break;
            case 2:
              newval = id(volume_up).get_value();
              calibration_values = id(touch_calibration_values_right);
              break;
            default:
              ESP_LOGE("touch_calibration", "Invalid button ID (%d)", button);
              return;
          }

          if(newval == 0) return;

          //ESP_LOGD("touch_calibration", "[%d] qsize %d, sum %d, thresh_index %d, consecutive_anomalies %d", button, qsizes[button], sums[button], thresh_indices[button], consecutive_anomalies_per_button[button]);
          //ESP_LOGD("touch_calibration", "[%d] New value is %d", button, newval);

          if(qsizes[button] == 5) {
            float avg = float(sums[button])/float(qsizes[button]);
            if((fabs(float(newval)-avg)/avg) > id(thresh_percent)) {
              consecutive_anomalies_per_button[button]++;
              //ESP_LOGD("touch_calibration", "[%d] %d anomalies detected.", button, consecutive_anomalies_per_button[button]);
              if(consecutive_anomalies_per_button[button] < 10)
                return;
            } 
          }

          //ESP_LOGD("touch_calibration", "[%d] Resetting consecutive anomalies counter.", button);
          consecutive_anomalies_per_button[button] = 0;

          if(qsizes[button] == 5) {
            //ESP_LOGD("touch_calibration", "[%d] Queue full, removing %d.", button, id(touch_calibration_values)[thresh_indices[button]]);
            sums[button] -= (uint32_t) *(calibration_values+thresh_indices[button]);// id(touch_calibration_values)[thresh_indices[button]];
            qsizes[button]--;
          }
          *(calibration_values+thresh_indices[button]) = newval;
          sums[button] += newval;
          qsizes[button]++;
          thresh_indices[button] = (thresh_indices[button] + 1) % 5;

          //ESP_LOGD("touch_calibration", "[%d] Average value is %d", button, sums[button]/qsizes[button]);
          uint32_t newthresh = uint32_t((sums[button]/qsizes[button]) * (1.0 + id(thresh_percent)));
          //ESP_LOGD("touch_calibration", "[%d] Setting threshold %d", button, newthresh);

          switch(button) {
            case 0:
              id(volume_down).set_threshold(newthresh);
              break;
            case 1:
              id(action).set_threshold(newthresh);
              break;
            case 2:
              id(volume_up).set_threshold(newthresh);
              break;
            default:
              ESP_LOGE("touch_calibration", "Invalid button ID (%d)", button);
              return;
          }

switch:
  - platform: template
    name: Use Wake Word
    id: use_wake_word
    optimistic: true
    restore_mode: RESTORE_DEFAULT_ON
    on_turn_on:
      - script.execute: turn_on_wake_word
    on_turn_off:
      - script.execute: turn_off_wake_word

  - platform: template
    name: Listen LR
    id: listen_lr
    optimistic: true
    on_turn_on:
      - switch.turn_off: use_wake_word
      - delay: 1s
      - voice_assistant.start_continuous
    on_turn_off:
      - switch.turn_on: use_wake_word

Anything in the logs that might be useful for us?

No response

Additional information

No response

darki73 commented 3 months ago

From my observations, this is an issue with the Arduino framework.

Whenever I use a simplified configuration with esp-idf and just speaker + microphone + voice_assistant + micro_wake_word, it works every single time, even from 3 to 7 meters away from the speaker.

It turns out (and correct me if I am wrong) that the Arduino framework is only capable of utilizing one core, and all logic runs on the main thread, which is why we are unable to use micro_wake_word with the Arduino framework.

Observing the traffic:

The satellite sends traffic to Home Assistant.
Home Assistant sends traffic to the wake word detection instance (I run it on a separate VM).
The result comes back, usually with an error that no wake word was detected.
Either the satellite or Home Assistant gets overwhelmed with information.

A closer look at the pipeline shows that it triggers wake word detection every 0.5 to 2 seconds whenever any source of voice or audio is around.

Sadly, when using an ESP32-S3 module (I designed a custom PCB for it), you have to either rely on Arduino to behave (you can stop and start voice_assistant every 5 to 10 minutes, and it is still somewhat broken), or forget about the media_player component (I really wanted it, as having a whole-house audio system is kind of awesome) and just use your 55 USD device (price per board + speaker + LEDs + mic + 3D-printed enclosure) as a simple voice assistant that works every single time no matter where you are in the room.
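
The periodic stop/start workaround mentioned above could be sketched roughly as follows. This is only an illustration, not a tested fix: it assumes the `use_wake_word`, `mute_switch` and `onju_out` IDs from the config in the first post, and the 10-minute interval and 1-second delay are arbitrary choices.

```yaml
# Hypothetical sketch: restart continuous listening every 10 minutes.
# Assumes the IDs use_wake_word, mute_switch and onju_out from the first post.
# If the config already has an `interval:` section, add this entry to that list
# instead of declaring `interval:` a second time.
interval:
  - interval: 10min
    then:
      - if:
          condition:
            and:
              - switch.is_on: use_wake_word
              - binary_sensor.is_off: mute_switch
              # Skip the restart while audio is playing.
              - not:
                  media_player.is_playing: onju_out
          then:
            - voice_assistant.stop:
            - delay: 1s
            - voice_assistant.start_continuous:
```

Note that a restart like this would also cut off a voice command that happens to be in progress when the interval fires, so it is a blunt instrument at best.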

P.S. Yet another "fun" quirk of the Arduino framework in this case calls for the following piece of code:

media_player:
  - platform: i2s_audio
    id: "i2s_player"
    name: "${device_friendly_name} Media Player"
    dac_type: external
    i2s_audio_id: i2s_out
    i2s_dout_pin: GPIO17
    mode: mono
    on_play:
      - switch.turn_off: use_wake_word
    on_pause:
      - switch.turn_on: use_wake_word
    on_idle:
      - switch.turn_on: use_wake_word

If you are running voice_assistant in continuous mode with the Arduino framework, anything you play through the speaker will sound chopped into a million pieces. The config above kind of solves the problem.

P.P.S. I am aware of https://github.com/gnumpi/esphome_audio for media_player support on esp-idf, but volume control is broken and it crashes constantly, so there is that.

itnassol commented 3 months ago

Hi Ivan,

Just brilliant. Although this is way above my head, it has been interesting to delve deeper into this; thank you for your time. For the moment I have found a few workarounds to make it a little more compatible with what I am doing, and everything seems to be working. As I have 5 converted Google minis around the house (at the moment), with more to follow, it dawned on me to only trigger the wake word (A) when the room is occupied and (B) after it has been dormant for a few minutes.

So, each speaker is essentially in sleep mode until someone is in the room. I do this using ESP presence; as it is an old Victorian house with thick walls, setting up ESP presence in each separate room is fairly simple. It sets the speaker to listen when someone walks in, which allows me to just say a command as soon as I walk in, like "Lights on" for example. Then it goes to wake word. If the wake word is not used for 15 minutes, I reload the integration and it's fine again. The only exception is when the system is asking me something. For example, when it starts to get dark, the wake word turns off and the system asks whether I want evening mode: if I say "Yes please" it triggers evening mode, and if I say "No thanks" it just goes back to wake word.
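
For anyone wanting to try something similar, the presence-based workaround described above might look roughly like this as Home Assistant automations. This is a guess at the shape of it, not the actual setup: every entity ID below is a placeholder, and the "assist in progress" sensor used to detect 15 idle minutes is an assumption about what the device exposes.

```yaml
# Hypothetical Home Assistant automations; all entity IDs are placeholders.
automation:
  - alias: "Enable wake word when the room becomes occupied"
    trigger:
      - platform: state
        entity_id: binary_sensor.living_room_occupancy  # placeholder presence sensor
        to: "on"
    action:
      - service: switch.turn_on
        target:
          entity_id: switch.dr_ada_use_wake_word  # placeholder for the "Use Wake Word" switch

  - alias: "Reload the speaker after 15 idle minutes"
    trigger:
      - platform: state
        entity_id: binary_sensor.dr_ada_assist_in_progress  # assumed idle indicator
        to: "off"
        for: "00:15:00"
    action:
      # Reloads the config entry that owns the targeted entity.
      - service: homeassistant.reload_config_entry
        target:
          entity_id: switch.dr_ada_use_wake_word
```

Targeting `homeassistant.reload_config_entry` at one entity of the speaker reloads only that device's integration entry rather than everything.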

A bit Heath Robinson, but it works, and people are a little surprised when the system asks me if I want something done, lol...

Thanks again.

darki73 commented 3 months ago

@itnassol you might want to give https://github.com/gnumpi/esphome_audio a shot with the esp-idf framework, using the following options:

framework:
    type: esp-idf
    version: recommended
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"
      CONFIG_ESP32_S3_BOX_BOARD: "y"
      CONFIG_COMPILER_OPTIMIZATION_SIZE: "y"

      CONFIG_ESP32_WIFI_STATIC_RX_BUFFER_NUM: "16"
      CONFIG_ESP32_WIFI_DYNAMIC_RX_BUFFER_NUM: "512"
      CONFIG_TCPIP_RECVMBOX_SIZE: "512"
      CONFIG_TCP_SND_BUF_DEFAULT: "65535"
      CONFIG_TCP_WND_DEFAULT: "512000"
      CONFIG_TCP_RECVMBOX_SIZE: "512"

and these settings for your i2s_audio:

---
external_components:
  - source:
      type: git
      url: https://github.com/gnumpi/esphome_audio
      ref: dev-next
    components:
      - adf_pipeline
      - i2s_audio
    refresh: 0s

i2s_audio:
  - id: i2s_in
    i2s_lrclk_pin: GPIO7
    i2s_bclk_pin: GPIO16
  - id: i2s_out
    i2s_lrclk_pin: GPIO8
    i2s_bclk_pin: GPIO18

adf_pipeline:
  - platform: i2s_audio
    id: adf_i2s_in
    type: audio_in
    i2s_audio_id: i2s_in
    i2s_din_pin: GPIO15
    pdm: false
    channel: left
    sample_rate: 16000
    bits_per_sample: 32bit
  - platform: i2s_audio
    id: adf_i2s_out
    type: audio_out
    i2s_audio_id: i2s_out
    i2s_dout_pin: GPIO17
    adf_alc: true
    alc_max: .5

microphone:
  - platform: adf_pipeline
    id: i2s_mic
    gain_log2: 3
    keep_pipeline_alive: false
    pipeline:
      - adf_i2s_in
      - self

media_player:
  - platform: adf_pipeline
    id: i2s_player
    name: "${device_friendly_name} Media Player"
    keep_pipeline_alive: false
    internal: false
    pipeline:
      - self
      - adf_i2s_out

I managed to get it working well enough for all the speakers to be able to hear me no matter where I am in the house.

Just a side note: I am using a MAX98357A as an external amp and an INMP441 as the mic (waiting for other mics to be delivered), so you might need to tweak some settings.

With esp-idf I now have almost no issues. The device reboots sometimes due to ADF issues, but this is just the way it is. gnumpi did a really amazing job with his library, but as far as I understand he is the sole developer, which is why it can be hit or miss for some.

ESP32-S3 is a really powerful little chip, but it is handicapped by the Arduino framework, so if you want to get the true potential out of it, then ESP-IDF is the only way to go.

Also, here is the config for micro_wake_word and a chopped-down version of voice_assistant (chopped down because you might want to use other settings; I just deleted all my custom code):

micro_wake_word:
  model: hey_jarvis
  on_wake_word_detected:
    - media_player.stop:
    - voice_assistant.start:

voice_assistant:
  id: assist
  microphone: i2s_mic
  media_player: i2s_player
  use_wake_word: false
  noise_suppression_level: 4
  auto_gain: 31dBFS
  volume_multiplier: 4.0
  on_client_connected:
    - if:
        condition:
          switch.is_on: use_wake_word
        then:
          - micro_wake_word.start:
  on_client_disconnected:
    - voice_assistant.stop:
    - micro_wake_word.stop:
  on_end:
    then:
      - voice_assistant.stop:
      - wait_until:
          not:
            voice_assistant.is_running:
      - if:
          condition:
            switch.is_on: use_wake_word
          then:
            - micro_wake_word.start:
  on_error:
    then:
      - voice_assistant.stop:
      - wait_until:
          not:
            voice_assistant.is_running:
      - if:
          condition:
            switch.is_on: use_wake_word
          then:
            - micro_wake_word.start:

switch:
  - platform: template
    id: use_wake_word
    name: Enable Voice Assistant
    optimistic: true
    restore_mode: RESTORE_DEFAULT_ON
    icon: mdi:assistant
    on_turn_on:
        - voice_assistant.stop:
        - delay: 1s
        - if:
            condition:
              not:
                - voice_assistant.is_running:
            then:
              - micro_wake_word.start:
    on_turn_off:
        - voice_assistant.stop:
        - micro_wake_word.stop:

itnassol commented 3 months ago

Thank you. This is all exciting stuff; I will give it a go today.
