esphome / feature-requests

ESPHome Feature Request Tracker
https://esphome.io/

Use Espressif ESP-SR speech recognition framework to allow local on-device Wake Word on ESP32? #2319

Open Hedda opened 1 year ago

Hedda commented 1 year ago

Describe the problem you have/What new integration you would like

Use "WakeNet Wake Word Engine" from ESP-SR (Espressif Speech Recognition framework) to enable on-device Wake Word on ESP32.

https://www.cnx-software.com/2023/07/17/espressif-esp-sr-enables-on-device-speech-recognition-framework-on-esp32-s3-and-esp32-wisocs/

ESP-SR framework v1.0 was first released on December 17, 2021, and the ESP-SR v1.20 update was introduced in March 2023.

https://github.com/espressif/esp-sr

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

Espressif's wake word engine WakeNet is specially designed to provide a high-performance, low-memory-footprint wake word detection algorithm, which enables devices to always listen for wake words such as “Alexa”, “Hi,Lexin” and “Hi,ESP”. You can refer to Model loading method to build your project.

Currently, Espressif not only provides the official wake words "Hi,Lexin" and "Hi,ESP" to the public for free, but also allows customized wake words. For details on how to customize your own wake words, please see Espressif Speech Wake Words Customization Process.

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/README.html

Please describe your use case for this integration and alternatives you've tried:

Local on-device Wake Word with ESPHome on ESP32 (ESP32-S3 or better) as part of Home Assistant's year of Voice effort.

https://www.home-assistant.io/blog/2022/12/20/year-of-voice/

Additional context

That Project utilizes the ESP32-S3 microcontroller and an INMP441 omnidirectional microphone to enable voice command functionality. By leveraging the Espressif Speech Recognition framework, the system can recognize wake-up words like "Hey Siri" and "Ok Google" and execute specific actions associated with those commands. The voice user interface provides a convenient means of interaction in screenless environments and can be applied to a wide range of projects. The implementation involves configuring the ESP32-S3, integrating the INMP441 microphone to capture audio input, processing the speech recognition using the framework, and linking recognized commands to predefined actions. Overall, this project enables users to control various tasks and functions by speaking voice commands, enhancing usability in scenarios where traditional screen-based interfaces are not available.
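
To make the "linking recognized commands to predefined actions" step more concrete, here is a rough, hardware-free sketch of the control flow in Python. It is not ESP-SR code and does not use any ESP-SR API: the recognizer is replaced by a plain list of already-recognized text labels, and the names `run_voice_loop`, `ACTIONS`, `WAKE_WORD` and the phrase strings are all made up for illustration. On a real device the labels would presumably come from the wake word and command recognition stages of the framework.

```python
from typing import Callable, Dict, Iterable, List

# Placeholder wake word label (assumption; a real engine reports its own IDs).
WAKE_WORD = "hi esp"


def run_voice_loop(
    labels: Iterable[str],
    actions: Dict[str, Callable[[], str]],
    wake_word: str = WAKE_WORD,
) -> List[str]:
    """Consume recognizer labels and dispatch one command per wake word.

    The loop stays idle until the wake word is seen. The next label is then
    treated as a command: known commands run their action, unknown ones are
    ignored. In both cases the loop returns to idle afterwards.
    """
    results: List[str] = []
    awake = False
    for label in labels:
        if not awake:
            # Idle: only the wake word changes state, everything else is dropped.
            awake = label == wake_word
            continue
        action = actions.get(label)
        if action is not None:
            results.append(action())
        awake = False  # back to idle after one command attempt
    return results


# Hypothetical command table; the returned strings stand in for real actions
# such as toggling a GPIO or calling a Home Assistant service.
ACTIONS: Dict[str, Callable[[], str]] = {
    "turn on the light": lambda: "light:on",
    "turn off the light": lambda: "light:off",
}

demo = run_voice_loop(
    [
        "noise",
        "hi esp", "turn on the light",
        "turn off the light",          # ignored: no wake word before it
        "hi esp", "unknown phrase",    # wake word, but unknown command
        "hi esp", "turn off the light",
    ],
    ACTIONS,
)
print(demo)
```

The point of the table-driven design is that adding a new voice command only means adding one entry to `ACTIONS`; the listening loop itself does not change.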

https://twitter.com/EspressifSystem/status/1680232917476446208

https://www.youtube.com/watch?v=3XbnzfBjmZk&ab_channel=ThatProject

https://www.youtube.com/watch?v=qq2FRv0lCPw&ab_channel=ThatProject

h3ndrik commented 11 months ago

Another approach would be to use TensorFlow Lite for Microcontrollers. AFAIK it's one of the very few open-source solutions around.

Edit: https://github.com/kahrendt/esphome-on-device-wake-word