Hi @kristiankielhofner , yes, we can return the amplitude of the wake word audio when a wake word is detected. We will add the amplitude to afe_fetch_result_t.
Thanks, that would be great!
We also see cases with the ESP BOX where wake is very sensitive (a good thing) but the audio level is so low that getting accurate command recognition/transcripts for the subsequent speech is nearly impossible. I think it would be very useful generally for esp-sr users to be able to programmatically define a minimum amplitude threshold depending on the needs of their application.
In fact, we have an automatic gain adjustment method: when the wake word is detected, it adjusts to a suitable gain according to the amplitude of the wake word to ensure good speech recognition performance. However, this gain is only modified after the wake word is triggered (not very flexible, but the method is still very effective considering that the wake word is usually triggered first).
Users can disable this method by setting agc_mode = AFE_MN_PEAK_NO_AGC. Otherwise, the gain of the output audio will be adjusted automatically.
agc_mode configures the peak AGC mode. Note that this parameter is only for speech recognition scenarios and is only valid when WakeNet is enabled:
AFE_MN_PEAK_AGC_MODE_1 : feed linearly amplified audio signals to MultiNet, peak is -5 dB.
AFE_MN_PEAK_AGC_MODE_2 : feed linearly amplified audio signals to MultiNet, peak is -4 dB.
AFE_MN_PEAK_AGC_MODE_3 : feed linearly amplified audio signals to MultiNet, peak is -3 dB.
AFE_MN_PEAK_NO_AGC : feed original audio signals to MultiNet.
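For reference, a minimal sketch of how this option can be set, following the usual AFE_CONFIG_DEFAULT() / create_from_config() flow from the esp-sr examples for ESP32-S3 (macro and field names may differ between esp-sr versions, so treat this as illustrative only):

```c
#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"

// Sketch only: build the AFE config with AGC disabled (or set to a peak mode).
static esp_afe_sr_data_t *init_afe_no_agc(void)
{
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.agc_mode = AFE_MN_PEAK_NO_AGC;        // keep the original signal level
    // afe_config.agc_mode = AFE_MN_PEAK_AGC_MODE_2; // or amplify linearly to a -4 dB peak

    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
    return afe_handle->create_from_config(&afe_config);
}
```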
@kristiankielhofner , we usually adjust the gain of the codec (ADC) to set the initial gain, which gives a slightly better dynamic range. But considering the complexity of that operation, I think it's also a good idea to be able to set an initial amplitude gain in software.
@feizi We've experimented with these values a bit.
My understanding is they only apply when multinet is in use? Our users have the option of either using multinet or streaming audio after wake/vad to our inference server implementation (which uses Whisper) to transcribe any voice command.
When using the streaming approach users can have very complex commands such as requesting specific songs from Spotify, weather for cities, calendar appointments, connection with ChatGPT, etc. With this approach we have good quality audio and accurate transcription but the accuracy drops off at longer distances (five-six meters or so). We see the same variation at these distances when users have a high number of multinet commands with complex/similar command definitions.
Many of our users have multiple devices well within wake range. When these users issue the wake word, multiple devices wake and begin streaming or recognizing commands with multinet. Depending on a variety of factors (speaker, environment, complexity of command, distance, etc.), this causes the commands to be duplicated, which isn't great, but the real issue is that we also provide audio feedback on the result of the command, so users get multiple TTS responses or chimes for success/error. In my environment, for example, I often have as many as three devices waking at a time, and when I issue commands they are followed by several TTS responses from the various devices.
It's a bigger problem when the furthest device, at the very edge of wake detection range, attempts transcription/command recognition: the result is often incorrect, and erroneous commands can then be issued to the user's connected platforms.
If we can get the amplitude of the audio at wake we have a mechanism in Willow to select the specific device with the highest audio level and not stream to Whisper or process multinet commands on the others.
We currently set the initial ADC gain and allow users to change it if they wish although I don't think changing our default is common.
My understanding is they only apply when multinet is in use? Our users have the option of either using multinet or streaming audio after wake/vad to our inference server implementation (which uses Whisper) to transcribe any voice command.
This method was originally designed to improve the accuracy of multinet, but it can be used without multinet. We will add an option to change the initial gain in AFE, although we recommend directly adjusting the ADC gain.
We will add an option to change the initial gain in AFE
We set and can change the ADC gain; I'm not sure what this adds for this issue?
We really appreciate all of your assistance but all we're looking for at the moment is the amplitude of wake word audio as referenced above.
We set and can change the ADC gain; I'm not sure what this adds for this issue?
Just emphasizing that adjusting the ADC gain is a more appropriate approach.
We really appreciate all of your assistance but all we're looking for at the moment is the amplitude of wake word audio as referenced above.
OK, this function is under development and testing.
Thank you very much, we really appreciate your work on this and we're really looking forward to this functionality!
Hi @kristiankielhofner , the feature has been merged.
The volume of the wake word audio is in the afe_fetch_result_t struct; please refer to https://github.com/espressif/esp-sr/blob/master/include/esp32s3/esp_afe_sr_iface.h#L28.
You can set the initial linear gain via afe_config.afe_linear_gain:
https://github.com/espressif/esp-sr/blob/master/include/esp32s3/esp_afe_config.h#L75
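A short sketch of where that fits (the field name comes from the header linked above; the starting value of 1.0 is an assumption, check the header for the actual default):

```c
afe_config_t afe_config = AFE_CONFIG_DEFAULT();
afe_config.afe_linear_gain = 1.0f;  // initial software gain applied by the AFE; values > 1.0 amplify quiet input
```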
When your device has two or more microphones, I recommend using the volume only once the wake-up state is WAKENET_CHANNEL_VERIFIED. This ensures that the volume you get is from the correct channel, at the cost of about 100 ms of extra latency. Let me briefly explain: for the multi-microphone algorithm, our model will first return WAKENET_DETECTED, and then wait 3 frames before returning WAKENET_CHANNEL_VERIFIED.
When using multiple devices, it is important to keep the volume of each device normalized. Generally, the gain of the microphone will vary by ±3 dB.
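A rough sketch of a fetch loop that follows this advice. The volume field name data_volume used here is an assumption; check afe_fetch_result_t in the linked esp_afe_sr_iface.h for the exact name in your esp-sr version:

```c
#include "esp_afe_sr_iface.h"

// Sketch: read the wake volume only once the wake channel has been verified
// (multi-microphone case), then pass the frame on as usual.
static void fetch_task(esp_afe_sr_iface_t *afe_handle, esp_afe_sr_data_t *afe_data)
{
    while (true) {
        afe_fetch_result_t *res = afe_handle->fetch(afe_data);
        if (!res) {
            continue;
        }
        if (res->wakeup_state == WAKENET_CHANNEL_VERIFIED) {
            float wake_volume_db = res->data_volume;  // volume of the wake word audio, in dB
            // e.g. report wake_volume_db so a server can pick the loudest device
            (void)wake_volume_db;
        }
        // ... pass res->data on to MultiNet or stream it for transcription ...
    }
}
```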
@sun-xiangyu
Thank you for this! Here's an example of the WebSocket events from our current (very early) implementation, with three devices in "wake range":
I (18:05:31.848) WILLOW/WAS: received text data on WebSocket: {
"wake_start": {
"hostname": "willow-f412fafd0ecc",
"wake_volume": -22.639871597290039
}
}
I (18:05:31.854) WILLOW/WAS: received text data on WebSocket: {
"wake_start": {
"hostname": "willow-7cdfa1e189cc",
"wake_volume": -33.2228775024414
}
}
I (18:05:31.887) WILLOW/WAS: received text data on WebSocket: {
"wake_start": {
"hostname": "willow-7cdfa1e1aa84",
"wake_volume": -19.98358154296875
}
}
Approximate distances from speaker in this example:
willow-7cdfa1e1aa84 - 1 meter
willow-f412fafd0ecc - 2 meters
willow-7cdfa1e189cc - 4 meters (indirectly)
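Once every device reports its wake_volume, the selection itself is simple: pick the device with the largest (least negative) value. A minimal sketch with a hypothetical report type, purely illustrative since the real logic lives server-side in Willow rather than in this C form:

```c
#include <stddef.h>

// Hypothetical report type; in practice these values arrive as the
// wake_volume field of the wake_start WebSocket messages shown above.
typedef struct {
    const char *hostname;
    float wake_volume_db;   // higher (less negative) means louder at that device
} wake_report_t;

// Return the hostname of the device that heard the wake word loudest.
static const char *pick_winner(const wake_report_t *reports, size_t count)
{
    const char *winner = NULL;
    float best = -1e9f;
    for (size_t i = 0; i < count; i++) {
        if (reports[i].wake_volume_db > best) {
            best = reports[i].wake_volume_db;
            winner = reports[i].hostname;
        }
    }
    return winner;   // willow-7cdfa1e1aa84 (-19.98 dB) in the example above
}
```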
Based on these early test results this is very useful for us and exactly what we were looking for!
Thanks again!
With Willow we have an issue where multiple ESP devices within audio range of the person speaking wake simultaneously.
We are building a messaging system to detect this case but determining which device should "win" on wake is a problem. One thought we have is to get the amplitude level of the speech/audio to select the ESP where the audio is loudest.
Is this possible with esp-sr?