Human voice activity detection (AIS-1287)

tareqAmen commented 11 months ago

Can anyone help me write ESP32 code for human voice activity detection using the esp-skainet framework? In my project, I want to turn on my device only when a human voice is present, so that it is not triggered by other sounds in the environment, such as wind.

I'm using an ESP32-S3 with an INMP441 I2S microphone.

feizi commented 11 months ago

AFE already supports VAD. You can get the VAD state from the result returned by afe->fetch(). If you do not want to run the other AFE modules, e.g. WakeNet, BSS, or AEC, you can disable them.

/**
 * @brief The result of fetch function
 */
typedef struct afe_fetch_result_t
{
    int16_t *data;                          // the data of audio.
    int data_size;                          // the size of data. The unit is byte.
    int wakeup_state;                       // the value is wakenet_state_t
    int wake_word_index;                    // if the wake word is detected. It will store the wake word index which start from 1.
    int vad_state;                          // the value is afe_vad_state_t
    int trigger_channel_id;                 // the channel index of output
    int wake_word_length;                   // the length of wake word. It's unit is the number of samples.
    int ret_value;                          // the return state of fetch function
    void* reserved;                         // reserved for future use
} afe_fetch_result_t;
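
For the "turn the device on only while someone is speaking" use case, the per-frame vad_state usually needs some smoothing before it drives a switch. Below is a minimal sketch of one way to do that in plain C. The speech_gate_t type and both functions are hypothetical helpers written for illustration; they are not part of esp-sr. The gate switches on only after several consecutive speech frames and off only after several consecutive silence frames.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an esp-sr API): debounces per-frame VAD results. */
typedef struct {
    int on_frames;   /* consecutive speech frames needed to switch on   */
    int off_frames;  /* consecutive silence frames needed to switch off */
    int run;         /* length of the current run that disagrees with `active` */
    bool active;     /* debounced output: true while speech is considered present */
} speech_gate_t;

void speech_gate_init(speech_gate_t *g, int on_frames, int off_frames)
{
    assert(g != NULL && on_frames > 0 && off_frames > 0);
    g->on_frames = on_frames;
    g->off_frames = off_frames;
    g->run = 0;
    g->active = false;
}

/* Feed one frame's VAD decision; returns the debounced state. */
bool speech_gate_update(speech_gate_t *g, bool frame_is_speech)
{
    assert(g != NULL);
    if (frame_is_speech == g->active) {
        g->run = 0;              /* frame agrees with the current state */
        return g->active;
    }
    g->run++;                    /* frame disagrees: extend the run */
    int needed = g->active ? g->off_frames : g->on_frames;
    if (g->run >= needed) {      /* run is long enough: flip the state */
        g->active = !g->active;
        g->run = 0;
    }
    return g->active;
}
```

In a fetch loop this would be fed with something like speech_gate_update(&gate, res->vad_state == AFE_VAD_SPEECH), where res is the pointer returned by afe->fetch(); the exact enum name is an assumption and depends on the esp-sr version. Debouncing only suppresses short false triggers, so sustained wind noise that the VAD keeps classifying as speech will still pass through.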

tareqAmen commented 11 months ago

I tested the VAD, but it also responds to wind noise, which I want to avoid. Can you suggest any solutions for this?

jayavanth commented 11 months ago

@tareqAmen you can try setting .vad_mode = VAD_MODE_4 in your afe_config_t variable. The mode ranges from VAD_MODE_0 to VAD_MODE_4.

The default is 3, and 0 is the most sensitive.

https://github.com/espressif/esp-sr/blob/455314a90cac59d4f50253cf719659f0b9f5d778/include/esp32s3/esp_vad.h#L30
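
As a rough sketch of how the two suggestions in this thread might be combined, the fragment below sets the VAD mode and switches off the modules that are not needed. It is a configuration fragment only: it needs the esp-sr headers and an ESP-IDF project to build, and the field names other than vad_mode are assumptions that may differ between esp-sr versions.

```c
/* Fragment, not a complete program: requires the esp-sr AFE headers. */
afe_config_t afe_config = AFE_CONFIG_DEFAULT();

afe_config.vad_init     = true;        /* assumption: keeps the VAD module enabled */
afe_config.vad_mode     = VAD_MODE_4;  /* least sensitive setting, as suggested above */
afe_config.wakenet_init = false;       /* assumption: disables WakeNet */
afe_config.aec_init     = false;       /* assumption: disables AEC */
```

All other fields are left at their defaults here; settings such as the microphone channel layout may still need adjusting for a single INMP441.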
