NaomiProject / Naomi

The Naomi Project is an open source, technology agnostic platform for developing always-on, voice-controlled applications!
https://projectnaomi.com/
MIT License

VAD plugin #144

Closed aaronchantrill closed 5 years ago

aaronchantrill commented 5 years ago

Detailed Description

I want to be able to use different Voice Activity Detection programs like WebRTCVAD to see if we can get a better estimate of when someone is talking or not. Right now we are only using noise levels to detect when to start recording and send data to the STT parser.

Context

Currently, Naomi records whenever it hears a loud noise. It then sends this recording to Pocketsphinx or some other STT engine for parsing. Pocketsphinx is quite likely to interpret this noise as a word (usually "BUT" or "OF"), but occasionally it will hear "Naomi" and start active listening.

Since I am planning to roll out a system for recording audio clips and transcriptions that can be used for WER analysis and then used to train STT engines, I would like to reduce the number of accidental recordings.

Possible Implementation

I would create a new plugin category, VAD. I would turn the current strategy of recording based on signal-to-noise ratio into a plugin called SNRVAD, and also implement WebRTCVAD as a plugin. I would also implement a couple of auditing tools: one to measure how often audio gets passed to the STT engine erroneously, and one to measure how long the process of listening and responding takes, so we can tell whether a particular VAD engine adds to the overall processing latency.
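To make the idea concrete, here is a minimal sketch of what the SNR-style detection could look like as a standalone class. The names (`SNRVad`, `is_voiced`), the ratio, and the adaptation rate are all illustrative assumptions, not Naomi's actual plugin API:

```python
import math
import struct

def rms(chunk: bytes) -> float:
    """Root-mean-square energy of a chunk of 16-bit little-endian mono PCM."""
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

class SNRVad:
    """Flag a chunk as voiced when its energy exceeds the running noise
    floor by `ratio`; the floor adapts slowly during unvoiced chunks."""

    def __init__(self, ratio: float = 3.0, alpha: float = 0.05):
        self.ratio = ratio          # how far above the floor counts as speech
        self.alpha = alpha          # noise-floor adaptation rate
        self.noise_floor = None

    def is_voiced(self, chunk: bytes) -> bool:
        level = rms(chunk)
        if self.noise_floor is None:
            self.noise_floor = level  # seed the floor from the first chunk
            return False
        voiced = level > self.ratio * self.noise_floor
        if not voiced:
            # Only update the floor on background noise, so sustained speech
            # does not raise the threshold out from under itself.
            self.noise_floor = (1 - self.alpha) * self.noise_floor + self.alpha * level
        return voiced
```

A plugin wrapper around a class like this, plus one around webrtcvad, would give the auditing tools two interchangeable engines to compare.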

TuxSeb commented 5 years ago

This does need a milestone

G10DRAS commented 5 years ago

WebRTCVAD is fast enough on an RPi2B, and fetchThreshold() is easy to replace with is_speech() from the webrtcvad lib.

aaronchantrill commented 5 years ago

@G10DRAS yes, incorporating WebRTCVAD was the primary motivator for this project. I completed that a couple of weeks ago and have a pull request waiting. It appears to me that the main reason for having multiple threads constantly checking the audio stream for the wake word was that checking chunk by chunk was too slow, especially if a network or cloud service was involved. That approach was pretty cool, but also computationally expensive. The re-write is streamlined somewhat, still works for me, and should be easier to work with and understand.

AustinCasteel commented 5 years ago

The PR has been merged. We can reopen this issue or create a new one if the topic comes up again.