dscripka / openWakeWord

An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.

[feature idea] Custom Verifier Model also outputs which voice/person spoke the wake word #21

Closed dalehumby closed 1 year ago

dalehumby commented 1 year ago

I have a need (and, based on conversations on the Rhasspy forums, I think others do too) to know not only which wake word was used, but also who spoke it.

Would it be possible to extend the custom verifier to output an ID of a previously onboarded voice?

For example, Bob and Jane live together and use Rhasspy+openWakeWord. They both recorded positive and negative samples for the custom verifier. Bob says "alexa, set my alarm for 5am". The wake word detector outputs that the wake word was 'alexa', and it was (with some confidence score?) spoken by Bob.

This would allow custom intent handling based on the person speaking. In this example, setting Bob's alarm.
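To make the request concrete, here is a rough sketch of what this could look like from the caller's side. The `speaker_verifier_models` argument, the `last_speaker` attribute, and the `audio_frames`/`handle_intent` helpers are all hypothetical and not part of the current openWakeWord API:

```python
# Hypothetical sketch only: the speaker-related parameter and output below do
# not exist in openWakeWord today; they illustrate the requested behaviour.
from openwakeword.model import Model

model = Model(
    wakeword_models=["alexa"],
    # Imagined parameter: per-person verifier models trained on Bob's and
    # Jane's positive/negative recordings.
    speaker_verifier_models={"bob": "bob.pkl", "jane": "jane.pkl"},
)

for frame in audio_frames():          # hypothetical 16 kHz PCM frame source
    scores = model.predict(frame)     # e.g. {"alexa": 0.97}
    if scores["alexa"] > 0.5:
        # Imagined output: which onboarded voice most likely spoke the wake word.
        speaker, confidence = model.last_speaker   # e.g. ("bob", 0.83)
        handle_intent(wake_word="alexa", speaker=speaker)  # hypothetical handler
```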

Perhaps there is another way to do this? For example, running speaker identification on the full utterance, which would increase voice-match accuracy?

I don't have any intention of using this for security ("alexa, unlock my front door").

dscripka commented 1 year ago

This is a good idea, and as you point out, a natural extension of the custom verifier model concept.

I've done some minimal exploration around this functionality, and I suspect there will be a few challenges:

1) The custom verifier models are quite simple, and likely don't have the capacity to both filter the predictions of the primary model and verify the speaker.

2) Speaker identification models come in two types: text-dependent and text-independent. The latter are certainly more flexible, and in the example you shared would enable using the entire utterance to improve the matching performance. However, text-independent models are harder to train and may be too computationally demanding to fit within the openWakeWord design. Text-dependent identification may be feasible though, as the limitation to a known word/phrase may allow for the shared audio embedding from openWakeWord to provide sufficient performance for this task.

Currently, this 2nd option is the approach that I am pursuing, and if it seems promising I will hopefully include an initial model in an upcoming release of openWakeWord. One important caveat is that even if this approach works, performance will likely have practical limitations. I doubt detection will be reliable for more than a small number of voices, and to your point, would absolutely not be suitable for security/verification purposes.
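As a rough illustration of the text-dependent idea (not the actual implementation), a small multi-class classifier could be trained on the same shared audio embeddings. `get_embeddings` below is a hypothetical stand-in for however those embedding features are extracted from a clip:

```python
# Sketch of text-dependent speaker identification on top of shared embeddings.
# `get_embeddings(clip)` is a hypothetical helper returning a (frames, features)
# array of openWakeWord-style embeddings for a short wake word recording.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_speaker_classifier(clips_by_speaker):
    """clips_by_speaker: dict mapping labels ("bob", "jane", "other") to lists
    of wake word recordings from that speaker."""
    X, y = [], []
    for speaker, clips in clips_by_speaker.items():
        for clip in clips:
            X.append(get_embeddings(clip).mean(axis=0))  # average-pool over time
            y.append(speaker)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.array(X), y)
    return clf

# At detection time, score the clip that triggered the wake word:
#   probs = clf.predict_proba([get_embeddings(clip).mean(axis=0)])[0]
#   speaker = clf.classes_[probs.argmax()]
```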

dalehumby commented 1 year ago

Thanks for the rapid response.

I imagine that in a family home there would be 1 or 2, maybe at most 5, onboarded voices. For unknown voices, a very low confidence score would fall back to an "unknown" user and default handling. (In my example above, setting a generic alarm.)
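For the fallback behaviour described above, the logic could be as simple as thresholding the per-speaker confidences (the names and threshold value are only illustrative):

```python
# Illustrative fallback: if no onboarded voice is confident enough, return
# "unknown" so the intent handler can apply default behaviour.
UNKNOWN_THRESHOLD = 0.6  # would need tuning in practice

def resolve_speaker(speaker_scores):
    """speaker_scores: dict like {"bob": 0.83, "jane": 0.07}."""
    speaker, score = max(speaker_scores.items(), key=lambda kv: kv[1])
    return speaker if score >= UNKNOWN_THRESHOLD else "unknown"
```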

Please let me know if you need help with testing or early feedback.

dscripka commented 1 year ago

Closing this issue and moving the topic to a newly created Discussion thread (#22).

@dalehumby, if you are still interested in providing some feedback, that would be useful!