NaomiProject / Naomi

The Naomi Project is an open source, technology agnostic platform for developing always-on, voice-controlled applications!
https://projectnaomi.com/
MIT License

Speaker Recognition #367

Open aaronchantrill opened 1 year ago

aaronchantrill commented 1 year ago

Description

This introduces a new "sr" plugin type, which allows Naomi to recognize users by voice.

Speaker recognition only happens during active speech recognition, because passive speech recognition needs to be fast.

The default "sr" plugin just passes back the name in the profile variable "first_name" without trying to recognize the speaker from the voice, which is basically how Naomi originally worked. The name of the speaker is embedded in the intent passed to the speechhandler as 'user', so it can be accessed as intent.get('user',''). The only plugin that is currently set up to use this is the shutdown plugin which may respond using the name of the user. The name of the user appears in parenthesis after the utterance if you have "print_transcript" on.

The setup still assumes en-US when downloading the VOSK models, which needs to be fixed to respect the "language" setting in the profile.
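
One possible shape for that fix, selecting a model by the profile's language setting (the profile.get call and the model table are assumptions for illustration, not code from this PR):

```python
# Hypothetical sketch: choose a VOSK model from the profile's "language"
# setting instead of hard-coding en-US. The profile.get call and model
# names are assumptions; see https://alphacephei.com/vosk/models.
from naomi import profile

VOSK_MODELS = {
    'en-US': 'vosk-model-small-en-us-0.15',
    'fr-FR': 'vosk-model-small-fr-0.22',
    'de-DE': 'vosk-model-small-de-0.15'
}

language = profile.get(['language'], 'en-US')
model_name = VOSK_MODELS.get(language, VOSK_MODELS['en-US'])
```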

The VOSK speaker recognition is not terribly accurate. It also seems like you need to retrain your speaker recognition database from new recordings when you switch to different recording hardware.

Naomi does not yet record the identified speaker in the audiolog. You currently have to tag user utterances manually with the NaomiSTTTrainer.py program, although I would like to see the ability to learn voices while running by asking when unsure. With VOSK, if the cosine angle is less than 30, it is probably the correct speaker; if no voice matches with a cosine angle of less than 60, it is most likely a new voice. Any time the best match is above 30, Naomi should ask to verify who is talking, as in the sketch below.
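
A minimal sketch of the verification flow those thresholds imply (the function and constant names are hypothetical, not code from this PR):

```python
# Hypothetical sketch of the proposed verification flow, using the
# cosine-angle thresholds described above.
CONFIDENT_MATCH = 30  # below this, assume the best match is correct
NEW_VOICE = 60        # above this for every known voice, assume a new speaker

def classify_speaker(best_name, best_angle):
    """Map the best match's cosine angle to an action."""
    if best_angle < CONFIDENT_MATCH:
        return ('accept', best_name)  # probably the correct speaker
    if best_angle > NEW_VOICE:
        return ('enroll', None)       # most likely a new voice
    return ('verify', best_name)      # ask who is talking to confirm
```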

Related Issue

Ability to recognize users by voice #267
VOSK STT Engine #280
Simplify the mic initialization #326

Motivation and Context

My eventual goal is to build vocabulary profiles for different users, including acoustic models and pronunciation dictionaries. Speaker recognition could also be used to build other unique per-user profiles.

How Has This Been Tested?

```
$ python -m unittest discover
...s.....ss.....sssssssss
----------------------------------------------------------------------
Ran 25 tests in 4.354s

OK (skipped=12)
```


lgtm-com[bot] commented 1 year ago

This pull request introduces 5 alerts when merging dab0b6649d7c930edbff87b9fd7bbec4d932f10e into d0418fdca64f227a98017a48c172b98f9a9c3ea2 - view on LGTM.com


aaronchantrill commented 1 year ago

I just realized that this change alters the behavior of the mic.active_listen() method, which now returns a dictionary containing the name of the speaker, a numerical confidence indicator in the identity of the speaker (distance), and the transcription of the utterance. This doesn't matter most of the time, since the utterance is already passed to the speechhandler plugin in an intent object, but it does matter for plugins that call active_listen() directly (like the frotz plugin). It does not affect plugins that use the expect or confirm methods. Frotz may be the only plugin affected right now.
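
For anyone updating an affected plugin, a hedged sketch of consuming the new return value (the exact dictionary keys are assumptions based on the description above):

```python
# Previously active_listen() returned just the transcription; it now
# returns a dictionary. The key names below are assumptions.
result = mic.active_listen()
# e.g. {'utterance': 'open the mailbox', 'speaker': 'Aaron', 'distance': 27.4}
transcription = result.get('utterance', '')
speaker = result.get('speaker', '')
distance = result.get('distance')
```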