Picovoice / eagle

On-device speaker recognition engine powered by deep learning
Apache License 2.0

Performance issue in Eagle #74

Closed LongMingWei closed 4 months ago

LongMingWei commented 4 months ago

Have you checked the docs and existing issues?

SDK

Python

Eagle package version

1.0.1

Framework version

Python 3.12

Platform

Windows (x86_64)

OS/Browser version

Windows 11

Describe the bug

I'm making a face recognition system with voice verification. It has recognize user, register user, and delete user functions: register user collects a new face and voice profile, while recognize user verifies that the input face and voice match a registered user.

I'm trying to store exported speaker profiles in a MongoDB database as binary data, since MongoDB cannot store EagleProfile objects directly. However, when I convert the binary data from MongoDB back into an Eagle speaker profile and compare it against live input while the software is running, the scores from Eagle's process function are 0 most of the time. The main code is shown below. Any help will be appreciated!

```python
from pveagle import EagleProfile
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
import cv2
import numpy as np
import face_recognition
from PIL import Image
import gradio as gr
from deepface import DeepFace
import pveagle
from pvrecorder import PvRecorder
import sys
import time

sys.path.append("Silent-Face-Anti-Spoofing")
from test1 import test  # from test1.py in the Silent-Face-Anti-Spoofing folder

gr.themes.builder()

uri = "hidden"
client = MongoClient(uri, server_api=ServerApi('1'))
collection = client['users']['faces']
collection.create_index("username", unique=True)

# Load all embeddings into memory
user_embeddings = {}

def load_embeddings():
    global user_embeddings
    user_embeddings = {}
    for user in collection.find():
        stats = {}
        username = user['username']
        embedding = np.array(user['embedding'])
        voice_profile = user.get('voice_profile', None)
        if voice_profile:
            voice_profile = EagleProfile.from_bytes(voice_profile)
        stats['face'] = embedding
        stats['speaker'] = voice_profile
        user_embeddings[username] = stats
        print(username)

load_embeddings()

access_key = "hidden"

state = 1
usersname = ""

def recognize_user(image):
    global state, usersname
    if state == 1:
        img = np.array(image)
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Anti-Spoofing
        label = test(
            image=img_rgb,
            model_dir='./Silent-Face-Anti-Spoofing/resources/anti_spoof_models',
            device_id=0
        )

        if label == 1:
            if len(face_recognition.face_encodings(img_rgb)) == 0:
                return "No person detected"
            else:
                name = recognize(img_rgb)
                if name in ['unknown_person', 'no_persons_found']:
                    return "Unknown user. Please register if you have not or try again."
                else:
                    state = 2
                    result = DeepFace.analyze(img_rgb, actions=['emotion'], enforce_detection=False)
                    mood = result[0]['dominant_emotion']
                    usersname = name
                    if name.endswith('_0001') or name.endswith('_0002') or name.endswith('_0003'):
                        name = name[:-5]
                    return f"Hello {name}, you look {mood} today! Verify your voice to enter."
        else:
            name = recognize(img_rgb)
            if name in ['unknown_person', 'no_persons_found']:
                return "Hello unknown fake NPC. You shall not pass!"
            else:
                if name.endswith('_0001') or name.endswith('_0002') or name.endswith('_0003'):
                    name = name[:-5]
                return f"You think you are {name}? You shall not pass!"
    else:
        state = 1
        speaker_profile = user_embeddings[usersname]['speaker']
        if not speaker_profile:
            return "This profile has no voice registered. (Celebrity)"
        eagle = pveagle.create_recognizer(access_key=access_key, speaker_profiles=[speaker_profile])
        recognizer_recorder = PvRecorder(device_index=-1, frame_length=eagle.frame_length)
        recognizer_recorder.start()
        sum = 0
        for i in range(50):
            audio_frame = recognizer_recorder.read()
            sum += eagle.process(audio_frame)[0]
        recognizer_recorder.stop()
        eagle.delete()
        recognizer_recorder.delete()

        score = sum / 50
        if score >= 0.6:
            return "Access granted. Have a great day!"
        else:
            return "Voice does not match. Try again."

rstate = 1
rname = ""
rembed = None

def register_user(name, image):
    global rstate, rname, rembed
    if rstate == 1:
        if not name:
            return "Name field empty."

        img = np.array(image)
        embeddings = face_recognition.face_encodings(img)
        if len(embeddings) == 0:
            return "No face detected in the image."
        else:
            if name in user_embeddings:
                return "Name already taken."
            rstate = 2
            rname = name
            embedding = embeddings[0]
            rembed = embedding
            return "Press 'Submit' again, then speak into the microphone until voice calibration is complete."
    else:
        eagle_profiler = pveagle.create_profiler(access_key=access_key)
        recorder = PvRecorder(device_index=-1, frame_length=eagle_profiler.min_enroll_samples)
        recorder.start()
        enroll_percentage = 0.0
        while enroll_percentage < 100.0:
            audio_frame = recorder.read()
            enroll_percentage, feedback = eagle_profiler.enroll(audio_frame)
            print(enroll_percentage)
        recorder.stop()

        speaker_profile = eagle_profiler.export()
        document = {
            'username': rname,
            'embedding': rembed.tolist(),  # Convert to list for MongoDB storage
            'voice_profile': speaker_profile.to_bytes()
        }
        collection.insert_one(document)
        user_embeddings[rname] = {}  # Add new data entry to memory
        user_embeddings[rname]['face'] = rembed
        user_embeddings[rname]['speaker'] = speaker_profile.to_bytes()

        rstate = 1
        eagle_profiler.delete()
        recorder.delete()
        return f"User {rname} registered successfully!"

def delete_user(name):
    if not name:
        return "Name field empty."

    result = collection.delete_one({'username': name})
    if result.deleted_count > 0:
        user_embeddings.pop(name)  # Remove data entry from memory
        return f'User {name} deleted successfully!'
    else:
        return f'User {name} not found!'

def recognize(img):
    embeddings_unknown = face_recognition.face_encodings(img)
    if len(embeddings_unknown) == 0:
        return 'no_persons_found'
    else:
        embeddings_unknown = embeddings_unknown[0]

    match = False
    threshold = 0.6
    mini = threshold
    min_dis_id = 'unknown_person'
    for username, embedding in user_embeddings.items():
        face_embedding = embedding['face']
        distance = face_recognition.face_distance([face_embedding], embeddings_unknown)[0]
        if distance < mini:
            mini = distance
            min_dis_id = username
            match = True
    return min_dis_id if match else 'unknown_person'

# Frontend interface
theme = gr.themes.Soft(
    primary_hue="slate",
    secondary_hue="slate",
    font=[gr.themes.GoogleFont('Roboto'), gr.themes.GoogleFont('Montserrat'), 'ui-sans-serif', 'sans-serif']
)

iface_recognize = gr.Interface(
    fn=recognize_user,
    inputs=[gr.Image(source="webcam", streaming=True)],
    outputs=[gr.HTML()],
    live=True,
    every=1,
    title="Face Recognition Attendance System",
    allow_flagging='never',
    clear_btn=None
)

iface_register = gr.Interface(
    fn=register_user,
    inputs=[gr.Textbox(label="Enter new user name."), gr.Image(source="webcam", streaming=True)],
    outputs=[gr.HTML()],
    title="Register New User",
    live=False,
    allow_flagging='never',
    clear_btn=None
)

iface_delete = gr.Interface(
    fn=delete_user,
    inputs=[gr.Textbox(label="Enter user name to delete")],
    outputs=[gr.HTML()],
    title="Delete User",
    live=False,
    allow_flagging='never',
    clear_btn=None
)

custom_css = """
footer {display: none !important;}
label.float {display: none !important;}
div.stretch button.secondary {display: none !important;}
.panel .pending {opacity: 1 !important;}
"""

iface = gr.TabbedInterface(
    [iface_recognize, iface_register, iface_delete],
    ["Recognize User", "Register User", "Delete User"],
    css=custom_css,
    theme=theme
)

if __name__ == "__main__":
    iface_recognize.dependencies[0]["show_progress"] = False
    iface_register.dependencies[0]["show_progress"] = False
    iface_delete.dependencies[0]["show_progress"] = False
    iface.launch()
```

Steps To Reproduce

Please look at the code first to see if there's anything wrong, or whether there's something about MongoDB or Eagle that I'm unaware of. If the issue can't be solved from the code alone, I can send the repo.

Expected Behavior

High scores returned from Eagle's process function when voices from the same user are compared.

laves commented 4 months ago

@LongMingWei - can you reproduce this issue by saving and loading the model from your hard drive? If it is only present when you are using MongoDB, then that would probably not be an issue with Eagle, but more likely an issue with the database code.
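To isolate the database from the engine as laves suggests, the profile bytes can be round-tripped through a file on disk. A minimal sketch of that check (the `profile_bytes` literal here is a stand-in for a real `speaker_profile.to_bytes()` result, since running pveagle requires an access key; with pveagle installed you would restore the loaded bytes via `EagleProfile.from_bytes`):

```python
import os
import tempfile

# Stand-in for the bytes returned by speaker_profile.to_bytes().
profile_bytes = b"\x00\x01\x02 example eagle profile bytes"

# Save the exported profile to disk...
path = os.path.join(tempfile.mkdtemp(), "speaker.eagle")
with open(path, "wb") as f:
    f.write(profile_bytes)

# ...and load it back, exactly as you would before calling
# EagleProfile.from_bytes(restored) and pveagle.create_recognizer(...).
with open(path, "rb") as f:
    restored = f.read()

assert restored == profile_bytes  # byte-for-byte identical round trip
```

If scores are still 0 with a disk round trip, the storage layer is ruled out and the problem lies in the audio path instead.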

LongMingWei commented 4 months ago

This is my new test script without using MongoDB:

```python
import pveagle
from pveagle import EagleProfile
from pvrecorder import PvRecorder

access_key = "hidden"

eagle_profiler = pveagle.create_profiler(access_key=access_key)
recorder = PvRecorder(device_index=-1, frame_length=eagle_profiler.min_enroll_samples)
recorder.start()
enroll_percentage = 0.0
while enroll_percentage < 100.0:
    audio_frame = recorder.read()
    enroll_percentage, feedback = eagle_profiler.enroll(audio_frame)
    print(enroll_percentage)
recorder.stop()

speaker_profile = eagle_profiler.export()
voicebyte = speaker_profile.to_bytes()
eagle_profiler.delete()
recorder.delete()

speaker_profile = EagleProfile.from_bytes(voicebyte)
eagle = pveagle.create_recognizer(access_key=access_key, speaker_profiles=[speaker_profile])
recognizer_recorder = PvRecorder(device_index=-1, frame_length=eagle.frame_length)
recognizer_recorder.start()

for i in range(1000):
    audio_frame = recognizer_recorder.read()
    print(eagle.process(audio_frame)[0])
recognizer_recorder.stop()
eagle.delete()
recognizer_recorder.delete()
```

When recognizing my voice, it was successfully detected at first; however, as time went by the score dropped back to 0 even though I was still speaking. May I also ask whether there's a way for the profiler to create a new voice profile faster while remaining accurate? The time to enroll a voice is currently quite long. Thanks for the help!

mrrostam commented 4 months ago

Thank you, @LongMingWei, for providing the sample code. We ran it but were unable to reproduce the issue you described: Eagle continued to return meaningful scores even after several minutes.

If you're observing unusual behavior in the results, it may be due to the enrollment process. For testing, I suggest recording your voice using a different application, saving the audio file, and listening to it to ensure the quality is acceptable. Then, pass this file directly to the profiler instead of using pv_recorder and see if there is any difference.

LongMingWei commented 4 months ago

Thanks for the prompt help so far! There's a problem I couldn't figure out, however. When I pass the audio file directly into eagle_profiler.enroll(audiofile), I get a TypeError: only integer scalar arrays can be converted to a scalar index. From the documentation it seems pvrecorder.read() converts the recorded audio into a list of numbers, but I couldn't find in the documentation how to convert an audio file directly into the list of samples that the Eagle profiler can accept. Thank you!

mrrostam commented 4 months ago

As mentioned in the documentation, the enroll function accepts PCM signal only. You'll need to read the audio file using another library and then pass the PCM data to the Eagle profiler.

For an example, please take a look at this Eagle demo in which wave files are read and passed to the enroll function. This demonstrates how to properly extract and format PCM data for use with the Eagle profiler.
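The demo's file-reading step can be sketched roughly like this using only the standard library; the in-memory sine tone here is just a stand-in for a real recording, and the 16 kHz rate is an assumption that should be checked against `eagle_profiler.sample_rate`:

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumed rate; verify against eagle_profiler.sample_rate

# Build a small in-memory WAV file as a stand-in for a real recording.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit
    w.setframerate(SAMPLE_RATE)
    tone = [int(10000 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
            for t in range(SAMPLE_RATE)]  # one second of a 440 Hz tone
    w.writeframes(struct.pack("<" + "h" * len(tone), *tone))
buf.seek(0)

# Read it back as a flat sequence of 16-bit samples: the PCM format
# that enroll()/process() expect instead of raw file bytes.
with wave.open(buf, "rb") as w:
    assert w.getsampwidth() == 2 and w.getnchannels() == 1
    pcm = struct.unpack("<" + "h" * w.getnframes(), w.readframes(w.getnframes()))

print(len(pcm))  # 16000 samples = one second of audio
```

Passing `pcm` (a sequence of int16 samples) rather than the file object avoids the TypeError above.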

LongMingWei commented 4 months ago

Hello! I managed to figure out how to convert the audio data into PCM data to enroll the voice, but something very strange happened. The results are very inconsistent: one time I recorded my voice and it was detected, but sometimes it still detected my voice even when I wasn't speaking. Another time the old problem reappeared and the scores stayed at 0 throughout. I did try to downsample the audio file to match the required sample rate of pvrecorder; do you think that's the problem, or is it something else I didn't realise? Thank you!

```python
def register_user(name, voice, image):  # Function to register new user including voice
    if not name:
        return "Name field empty."
    elif name in user_embeddings:
        return "Name already taken."
    if not voice:
        return "Voice profile missing."

    img = np.array(image)
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    embeddings = face_recognition.face_encodings(img_rgb)

    # Anti-Spoofing
    label = test(
        image=img_rgb,
        model_dir='./Silent-Face-Anti-Spoofing/resources/anti_spoof_models',
        device_id=0
    )

    if len(embeddings) == 0:
        return "No face detected in the image."
    elif label != 1:
        return "Face is fake. Registration denied."

    embedding = embeddings[0]

    eagle_profiler = pveagle.create_profiler(access_key=access_key)
    audio = read_file(voice, eagle_profiler.sample_rate)
    enroll_percentage, feedback = eagle_profiler.enroll(audio)
    if enroll_percentage < 100.0:
        return "Unable to register voice profile. Speak for a longer period of time."
    else:
        speaker_profile = eagle_profiler.export()

    document = {
        'username': name,
        'embedding': embedding.tolist(),  # Convert to list for MongoDB storage
        'voice_profile': speaker_profile.to_bytes()
    }
    collection.insert_one(document)
    user_embeddings[name] = {}  # Add new data entry to memory
    user_embeddings[name]['face'] = embedding
    user_embeddings[name]['speaker'] = speaker_profile

    eagle_profiler.delete()

    return f"User {name} registered successfully!"
```

```python
def read_file(file_name, sample_rate):  # Function used in the register user function
    rate_match = True

    with wave.open(file_name, mode="rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_width = wav_file.getsampwidth()
        num_frames = wav_file.getnframes()

        params = wav_file.getparams()
        params = list(params)

        if wav_file.getframerate() != sample_rate:
            params[3] = sample_rate
            print(params)
            print("Audio file should have a sample rate of %d. got %d" % (sample_rate, wav_file.getframerate()))
        if sample_width != 2:
            raise ValueError("Audio file should be 16-bit. got %d" % sample_width)
        if channels == 2:
            print("Eagle processes single-channel audio but stereo file is provided. Processing left channel only.")

    if not rate_match:
        with wave.open(file_name, mode="wb") as wav_file:
            wav_file.setparams(params)

    with wave.open(file_name, mode="rb") as wav_file:
        samples = wav_file.readframes(num_frames)

    frames = struct.unpack('h' * num_frames * channels, samples)

    return frames[::channels]
```

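One thing worth noting about read_file above: rewriting `params[3]` only changes the header's declared rate, not the sample data itself (and `rate_match` is never set to False, so the rewrite branch never runs). A real downsample has to produce new samples. A minimal linear-interpolation sketch of what that means (illustrative only; a proper resampling library with anti-alias filtering would be preferable in practice):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a sequence of PCM samples via linear interpolation (illustrative only)."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two neighbouring source samples.
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out

# 44.1 kHz -> 16 kHz: one second of input becomes 16000 output samples.
src = [0] * 44100
dst = resample_linear(src, 44100, 16000)
print(len(dst))  # 16000
```

If the file was "downsampled" only by editing the header, the audio Eagle receives is effectively time-stretched, which could explain the inconsistent scores.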
mrrostam commented 4 months ago

It seems that saving and loading the speaker profile is working correctly then. Your question seems to be more concerned with Eagle's performance. Could you please try our demos to see if you notice the same unusual behavior in scoring?

LongMingWei commented 4 months ago

Hello! It seems the same unusual scoring behavior happened even when using the mic demo, with scores remaining stuck at 0. Also, when enrolling, it sometimes detects good audio even when I'm not speaking in a quiet area. I also noticed there were two detected mics on my device when using show audio devices, but I could not find any documentation on the demo page or in the help command showing how to change the mic used for enrolling and testing; that could be one of the possible problems. Thank you!

mrrostam commented 4 months ago

As we haven't observed the behavior you're describing on our end, we suspect there may be a configuration issue with your setup. We cannot assist you further unless you provide sample audio files for both enrollment and verification.

Regarding your demo questions, each demo includes a help option accessible with the --help flag, which provides a comprehensive explanation of each flag. Additionally, detailed explanations can be found in the accompanying readme file. Please make use of all these resources before asking your questions here.

To specify the recording device, use the --audio_device_index option. To verify audio input to Eagle, use the --output_audio_path option in the mic demo to save recorded audio to a file for playback and analysis.

LongMingWei commented 4 months ago

Hello! After testing, it seems the problem was indeed the audio_device_index. Thank you! Just FYI, many of the options like --audio_device_index and --output_audio_path are not mentioned in this demo readme, and using --help also gave very little info, as shown below. If there is any info I missed, perhaps you could add it to the readme docs so that people can find it more easily.

[screenshot of the --help output]

ksyeo1010 commented 4 months ago

Thanks for letting us know, we will update the docs later.