Martlgap / octuplet-loss

Repo for our Paper: Octuplet Loss: Make Your Face Recognition Model Robust to Image Resolution

cosine distance between different people is < 0.5 #9

Open markwillowtree opened 3 months ago

markwillowtree commented 3 months ago

I based the code below on the example main.py from the Hugging Face model page.

Identical images produce a very small distance of 4.919836871231098e-09, as expected.

The images of two different people used here produce a distance of only 0.3032730731332305, well below the 0.5 threshold.

The aligned images appear to have been processed correctly at 112x112.

[attached aligned crops: jennifer_aniston.jpg_aligned, David_Schwimmer.jpg_aligned]

The original images can be found here:

https://resizing.flixster.com/-XZAfHZM39UwaGJIFWKAE8fS0ak=/v3/t/assets/30905_v9_bc.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/3/30/David_Schwimmer_2011.jpg/800px-David_Schwimmer_2011.jpg

Apologies if this is a mistake on my part.

import numpy as np
import onnxruntime as rt
import mediapipe as mp
import cv2
from skimage.transform import SimilarityTransform
from scipy.spatial.distance import cosine

# ---------------------------------------------------------------------------------------------------------------------
# INITIALIZATIONS

# Target landmark coordinates for alignment (used in training)
LANDMARKS_TARGET = np.array(
    [
        [38.2946, 51.6963],
        [73.5318, 51.5014],
        [56.0252, 71.7366],
        [41.5493, 92.3655],
        [70.7299, 92.2041],
    ],
    dtype=np.float32,
)

def compare(img_path1, img_path2):
    # Embed both images and compare the embeddings via cosine distance
    img1_embedding = infer(img_path1)
    img2_embedding = infer(img_path2)

    cos_dist = cosine(img1_embedding, img2_embedding)

    if cos_dist <= 0.5:
        print(f'{img_path1} and {img_path2} are the same')
    else:
        print(f'{img_path1} and {img_path2} are different')

    print(f'cosine distance = {cos_dist}')

def infer(img_path):
    img = cv2.imread(img_path)

    # Process the image with the face detector
    FACE_DETECTOR = mp.solutions.face_mesh.FaceMesh(
        refine_landmarks=True, min_detection_confidence=0.5, min_tracking_confidence=0.5, max_num_faces=1
    )
    result = FACE_DETECTOR.process(img)

    if result.multi_face_landmarks:
        # Select 5 Landmarks (Eye Centers, Nose Tip, Left Mouth Corner, Right Mouth Corner)
        five_landmarks = np.asarray(result.multi_face_landmarks[0].landmark)[[470, 475, 1, 57, 287]]

        # Extract the x and y coordinates of the landmarks of interest
        landmarks = np.asarray(
            [[landmark.x * img.shape[1], landmark.y * img.shape[0]] for landmark in five_landmarks]
        )

    else:
        print(f"No faces detected in {img_path}")
        exit()

    # ---------------------------------------------------------------------------------------------------------------------
    # FACE ALIGNMENT

    # Align Image with the 5 Landmarks
    tform = SimilarityTransform()
    tform.estimate(landmarks, LANDMARKS_TARGET)
    tmatrix = tform.params[0:2, :]
    img_aligned = cv2.warpAffine(img, tmatrix, (112, 112), borderValue=0.0)
    # save the aligned crop to disk
    cv2.imwrite(f"{img_path}_aligned.jpg", img_aligned)

    # ---------------------------------------------------------------------------------------------------------------------
    # FACE RECOGNITION

    # Inference face embeddings with onnxruntime
    input_image = (np.asarray([img_aligned]).astype(np.float32)).clip(0.0, 255.0).transpose(0, 3, 1, 2)
    FACE_RECOGNIZER = rt.InferenceSession("FaceTransformerOctupletLoss.onnx", providers=rt.get_available_providers())
    embedding = FACE_RECOGNIZER.run(None, {"input_image": input_image})[0][0]

    return embedding

if __name__ == "__main__":
    ds = "David_Schwimmer.jpg"
    ja = "jennifer_aniston.jpg"

    compare(ja, ds)
Martlgap commented 3 months ago

Indeed, the FaceTransformer + OctupletLoss model tends to project features closer together than, for example, ArcFace + OctupletLoss. You have to set the threshold based on your preferences. If I remember correctly, the best threshold for the LFW dataset was approximately 0.27, so you have to adjust the threshold to your needs (0.5 was just a sample value).
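
If it helps, here is a rough sketch of how a threshold could be tuned on labeled pairs from your own data - the random embeddings below are just hypothetical placeholders for real ones:

import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)
# Placeholder pairs: (embedding_a, embedding_b, same_person_label).
# Swap these for embeddings produced by the model on your own data.
pairs = [(rng.normal(size=512), rng.normal(size=512), bool(rng.integers(2)))
         for _ in range(100)]

distances = np.array([cosine(a, b) for a, b, _ in pairs])
labels = np.array([same for _, _, same in pairs])

# Sweep candidate thresholds and keep the one with the best pair accuracy
candidates = np.linspace(0.0, 1.0, 101)
accuracies = [np.mean((distances <= t) == labels) for t in candidates]
best = candidates[int(np.argmax(accuracies))]
print(f"best threshold: {best:.2f} (pair accuracy {max(accuracies):.2%})")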

Hope that helps!

markwillowtree commented 3 months ago

Thanks for the response Martin.

I'm still having trouble with the cosine distances when comparing the embeddings of high- and low-resolution images together; they're all very low, less than 0.1.

I'm going to double-check my code, but in the meantime, could you advise whether I should be doing more image pre-processing other than what's already in the code example above?

Cubis13 commented 2 months ago

I'm experiencing a similar problem. How do I choose a default threshold for face verification on real-world data?

Martlgap commented 1 month ago

Well - make sure the images have float pixel values ranging from 0 to 1, and make sure your images are stored in RGB format. cv2, for example, loads images in BGR format.
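
As a rough sketch of what I mean (the helper name is made up, and the NCHW transpose is just carried over from your snippet above - double-check both against the model card):

import cv2
import numpy as np

def to_model_input(img_aligned_bgr):
    """Convert an aligned 112x112 BGR crop (as loaded by cv2) into RGB order
    with float pixel values in [0, 1], keeping the NCHW layout from the snippet above."""
    img_rgb = cv2.cvtColor(img_aligned_bgr, cv2.COLOR_BGR2RGB)  # cv2 delivers BGR
    batch = np.asarray([img_rgb], dtype=np.float32) / 255.0     # 0-255 ints -> 0-1 floats
    return batch.transpose(0, 3, 1, 2)                          # NHWC -> NCHW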