ageitgey / face_recognition

The world's simplest facial recognition api for Python and the command line
MIT License

Best way to cluster unknown faces? #359

Closed expenses closed 6 years ago

expenses commented 6 years ago

Hi, I'm looking for the best way to cluster a lot of unknown faces. The simple solution would be to do something like:

import face_recognition

faces = []

def add_faces(filename):
    image = face_recognition.load_image_file(filename)
    # Assumes the image contains at least one face; face_encodings() returns
    # an empty list when no face is found, so [0] would raise an IndexError.
    encoding = face_recognition.face_encodings(image)[0]

    for i, existing_encoding in enumerate(faces):
        if face_recognition.compare_faces([existing_encoding], encoding)[0]:
            return i

    faces.append(encoding)
    return len(faces) - 1

But I could see that having performance problems. Do you have any other ideas or suggestions?

ageitgey commented 6 years ago

Each face encoding is a point in a 128-dimension space that represents that face's position in the space of all possible faces. You can cluster those points directly using any standard clustering algorithm such as K-Nearest-Neighbors.

There's an example in the examples folder of training a KNN classifier to do that. Here's the link: https://github.com/ageitgey/face_recognition/blob/master/examples/face_recognition_knn.py
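For example, you could also run scikit-learn's DBSCAN directly on the encodings; it picks the number of clusters on its own. A rough, untested sketch (assumes scikit-learn is installed and `faces` is your list of 128-d encodings):

import numpy as np
from sklearn.cluster import DBSCAN

# Stack the 128-d encodings into an (n_faces, 128) array.
encodings = np.array(faces)

# eps=0.6 mirrors the default compare_faces tolerance; min_samples=1 means
# every face lands in some cluster (unmatched faces become singletons).
clt = DBSCAN(metric="euclidean", eps=0.6, min_samples=1)
clt.fit(encodings)

labels = clt.labels_  # labels[i] is the cluster id of faces[i]
print("estimated unique people:", len(set(labels)))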

expenses commented 6 years ago

@ageitgey Thank you, I should have looked more into actual clustering algorithms. I should clarify that I'm looking for one that is incremental, so I can add encodings over time, and that determines the number of clusters while it's running.

ageitgey commented 6 years ago

First just a side note about your sample code:

    for i, existing_encoding in enumerate(faces):
        if face_recognition.compare_faces([existing_encoding], encoding)[0]:
            return i

That could potentially be much faster written like this:

matches = face_recognition.compare_faces(faces, encoding)  # one vectorized pass over all known faces
first_match_index = matches.index(True) if True in matches else None  # None when nothing matches

... because that would take advantage of your CPU's parallel processing capabilities via numpy's vectorized operations behind the scenes. Doing it in a for loop will be a lot slower in practice on most CPUs, unless you have tons of duplicate faces in your dataset (where the early return might help).

Now back to your real question:

The incremental part is tricky. I can think of some esoteric ideas to try, but I don't have any good working examples of clustering unknown faces in an incremental way.

dlib itself provides an implementation of the Chinese Whispers algorithm, which can be used to cluster unknown faces fairly quickly using a graph data structure. In the graph, each face is a node, and you connect each node to any other node whose Euclidean distance to it is less than 0.6 (or whatever threshold you choose).
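Recent dlib releases expose the batch version of this to Python as dlib.chinese_whispers_clustering, so something like the following should work (rough sketch, again assuming `faces` is your list of 128-d encodings):

import dlib

# dlib wants each descriptor as a dlib.vector rather than a numpy array.
descriptors = [dlib.vector(enc.tolist()) for enc in faces]

# The second argument is the distance threshold for linking two faces.
labels = dlib.chinese_whispers_clustering(descriptors, 0.6)

print("estimated unique people:", len(set(labels)))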

Unfortunately, dlib doesn't expose the underlying graph data structure from the Python API, so you can't add more nodes (faces) to it on the fly. You can do it from dlib's C++ API, though. Alternatively, you could use a different Python implementation of Chinese Whispers that lets you access the graph and add nodes on the fly, or implement your own version as sketched below.
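If you did roll your own, the core is small. A rough, untested sketch of the incremental idea (plain Python, all names are mine): link each new encoding to every stored one within the threshold, then re-run a few label-propagation passes:

import random
from collections import Counter

import numpy as np

THRESHOLD = 0.6

encodings = []  # 128-d numpy arrays; list index doubles as node id
edges = []      # edges[i] = ids of nodes connected to node i
labels = []     # labels[i] = current cluster label of node i

def add_face(encoding):
    """Insert one face into the graph and refresh the cluster labels."""
    node = len(encodings)
    neighbors = [i for i, e in enumerate(encodings)
                 if np.linalg.norm(e - encoding) < THRESHOLD]
    encodings.append(encoding)
    edges.append(neighbors)
    for n in neighbors:
        edges[n].append(node)
    labels.append(node)  # a new face starts as its own cluster
    chinese_whispers(iterations=5)
    return labels[node]

def chinese_whispers(iterations):
    """A few rounds of label propagation over the whole graph."""
    nodes = list(range(len(encodings)))
    for _ in range(iterations):
        random.shuffle(nodes)
        for node in nodes:
            if edges[node]:
                # Adopt the most common label among this node's neighbors.
                counts = Counter(labels[n] for n in edges[node])
                labels[node] = counts.most_common(1)[0][0]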

However, at the end of the day I don't think that will actually be faster than your current approach, because you still have to compare the new face against every existing node before you can add it to the graph. But it would probably do a better job of determining the number of actually unique people in the dataset, instead of just assuming the first match you find is the best match.

Sorry I don't have a more solid answer.

expenses commented 6 years ago

No, that's solid enough, thanks!

ageitgey commented 6 years ago

Sure thing!