facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Does faiss IndexIVFFlat work for face searching #3105

Closed: phelogges closed this issue 4 months ago

phelogges commented 1 year ago

Summary

I'm working on face recognition and plan to use Faiss as my embedding search library. I want to use IndexIVFFlat (with inner product) for faster search.

But in my experience, IndexIVFFlat often fails to retrieve the correct embedding.

Platform

OS:

Faiss version: 1.7.4

Installed from: source

Faiss compilation options: with MKL compile flag

Running on:

Interface:

Reproduction instructions

Please forgive me if I'm too verbose :( I'm working on face recognition and plan to use Faiss as my embedding search library. I want to use IndexIVFFlat (with inner product) for faster search, but in my experience it often fails to retrieve the correct embedding.

For example, I added an embedding of one person's face into the index, then used an embedding of another face of the same person as the probe. The inner product of these two embeddings is 0.625, which is quite high (each embedding is a 512-dim unit vector). But when I searched the index with this probe, the highest score it returned was only about 0.1, and it came from a different person, so the index gave a wrong result.

In detail: I used 1k IDs' embeddings as training data and set nlist = 10, nprobe = 5. At search time I added another 2k new IDs into the index, plus the new embedding mentioned above, so the index held 2001 IDs' embeddings. Then I searched the index with the probe mentioned above using range_search with threshold 0.35, and the index gave a wrong result. But when I set nlist = 1 and nprobe = 1, the index gave the correct result. I think in this setting IndexIVFFlat works like IndexFlatIP, because there is only one cluster and it must be chosen as the candidate.

Here is my thinking: IndexIVFFlat uses clustering to shrink the search space, but face embeddings are not well suited to clustering. Face embeddings only guarantee a larger inter-class distance (smaller inner product) and a smaller intra-class distance (larger inner product). So when we train the index it learns several cluster centers, but when a new ID's embedding is added, that embedding has almost the same distance to every center, since they are all inter-class, say 0.1, 0.11, 0.12, ..., and the index assigns it to the 0.12 cluster. At search time, another embedding of the same ID may have center distances of 0.12, 0.11, 0.1, ..., and the index probes only the closest clusters, so the cluster holding the added embedding may not be among the candidates. In other words, clustering does not work for face embeddings.

So here are my questions (see also the sketch below):

1. For IndexIVFFlat (metric inner product), with nlist = 1 and nprobe = 1, will it behave like IndexFlatIP, as I argued above?
2. Given that face embeddings only guarantee larger inter-class and smaller intra-class distances, does IndexIVFFlat (metric inner product) not work for face embeddings?
3. If the answer to Q2 is yes, is there another solution for face search that is faster than flat search?
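
Here is a minimal, self-contained sketch of the setup described above, with random unit vectors standing in for real face embeddings and one planted "same person" vector. The parameters d = 512, nlist = 10, nprobe = 5, and threshold 0.35 follow the post; everything else is made up for illustration:

```python
import numpy as np
import faiss

d, nlist, nprobe = 512, 10, 5
rng = np.random.default_rng(0)

def unit(x):
    # L2-normalize rows so inner product equals cosine similarity
    return x / np.linalg.norm(x, axis=1, keepdims=True)

xt = unit(rng.standard_normal((1000, d), dtype=np.float32))  # training embeddings
xb = unit(rng.standard_normal((2001, d), dtype=np.float32))  # database embeddings
xq = unit(rng.standard_normal((1, d), dtype=np.float32))     # probe embedding

# Plant a hypothetical "same person" vector: its inner product with the
# probe is ~0.6, while random 512-d unit vectors score near 0.
noise = unit(rng.standard_normal((1, d), dtype=np.float32))
xb[-1] = unit(0.6 * xq + 0.8 * noise)[0]

quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(xt)
ivf.add(xb)
ivf.nprobe = nprobe

flat = faiss.IndexFlatIP(d)  # brute-force ground truth
flat.add(xb)

# range_search returns (lims, distances, ids)
_, _, ids_ivf = ivf.range_search(xq, 0.35)
_, _, ids_gt = flat.range_search(xq, 0.35)
print("IVF hits:", set(ids_ivf), "GT hits:", set(ids_gt))
```

Whether the IVF run finds the planted vector (id 2000) depends on which inverted list it was assigned to and whether that list is among the nprobe lists probed, which is exactly the effect described above.
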
mdouze commented 1 year ago

IndexIVFFlat with METRIC_INNER_PRODUCT works reasonably well when database vectors have nearby norms. Otherwise your mileage may vary.
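
One quick way to test this condition (a sketch, not from the thread; xb stands for your (n, d) float32 database matrix):

```python
import numpy as np

# If the database vectors are L2-normalized, min/mean/max should all be ~1.0;
# a wide spread here is what makes IVF + inner product unreliable.
norms = np.linalg.norm(xb, axis=1)
print("norm min/mean/max:", norms.min(), norms.mean(), norms.max())
```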

phelogges commented 1 year ago

> IndexIVFFlat with METRIC_INNER_PRODUCT works reasonably well when database vectors have nearby norms. Otherwise your mileage may vary.

I have reproduced the problem with my data; all embeddings are L2-normalized.

500 embeddings for training, 1870 embeddings inserted, and 1 probe to search. Only 1 of the 1870 embeddings comes from the same person as the probe (a different face image); all the others are unique IDs. These two embeddings have an inner product of 0.569969, much greater than any of the others.

Under nlist = 5, nprobe = 3, threshold = 0.1:

./ivf_searcher_test training_data.txt feature_data.txt 5 3 0.1
Got 500 training data
Got 1870 ids
Inserted 1870/1870 new features
Got 4 candidate
GT results 16, index search results 4
Print most top 10 results if possible
GT results top 10:
    id: 1869    score: 0.569969
    id: 685 score: 0.174677
    id: 1231    score: 0.161352
    id: 1324    score: 0.153520
    id: 245 score: 0.134742
    id: 175 score: 0.132821
    id: 927 score: 0.123938
    id: 191 score: 0.122099
    id: 776 score: 0.119500
    id: 809 score: 0.119447
Index search results top 4:
    id: 927 score: 0.123938
    id: 776 score: 0.119500
    id: 809 score: 0.119447
    id: 164 score: 0.116973

Under nlist = 1, nprobe = 1, threshold = 0.1:

./ivf_searcher_test training_data.txt feature_data.txt 1 1 0.1
Got 500 training data
Got 1870 ids
Inserted 1870/1870 new features
Got 16 candidate
GT results 16, index search results 16
Print most top 10 results if possible
GT results top 10:
    id: 1869    score: 0.569969
    id: 685 score: 0.174677
    id: 1231    score: 0.161352
    id: 1324    score: 0.153520
    id: 245 score: 0.134742
    id: 175 score: 0.132821
    id: 927 score: 0.123938
    id: 191 score: 0.122099
    id: 776 score: 0.119500
    id: 809 score: 0.119447
Index search results top 10:
    id: 1869    score: 0.569969
    id: 685 score: 0.174677
    id: 1231    score: 0.161352
    id: 1324    score: 0.153520
    id: 245 score: 0.134742
    id: 175 score: 0.132821
    id: 927 score: 0.123938
    id: 191 score: 0.122099
    id: 776 score: 0.119500
    id: 809 score: 0.119447
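
As a follow-up check, here is a sketch that reuses the ivf, flat, and xq objects from the snippet earlier in the thread and sweeps nprobe at this test's threshold of 0.1; recall should reach the flat result as nprobe approaches nlist:

```python
# Compare IVF range_search hits against the flat ground truth at each
# nprobe setting; range_search returns (lims, distances, ids).
gt = set(flat.range_search(xq, 0.1)[2])
for p in range(1, ivf.nlist + 1):
    ivf.nprobe = p
    hits = set(ivf.range_search(xq, 0.1)[2])
    print(f"nprobe={p}: found {len(hits & gt)}/{len(gt)} ground-truth hits")
```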

Check this link to get the code and data.