Closed phelogges closed 4 months ago
IndexIVFFlat with METRIC_INNER_PRODUCT works reasonably well when database vectors have nearby norms. Otherwise your mileage may vary.
I have reproduced the problem with my data; all embeddings are L2-normalized.
500 embeddings for training, 1870 embeddings inserted, and 1 probe to search. Only 1 of the 1870 embeddings comes from the same person as the probe (a different face image of that person); the others are all unique identities. These two embeddings have inner product 0.569969, much higher than any other pair.
Under nlist = 5, nprobe = 3, threshold = 0.1:
./ivf_searcher_test training_data.txt feature_data.txt 5 3 0.1
Got 500 training data
Got 1870 ids
Inserted 1870/1870 new features
Got 4 candidate
GT results 16, index search results 4
Print most top 10 results if possible
GT results top 10:
id: 1869 score: 0.569969
id: 685 score: 0.174677
id: 1231 score: 0.161352
id: 1324 score: 0.153520
id: 245 score: 0.134742
id: 175 score: 0.132821
id: 927 score: 0.123938
id: 191 score: 0.122099
id: 776 score: 0.119500
id: 809 score: 0.119447
Index search results top 4:
id: 927 score: 0.123938
id: 776 score: 0.119500
id: 809 score: 0.119447
id: 164 score: 0.116973
Under nlist = 1, nprobe = 1, threshold = 0.1:
./ivf_searcher_test training_data.txt feature_data.txt 1 1 0.1
Got 500 training data
Got 1870 ids
Inserted 1870/1870 new features
Got 16 candidate
GT results 16, index search results 16
Print most top 10 results if possible
GT results top 10:
id: 1869 score: 0.569969
id: 685 score: 0.174677
id: 1231 score: 0.161352
id: 1324 score: 0.153520
id: 245 score: 0.134742
id: 175 score: 0.132821
id: 927 score: 0.123938
id: 191 score: 0.122099
id: 776 score: 0.119500
id: 809 score: 0.119447
Index search results top 10:
id: 1869 score: 0.569969
id: 685 score: 0.174677
id: 1231 score: 0.161352
id: 1324 score: 0.153520
id: 245 score: 0.134742
id: 175 score: 0.132821
id: 927 score: 0.123938
id: 191 score: 0.122099
id: 776 score: 0.119500
id: 809 score: 0.119447
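The gap between the GT count and the index count above can be sketched with a small numpy simulation. This is not Faiss itself — the centroid "training" here is just random database rows standing in for the k-means quantizer, and all names are illustrative — but it shows the mechanism: an IVF range search can only return hits that live in the probed lists, so it yields a subset of the exhaustive result.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_db, nlist, nprobe, thresh = 64, 1870, 5, 3, 0.1

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

db = normalize(rng.standard_normal((n_db, d)))
query = normalize(rng.standard_normal(d))

# crude stand-in for Faiss's trained coarse quantizer:
# random database rows used as centroids
centroids = db[rng.choice(n_db, nlist, replace=False)]

scores = db @ query                        # exact inner products
gt = np.flatnonzero(scores > thresh)       # flat (exhaustive) range search

assign = (db @ centroids.T).argmax(axis=1)              # list each vector lands in
probe = np.argsort(query @ centroids.T)[::-1][:nprobe]  # lists the query visits
ivf = np.flatnonzero((scores > thresh) & np.isin(assign, probe))

print(f"GT results {gt.size}, index search results {ivf.size}")
```

With nprobe = nlist the two sets coincide; with nprobe < nlist the IVF count can only shrink, which matches the 16-vs-4 output above.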
Check this link to get the code and data.
Summary
I'm working on face recognition and plan to use Faiss as my embedding search library. I want to use IndexIVFFlat (with inner product) for faster search.
However, in my experience IndexIVFFlat often fails to retrieve the correct embedding.
Platform
OS:
Faiss version: 1.7.4
Installed from: source
Faiss compilation options: with MKL compile flag
Running on:
Interface:
Reproduction instructions
Please forgive me if I'm too verbose :( I'm working on face recognition and plan to use Faiss as my embedding search library. I want to use IndexIVFFlat (with inner product) for faster search, but in my experience it often fails to retrieve the correct embedding.

For example, I added an embedding of one person's face into the index, then used another embedding of the same person's face as the probe. The inner product of these two embeddings was 0.625, which is quite high (each embedding is a 512-dim unit vector). But when I searched the index with this probe, the highest-scoring result was only about 0.1 and came from a different person — the index returned a wrong result.

In detail, I used 1k identities' embeddings as training data and set nlist = 10, nprobe = 5. After training, I added another 2k identities into the index, plus the new embedding mentioned above, so the index held 2001 identities' embeddings. I then searched with the probe via range_search with a radius of 0.35, and the index gave a wrong result. But when I set nlist = 1 and nprobe = 1, the index gave the correct result. I believe that with this setting IndexIVFFlat behaves like IndexFlatIP, since there is only one cluster and it must be chosen as the candidate.

Here is my thinking: IndexIVFFlat uses a clustering algorithm to shrink the search space, but face embeddings are not well suited to clustering. Face embeddings are only trained to have large inter-class distance (small inner product) and small intra-class distance (large inner product). So when we train the index, it learns several cluster centers; but when a new identity's embedding is added, that embedding has almost the same distance to every center — they are all inter-class — say 0.1, 0.11, 0.12, ..., and the index assigns it to the 0.12 cluster. When we later search with another embedding of the same identity, the probe's distances to the centers may come out 0.12, 0.11, 0.1, ..., and since the index only visits the closest nprobe clusters, the added embedding may not be in any candidate cluster. In other words, the clustering assumption breaks down for face embeddings.

So here are my questions:
1. For IndexIVFFlat (metric inner product) with nlist = 1 and nprobe = 1, will it behave like IndexFlatIP, as I described above?
2. Given that face embeddings only guarantee large inter-class and small intra-class distances, does IndexIVFFlat (metric inner product) simply not work for face embeddings?
3. If the answer to Q2 is yes, is there another search method for face retrieval that is faster than flat search?
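The "almost the same distance to every center" intuition above is easy to check numerically. The sketch below (numpy only, not Faiss; the centroids are hypothetical random unit vectors standing in for a quantizer trained on unrelated identities) generates two noisy embeddings of one new identity: their mutual inner product is high, while their inner products with every centroid are small and tightly bunched, so which list wins is dominated by noise and the two embeddings can easily land in, or probe, different lists.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nlist = 512, 10

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# hypothetical centroids, standing in for a quantizer
# trained on embeddings of *other* identities
centroids = normalize(rng.standard_normal((nlist, d)))

# two embeddings of one new identity: high mutual inner product,
# but unrelated to every centroid
base = normalize(rng.standard_normal(d))
emb_a = normalize(base + 0.02 * rng.standard_normal(d))
emb_b = normalize(base + 0.02 * rng.standard_normal(d))

print("IP(a, b):", round(float(emb_a @ emb_b), 3))
print("IP(a, centroids):", np.round(emb_a @ centroids.T, 3))
print("IP(b, centroids):", np.round(emb_b @ centroids.T, 3))
```

Here the intra-class inner product is large while all centroid inner products hover near zero, which is exactly the regime where the coarse quantizer's ranking carries little signal.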