criteo / autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0
801 stars 74 forks source link

[Bug?] Index retrieval is not self-consistent. #176

Open ysig opened 10 months ago

ysig commented 10 months ago

Hi,

I had very bad scores using the autofaiss build_index method and except that the bug could have been caused from the fact that when you build_index you don't save the fname that corresponds to each index ... I also noticed something utterly strange. In particular, the index is not self-consistent. That means if I use the same set of embeddings that I used to build an index to do retrieval I don't always get the same point (yet the reconstruction error is zero when building the index).

import numpy as np
from autofaiss import build_index

data_path = 'images'
features_dir = 'features'

files = [f for f in os.listdir(data_path)]
full = [join(path, f) for f in os.listdir(data_path)]
index = build_index(embeddings=np.concatenate([np.load(f) for f in full], axis=0), nb_cores=12, save_on_disk=False)[0]

how_many = 0
for f in os.listdir(features_dir):
    query_vector = np.load(join(features_dir, f))
    distances, indices = index.search(query_vector, 1)
    distances, indices = np.squeeze(distances), np.squeeze(indices)
    how_many += int(files[indices[0]] != f)

print(which_many)

Which outputs for my data around 178 (out of 1000000 points).

Is this a bug - or do I need different input parameters to make this work properly?

I use faiss==1.7.4 and autofaiss==2.15.8.

Thank you, ysig

rom1504 commented 10 months ago

What is fname ? What do you call self-consistent ?

Approximate knn index do not have a top1 accuracy of 100%. If you need that, use an exact index instead. With only 1M points it can make sense if you don't need too much speed.

If you do need speed, then you'll have to live with less than 100% of accuracy and indeed you can tweak the option to pick something that works for you (read the readme for this)

On Tue, Oct 24, 2023 at 12:44 PM ysig @.***> wrote:

Hi,

I had very bad scores using the autofaiss build_index method and except that the bug could have been caused from the fact that when you build_index you don't save the fname that corresponds to each index ... I also noticed something utterly strange. In particular, the index is not self-consistent. That means if I use the same set of embeddings that I used to build an index to do retrieval I don't always get the same point (yet the reconstruction error is zero when building the index).

import numpy as npfrom autofaiss import build_index data_path = 'images'features_dir = 'features' files = [f for f in os.listdir(data_path)]full = [join(path, f) for f in os.listdir(data_path)]index = build_index(embeddings=np.concatenate([np.load(f) for f in full], axis=0), nb_cores=12, save_on_disk=False)[0] how_many = 0for f in os.listdir(features_dir): query_vector = np.load(join(features_dir, f)) neighbors = [] distances, indices = index.search(query_vector, 1) distances, indices = np.squeeze(distances), np.squeeze(indices) how_many += int(files[indices[0]] != f) print(which_many)

Which outputs for my data around 178 (out of 1000000 points).

Is this a bug - or do I need different input parameters to make this work properly?

I use faiss==1.7.4 and autofaiss==2.15.8.

Thank you, ysig

— Reply to this email directly, view it on GitHub https://github.com/criteo/autofaiss/issues/176, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAR437QJNKRVC47NJDDX2B3YA6LZDAVCNFSM6AAAAAA6NPQR66VHI2DSMVQWIX3LMV43ASLTON2WKOZRHE2TQOJYGEZTKNQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ysig commented 10 months ago

Hi @rom1504 and thanks for your answer.

fname is file name. self-consistent means top-1 accuracy = 100%

Well for nearest neighbors I don't think it makes sense to not have 100% accuracy otherwise I'm not sure it make sense to use it? I need reasonable speed, I have 5M data points. What should I use?

rom1504 commented 10 months ago

there is no need to save file names. The files are read in alphabetical order

"Well for nearest neighbors I don't think it makes sense to not have 100% accuracy" it does make a lot of sense. Almost no use case requires 100% accuracy. Usually a lower accuracy for the knn part does not even impact the overall quality of the system.

What is reasonable speed? query speed of 1min 1s 1ms ?

if you want perfect knn accuracy (note that says nothing on your system quality) you can use a faiss flat index

I encourage you to read some more on approximate knn first, there are some links in the readme

ysig commented 10 months ago

@rom1504 Thanks again for your detailed answer!

There is no need to save file names. The files are read in alphabetical order:

Now my purpose/scope it to measure embedding knn accuracy for a baseline on a certain task. It's for a research paper and the result needs to be scientifically precise.

Flat index will be very slow however:

I have around 500,000 test points and around 5M train points split in 5 indexes so given that my budget then is B = 5* T * 500000 if I set my budget B to one day, i.e. B = 86400 seconds then T = 86400 / (5*500000) = 0.03456 s, i.e. I need queries that cost <= 0.03456 + ε seconds.

Could you propose something?

Thank you,