facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

I found an issue during the IndexIVFPQ query process in version 1.7.2, and I'm not sure if it's a bug. I hope you can help me solve it. #3019

Open yangshubin2023 opened 1 year ago

yangshubin2023 commented 1 year ago

Summary

//.........

int main() {
    int d = 64;      // dimension
    int nb = 100000; // database size
    int nq = 10000;  // nb of queries

    std::mt19937 rng;
    std::uniform_real_distribution<> distrib;

    float* xb = new float[d * nb];
    float* xq = new float[d * nq];

    // .......

    int nlist = 100;
    int k = 4;
    int m = 8;                       // bytes per vector
    faiss::IndexFlatIP quantizer(d); // the other index
    faiss::IndexIVFPQ index(&quantizer, d, nlist, m, 8, faiss::METRIC_INNER_PRODUCT); // 8 bits per sub-quantizer code

    index.train(nb, xb);
    index.add(nb, xb);

    { // sanity check
        idx_t* I = new idx_t[k * 5];
        float* D = new float[k * 5];

        index.search(5, xb, k, D, I);

        //........

        delete[] I;
        delete[] D;
    }

    delete[] xb;
    delete[] xq;

    return 0;
}

Platform

Operating System: Ubuntu 20.04.3 LTS
Kernel: Linux 5.4.0-122-generic
Architecture: x86-64

Running on:

Interface:

Reproduction instructions

After reading the Faiss code, I found that IndexIVFPQ works on residuals during both training and adding. The residual of each vector with respect to its coarse centroid is used to train the fine-grained (PQ) centroids, so the fine-grained centroids themselves also live in residual space.
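To make the residual step concrete, here is a minimal sketch in plain C++ that does not use Faiss. The helper names `assign_coarse_ip` and `compute_residual` are hypothetical and only illustrate the idea: pick a coarse centroid (by inner product here, because the quantizer in the snippet above is an IndexFlatIP), then subtract it from the vector. It is an assumption about the general scheme, not a copy of the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Vec = std::vector<float>;

// Hypothetical helper, not a Faiss function: return the index of the coarse
// centroid with the largest inner product with x.
std::size_t assign_coarse_ip(const Vec& x, const std::vector<Vec>& centroids) {
    assert(!centroids.empty());
    std::size_t best = 0;
    float best_ip =
            std::inner_product(x.begin(), x.end(), centroids[0].begin(), 0.0f);
    for (std::size_t i = 1; i < centroids.size(); ++i) {
        float ip = std::inner_product(
                x.begin(), x.end(), centroids[i].begin(), 0.0f);
        if (ip > best_ip) {
            best_ip = ip;
            best = i;
        }
    }
    return best;
}

// Residual r = x - c. In this sketch, r is what the product quantizer would
// be trained on and what would be encoded when a vector is added.
Vec compute_residual(const Vec& x, const Vec& c) {
    assert(x.size() == c.size());
    Vec r(x.size());
    for (std::size_t j = 0; j < x.size(); ++j) {
        r[j] = x[j] - c[j];
    }
    return r;
}
```

For example, with centroids (1, 0) and (0, 1) and x = (0.5, 2), the second centroid is selected and the residual is (0.5, 1).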

However, during the query process, the query vector x does not appear to be converted to a residual before it is compared with the fine-grained centroids to compute distances. Is this correct?

In my understanding, the query vector x should also be converted to its residual x' (x minus the coarse centroid), and only a distance between x' and the fine-grained centroids makes sense.

Comparing the original vector x directly with fine-grained centroids that were trained on residual data confuses me.
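The question above can be checked with a small, Faiss-free C++ sketch. Write a stored vector as c + q, where c is its coarse centroid and q is the decoded PQ residual. For L2, comparing the query residual x - c with q gives the same value as comparing x with c + q. For inner product (the metric used in the snippet above), the score splits into a coarse term and a term computed from the raw query, so using the un-residualized query against the PQ centroids can still be consistent as long as the coarse term is added. All function names below are hypothetical, and whether version 1.7.2 is organized exactly this way is an assumption that should be checked against the source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t j = 0; j < a.size(); ++j) s += a[j] * b[j];
    return s;
}

float l2_sqr(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t j = 0; j < a.size(); ++j) {
        float diff = a[j] - b[j];
        s += diff * diff;
    }
    return s;
}

Vec add(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec r(a.size());
    for (std::size_t j = 0; j < a.size(); ++j) r[j] = a[j] + b[j];
    return r;
}

Vec sub(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    Vec r(a.size());
    for (std::size_t j = 0; j < a.size(); ++j) r[j] = a[j] - b[j];
    return r;
}

// L2: distance from the query to the reconstructed vector c + q ...
float l2_on_original(const Vec& x, const Vec& c, const Vec& q) {
    return l2_sqr(x, add(c, q));
}

// ... equals the distance from the query residual x - c to q.
float l2_on_residual(const Vec& x, const Vec& c, const Vec& q) {
    return l2_sqr(sub(x, c), q);
}

// Inner product with the reconstructed vector c + q ...
float ip_on_original(const Vec& x, const Vec& c, const Vec& q) {
    return dot(x, add(c, q));
}

// ... equals a coarse term <x, c> plus <x, q> with the raw query x.
float ip_split(const Vec& x, const Vec& c, const Vec& q) {
    return dot(x, c) + dot(x, q);
}
```

With x = (3, 1), c = (1, 1) and q = (0.5, -0.5), both L2 functions return 2.5 and both inner-product functions return 5.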

I hope you can help me with this question. Thank you~

mdouze commented 1 year ago

You are correct that the query vector is also compared based on the residual vectors. What makes you think this is not the case?