facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Recommended values for centroids, bytes per code, and nprobe for more accurate results? #975

Closed unmeshvrije closed 4 years ago

unmeshvrije commented 5 years ago

Greetings,

I am using the following code to build the index (adapted from the tutorials). Are these values optimal for getting the most accurate results?

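    #include <cmath>
    #include <vector>
    #include <faiss/IndexFlat.h>
    #include <faiss/IndexIVFPQ.h>

    int ncentroids = int(4 * sqrt(nb));  // total # of centroids (coarse cells)
    int bytesPerCode = 4;                // PQ sub-quantizers per vector; d must be a multiple of this
    int bitsPerSubcode = 8;              // with 8 bits, each sub-quantizer contributes one byte
    if (d % 10 == 0) {
        bytesPerCode = d / 10;
    }
    faiss::IndexFlatL2 quantizer(d);     // coarse quantizer; must outlive the index
    faiss::IndexIVFPQ index(&quantizer, d, ncentroids, bytesPerCode, bitsPerSubcode);

    index.verbose = true;
    // Heap-allocated buffer: a stack array of nt * d floats can overflow the stack.
    std::vector<float> float_emb(size_t(nt) * d);
    for (long i2 = 0; i2 < nt; ++i2) {
        double *emb = E->get(i2);
        for (uint16_t j2 = 0; j2 < d; ++j2) {
            float_emb[d * i2 + j2] = (float)emb[j2];  // narrow double to float for Faiss
        }
        // This offset appears in the tutorials' synthetic data; here it shifts
        // the first coordinate of every real embedding.
        float_emb[d * i2] += i2 / 1000.;
    }
    index.train(nt, float_emb.data());
    index.add(nt, float_emb.data());
    index.nprobe = 16;                   // inverted lists visited per query

beauby commented 5 years ago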

The optimal parameters depend on your dataset and use case. If you want perfect accuracy and storage space and search time are not an issue, you should use IndexFlatL2 alone, since it performs an exact, exhaustive search.
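One way to put a number on "accuracy" is to treat the IndexFlatL2 results as ground truth and measure how many of them an approximate index also returns. The sketch below is a minimal, Faiss-independent helper; the name `recall_at_k` and its signature are assumptions made for illustration, not part of the Faiss API. It only assumes the row-major `nq * k` label layout that Faiss search results use, with -1 marking a missing result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not a Faiss function): fraction of the exact top-k
// neighbours that the approximate search also returned, averaged over queries.
// Both arrays hold nq rows of k labels in row-major order.
double recall_at_k(const std::vector<std::int64_t>& exact,
                   const std::vector<std::int64_t>& approx,
                   std::size_t nq, std::size_t k) {
    assert(exact.size() == nq * k && approx.size() == nq * k);
    if (nq == 0 || k == 0) return 0.0;
    std::size_t hits = 0;
    for (std::size_t q = 0; q < nq; ++q) {
        // IDs returned by the exact index for this query.
        std::unordered_set<std::int64_t> truth(exact.data() + q * k,
                                               exact.data() + (q + 1) * k);
        for (std::size_t j = 0; j < k; ++j) {
            std::int64_t id = approx[q * k + j];
            // A label of -1 means "no result" and never counts as a hit.
            if (id >= 0 && truth.count(id) > 0) ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(nq * k);
}
```

Running the same queries through both indexes and calling this helper for several values of nprobe gives a recall curve for the dataset at hand, which is a more reliable guide than any fixed recommended value.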

unmeshvrije commented 5 years ago

Thank you @beauby for the answer. Here is what I am trying to do: I have knowledge graph (KG) embeddings trained with the TransE method. For a KG of a university, a triple (student1, studies, ?) is a query for which I aim to predict the tail (the head is student1 and the relation is studies). The TransE model assigns an embedding (a vector of N dimensions) to every entity and relation such that, for a true triple (student1, studies, subject1), **student1** + **studies** is almost equal to **subject1**, where bold letters denote the embedding of the corresponding entity or relation.
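The tail-prediction query described above can be sketched as a nearest-neighbour lookup: form the query vector head + relation, then find the entity embedding closest to it. The code below is a minimal illustration that assumes squared L2 distance (the metric IndexFlatL2 uses) and an exhaustive scan; the function names `transe_query` and `nearest_entity` are hypothetical and do not come from the thread or from Faiss.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: builds the TransE query vector head + relation.
std::vector<float> transe_query(const std::vector<float>& head,
                                const std::vector<float>& relation) {
    assert(head.size() == relation.size());
    std::vector<float> q(head.size());
    for (std::size_t i = 0; i < head.size(); ++i) q[i] = head[i] + relation[i];
    return q;
}

// Hypothetical helper: exhaustive search over all entity embeddings.
// `entities` is row-major (n rows of d floats); returns the row index with
// the smallest squared L2 distance to `query`.
std::size_t nearest_entity(const std::vector<float>& entities, std::size_t d,
                           const std::vector<float>& query) {
    assert(d > 0 && query.size() == d);
    assert(!entities.empty() && entities.size() % d == 0);
    std::size_t best = 0;
    float best_dist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < entities.size() / d; ++i) {
        float dist = 0.0f;
        for (std::size_t j = 0; j < d; ++j) {
            float diff = entities[i * d + j] - query[j];
            dist += diff * diff;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return best;
}
```

An approximate index replaces only the scan in `nearest_entity`; the query construction stays the same, so the exhaustive version can serve as the reference when judging an approximate one.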

I want to compare the performance of TransE with approximate nearest neighbour search, and thus want to use parameters that give the most accurate results. I was not sure whether changing the number of centroids or nprobe would affect accuracy.
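To illustrate why nprobe matters, here is a toy inverted-file index written from scratch. It is not Faiss code: `ToyIVF` is a hypothetical struct, the centroids are supplied by the caller instead of being trained with k-means, and the vectors are stored uncompressed (no product quantization). It shows only the mechanism: a query is compared against the vectors in its nprobe closest cells, so a true neighbour that was assigned to a different cell is missed when nprobe is too small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

// Squared L2 distance between two d-dimensional vectors.
float sq_l2(const float* a, const float* b, std::size_t d) {
    float s = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        float diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// Toy inverted-file index (illustration only, not the Faiss implementation).
struct ToyIVF {
    std::size_t d;
    std::vector<float> centroids;                  // nlist x d, row-major
    std::vector<float> data;                       // n x d, row-major
    std::vector<std::vector<std::int64_t>> lists;  // vector IDs per cell

    ToyIVF(std::size_t dim, std::vector<float> cents)
        : d(dim), centroids(std::move(cents)) {
        assert(d > 0 && !centroids.empty() && centroids.size() % d == 0);
        lists.resize(centroids.size() / d);
    }

    // Cell indices ordered from the closest centroid to the farthest.
    std::vector<std::size_t> ranked_cells(const float* x) const {
        std::vector<std::size_t> order(lists.size());
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return sq_l2(x, &centroids[a * d], d) < sq_l2(x, &centroids[b * d], d);
        });
        return order;
    }

    // Stores x in the cell of its nearest centroid; IDs are sequential.
    void add(const float* x) {
        std::int64_t id = static_cast<std::int64_t>(data.size() / d);
        data.insert(data.end(), x, x + d);
        lists[ranked_cells(x).front()].push_back(id);
    }

    // Scans only the nprobe closest cells; returns the best ID found, or -1.
    std::int64_t search(const float* q, std::size_t nprobe) const {
        std::vector<std::size_t> order = ranked_cells(q);
        nprobe = std::min(nprobe, order.size());
        std::int64_t best = -1;
        float best_dist = std::numeric_limits<float>::max();
        for (std::size_t p = 0; p < nprobe; ++p) {
            for (std::int64_t id : lists[order[p]]) {
                float dist = sq_l2(q, &data[static_cast<std::size_t>(id) * d], d);
                if (dist < best_dist) {
                    best_dist = dist;
                    best = id;
                }
            }
        }
        return best;
    }
};
```

In this toy version, probing every cell makes the search exhaustive and therefore exact. That does not carry over fully to IndexIVFPQ: its distances are computed on compressed codes, so some error can remain even when nprobe equals the number of centroids.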

Please let me know if the use case is still unclear.

mdouze commented 5 years ago

See https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

mdouze commented 4 years ago

no activity, closing.