facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

IndexIVFFlat on 2M embeddings from FaceNet is giving poor results #1001

Closed: ljstrnadiii closed this issue 5 years ago

ljstrnadiii commented 5 years ago

Summary

I am using embeddings computed with the popular FaceNet model. I have calculated about 2.5M embeddings with d=512 and am comparing the performance of IndexIVFFlat against the simple Flat index. Even with large k, the recall stays flat.

Running on: GPU

Interface: Python
Reproduction instructions

import time

import numpy as np
import faiss

# X is the full embedding matrix (memmap of ~2.5M x 512 FaceNet vectors)
xb = np.ascontiguousarray(X[::2][:2*1000*1000])  # database vectors (even rows)
xq = np.ascontiguousarray(X[1::2][:10*1000])     # query vectors (odd rows)
d = xq.shape[1]
k = 1000  # assumed; the evaluation below looks at recall up to k=1000

# compute gt with an exact Flat index on the GPU
flat_index = faiss.index_factory(d, "Flat")
res = faiss.StandardGpuResources()
flat_index = faiss.index_cpu_to_gpu(res, 0, flat_index)
flat_index.train(xb)  # no-op for a Flat index
flat_index.add(xb)
D, gt = flat_index.search(xq, k)

# try an approximate method: IVF with flat (non-compressed) storage
# <n_centroids> is a placeholder for the number of IVF lists, e.g. "IVF4096,Flat"
index = faiss.index_factory(d, "IVF<n_centroids>,Flat")
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, index)
index.train(xb)  # trains the coarse quantizer (k-means over the centroids)
index.add(xb)

def evaluate(index, xq, gt, k):
    """Return (search time in ms per query, recall@i of the true nearest neighbor)."""
    nq = xq.shape[0]
    t0 = time.time()
    D, I = index.search(xq, k)  # noqa: E741
    t1 = time.time()
    recalls = {}
    i = 1
    while i <= k:
        # fraction of queries whose true nearest neighbor appears in the top i results
        recalls[i] = (I[:, :i] == gt[:, :1]).sum() / float(nq)
        i *= 10

    return (t1 - t0) * 1000.0 / nq, recalls

evaluate(flat_index, xq, gt, 1000)
>>
(2.1849388122558593, 
 {1: 0.99850000000000005, 
  10: 1.0, 
  100: 1.0, 
  1000: 1.0})

evaluate(index, xq, gt, 1000)

>>
(0.038869810104370114,
 {1: 0.35210000000000002,
  10: 0.35289999999999999,
  100: 0.35289999999999999,
  1000: 0.35299999999999998})

Notice how the recall is not increasing as k increases.

I have tried many values of n_centroids between 4096 and 20000 and I do not see any improvement.

Questions:

  1. Is it possible that the data distribution is not conducive to this method?

  2. Am I possibly splitting my query and training set incorrectly?

wickedfoo commented 5 years ago

You are only looking in a single IVF list, as nprobe is 1 by default.

Increase nprobe rather than k.
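
For example, something along these lines (the nprobe values are just illustrative; on a CPU IndexIVF you can also set index.nprobe directly):

for nprobe in [1, 4, 16, 64, 256]:
    # GpuParameterSpace forwards the parameter to an index that was moved to
    # the GPU with index_cpu_to_gpu
    faiss.GpuParameterSpace().set_index_parameter(index, "nprobe", nprobe)
    t_ms, recalls = evaluate(index, xq, gt, 1000)
    print("nprobe=%d: %.3f ms/query, recall@1=%.4f" % (nprobe, t_ms, recalls[1]))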

ljstrnadiii commented 5 years ago

of course, thank you very much!

I did want to ask about the typical strategy for splitting a dataset. In some examples I have noticed three arrays, xt, xb and xq: one for training, one for adding to the index, and the last for querying (the equivalent of a test set). I am not sure what the typical split is in this field. Do you usually train on xt, add [xt, xb] to the index (or does xb already contain xt?), and then search with xq? It is hard to tell how you have constructed your memmap files. What proportion of the whole dataset do xq, xt and xb typically make up?
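
To make the question concrete, here is roughly the split I have in mind (the sizes are placeholders, not anything I have measured):

import numpy as np

# X: full embedding matrix of shape (N, d)
perm = np.random.permutation(len(X))

xq = np.ascontiguousarray(X[perm[:10_000]])   # held-out queries / "test set"
xb = np.ascontiguousarray(X[perm[10_000:]])   # database vectors added to the index
xt = np.ascontiguousarray(xb[:250_000])       # training sample drawn from xb

index.train(xt)  # train only on the sample
index.add(xb)    # add everything, including the rows used for training
# then search with xq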

thanks for such a killer project!