Closed: ljstrnadiii closed this issue 5 years ago
You are only looking in a single IVF list, as nprobe is 1 by default. Increase nprobe rather than k.
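For concreteness, here is a minimal sketch (synthetic data and toy sizes, not the code from this issue) of sweeping nprobe on an IndexIVFFlat and measuring recall against a brute-force IndexFlatL2 ground truth:

```python
import numpy as np
import faiss

# Toy sizes for illustration only (the issue uses ~2.5M FaceNet embeddings with d=512).
d, nb, nq, nlist = 64, 100_000, 1_000, 1024
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)
xq = rng.random((nq, d), dtype=np.float32)

# Brute-force index gives the exact nearest neighbour as ground truth.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, 1)

quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
ivf.train(xb)
ivf.add(xb)

for nprobe in (1, 8, 32, 128):
    ivf.nprobe = nprobe              # default is 1: only one inverted list is scanned
    _, I = ivf.search(xq, 10)
    recall = (I == gt).sum() / nq    # fraction of queries whose true NN is in the top 10
    print(f"nprobe={nprobe:4d}  recall@10={recall:.3f}")
```

With nprobe=1, neighbours that fall outside the single probed list can never be returned, no matter how large k is; raising nprobe widens the search at the cost of query speed.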
Of course, thank you very much!
I did want to ask about the typical strategy for splitting your datasets. In some examples I have noticed that you build xb, xt, and xq datasets: one for adding, one for training, and the last for querying (equivalent to a test set). I am not sure what the typical split is in this field. Do you usually train on xt, add [xt, xb] to the index (or does xb already contain xt?), and search with xq? It is hard to tell how you have constructed your memmap files. What proportion of the whole dataset is typically xq, xt, and xb?
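For what it's worth, here is a minimal sketch of one common arrangement (my assumption about typical practice, not a description of how the memmap files in the examples were actually built): xt is a sample used only for training, xb is the full database that gets added, and xq is a held-out query set that is never added. All sizes and the nlist value are made up for illustration.

```python
import numpy as np
import faiss

d = 512
rng = np.random.default_rng(0)
embeddings = rng.random((50_000, d), dtype=np.float32)   # stand-in for the real 2.5M vectors

xq = embeddings[:1_000]                                  # held-out queries, never added
xb = embeddings[1_000:]                                  # full database that gets added
xt = xb[rng.choice(len(xb), 20_000, replace=False)]      # training sample drawn from xb

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_L2)
index.train(xt)     # training only needs a representative sample
index.add(xb)       # add the whole database (so xt is implicitly part of xb here)
D, I = index.search(xq, 10)
```

In my understanding the exact proportions matter less than having enough training vectors per centroid and enough queries to give a stable recall estimate.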
Thanks for such a killer project!
Summary
I am using embeddings computed from the popular FaceNet model. I have calculated about 2.5M embeddings with d=512 and am comparing the performance of the IndexIVFFlat against the simple Flat index. Even with large k, I see flat results in the recall.

Running on:
Interface:

Reproduction instructions
Notice how the recall does not increase as k increases. I have tried many values between 4096 and 20000 and I do not see any improvement.
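A minimal synthetic sketch of the kind of comparison described above (assumed toy sizes and parameters, not the original reproduction code): with nprobe left at its default of 1, only one inverted list is scanned per query, so recall plateaus no matter how large k gets.

```python
import numpy as np
import faiss

d, nb, nq = 64, 100_000, 1_000           # toy sizes, not the real 2.5M x 512 data
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)
xq = rng.random((nq, d), dtype=np.float32)

# Exact nearest neighbour from a brute-force Flat index, used as ground truth.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, 1)

quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_L2)
ivf.train(xb)
ivf.add(xb)                               # nprobe is left at its default of 1

for k in (10, 100, 1000):
    _, I = ivf.search(xq, k)
    recall = (I == gt).sum() / nq         # did the true NN show up anywhere in the top k?
    print(f"k={k:5d}  recall={recall:.3f}")
```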
Questions:
Is it possible that the data distribution is not conducive to this method?
Am I possibly splitting my query and training set incorrectly?