criteo / autofaiss

Automatically create Faiss knn indices with the most optimal similarity search parameters.
https://criteo.github.io/autofaiss/
Apache License 2.0
806 stars 74 forks

multi index ideas #29

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

some info at https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk and https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors and https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/OnDiskInvertedLists.cpp

rom1504 commented 2 years ago

the second idea indeed seems like it will work, see https://github.com/facebookresearch/faiss/blob/main/benchs/distributed_ondisk/merge_to_ondisk.py

it seems possible to merge an ivf index without loading it in memory

that means it should be possible to create N index parts in parallel (even on a single machine, to parallelize reading), each of which only takes (size of the part + loaded embeddings) in memory (for example 4 × 1GB), and then to merge them at the end. That would reduce the overall memory requirement from the current (size of the full index + add overhead, eg 20GB) to only the in-progress parts (eg 4GB), and allow creating very large indices at constant memory usage

combining with #17 it should allow creating such large indices at low memory usage.

edit: that script doesn't do exactly that: it doesn't merge the codes. We would need something slightly different for this use case

rom1504 commented 2 years ago

fyi @hitchhicker (related to what we discussed)

rom1504 commented 2 years ago

The production of an index in distributed mode is now done (see https://github.com/criteo/autofaiss/pull/71)

Follow-ups here could be:

rom1504 commented 2 years ago

I'm pretty sure that SliceInvertedLists (https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/InvertedLists.h#L272 https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/InvertedLists.cpp#L321) can be used to create partitioned indices from a big index at no additional memory/disk cost (ie the inverted lists are loaded only once and shared between the big index and the partitioned indices)