Opened by rom1504, 2 years ago
The second idea does indeed seem workable, see https://github.com/facebookresearch/faiss/blob/main/benchs/distributed_ondisk/merge_to_ondisk.py
It appears possible to merge IVF indices without loading them into memory.
That means it should be possible to build N index parts in parallel (even on a single machine, to parallelize reading), which takes only N times (size of one part + the loaded embeddings) in memory (for example 4x1GB), and then merge the parts at the end. That would reduce the overall memory requirement from the current size of the full index + the add overhead (e.g. 20GB) down to just the in-progress parts (e.g. 4GB), and allow building very large indices at constant memory usage.
Combined with #17, this should allow building such large indices at low memory usage.
Edit: that script doesn't do exactly that; it doesn't merge the codes. Something slightly different would be needed for this use case.
FYI @hitchhicker (related to what we discussed)
Producing an index in distributed mode is now implemented (see https://github.com/criteo/autofaiss/pull/71)
Follow-ups here could be:
I'm pretty sure that SliceInvertedLists (https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/InvertedLists.h#L272, https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/InvertedLists.cpp#L321) can be used to create partitioned indices from a big index at no additional memory/disk cost (i.e. the inverted lists are loaded only once and shared by the big index and the partitioned indices)
Some relevant resources: https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk, https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors and https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/OnDiskInvertedLists.cpp