facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Small question on HNSW parameter #2268

Closed DaehanKim closed 3 months ago

DaehanKim commented 2 years ago

Summary

What is the meaning of x in HNSWx,Flat? I'm confused about its space footprint, 4*d + x * M * 2 * 4. I know M is the maximum degree of each node per layer and d is the dimensionality of the vectors to index, but I could not work out the meaning of x. Is it the number of layers that the algorithm produces?

KinglittleQ commented 2 years ago

x in HNSWx,Flat is actually M (the maximum number of neighbors per node on each layer, except the lowest layer, which allows 2M):

>>> import faiss
>>> index = faiss.index_factory(128, "HNSW32,Flat")
>>> index.hnsw.nb_neighbors(1)
32
>>> index.hnsw.nb_neighbors(0)
64
DaehanKim commented 2 years ago

@KinglittleQ Thank you for the reply. Then does the space footprint become 4*d + x * M * 2 * 4 = 4*d + (M^2) * 2 * 4?

lchen-p commented 4 months ago

This does not seem to be correct. With M=32, one vector would use 4 * d + 32 * 32 * 2 * 4 = (4d + 8192) bytes. If d is small enough (less than 200), the 4d term is negligible in this equation. Then 1M vectors would cost about 8192 * 1e6 bytes, so more than 8 GB?
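The arithmetic behind that estimate can be sketched as follows (d = 128 is an assumed example, not stated in the thread):

```python
d = 128   # example dimensionality (assumption)
M = 32

# If x were M, the per-vector footprint 4*d + x * M * 2 * 4 becomes:
per_vector = 4 * d + M * M * 2 * 4   # 512 + 8192 = 8704 bytes
total_1m = per_vector * 1_000_000    # bytes for 1M vectors

print(per_vector)       # 8704
print(total_1m / 1e9)   # 8.704, i.e. ~8.7 GB, dominated by the link term
```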

lchen-p commented 4 months ago

Ok, I read the paper: "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs". Section 4.2.3 lists the memory cost of the index as (Mmax0 + mL * Mmax) * bytes_per_link per element.

bytes_per_link is 4, Mmax is M, and Mmax0 is the max number of neighbors in the bottom layer, so 2M; mL is 1/ln(M). So the index should cost `4 * (2 + 1/ln(M)) * M` bytes. With `4 * d` for the data, the final equation is `4 * d + 4 * (2 + 1/ln(M)) * M` bytes/vector.

If we ignore the term 1/ln(M) ("only" 0.29 for M=32), we get 4 * d + 4 * 2 * M, i.e. x would be 1. Otherwise, to be precise, x would be (2 + 1/ln(M)) / 2.
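A quick numeric check of the corrected formula (a sketch; d = 128 is an assumed example):

```python
import math

d = 128   # example dimensionality (assumption)
M = 32

m_L = 1 / math.log(M)                    # level-assignment multiplier from the paper
per_vector = 4 * d + 4 * (2 + m_L) * M   # data + links, bytes per vector
x = (2 + m_L) / 2                        # the x that makes 4*d + x*M*2*4 match

print(round(per_vector))   # 805 bytes/vector, so ~0.8 GB per 1M vectors
print(round(x, 2))         # 1.14
```

So for M=32 the graph links add roughly 300 bytes per vector, an order of magnitude less than the ~8 KB the x = M reading would imply.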

Though I am not sure how closely the implementation follows the paper.