ekzhu / datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://ekzhu.github.io/datasketch
MIT License
2.51k stars 294 forks source link

HNSW: `HNSW.add` will not set the entry point of new levels #228

Closed amirouche closed 11 months ago

amirouche commented 11 months ago

I am getting into vector space, so take what I write with a grain of salt 👋🏽

There is no fixed entry / enter point in HNSW, not even at the highest layer / level:

I can't find the algorithm in the paper that documents how to insert the first point. Tho, the insert algorithm reads that the associated level $l$ where it appears first in the HNSW layers is not necessarily the highest level $L$:

For every inserted element an integer maximum layer l is randomly [chosen].

Quoting HNSW:

After that, the algorithm continues the search from the next layer using the found closest neighbors from the previous layer as enter points, and the process repeats.

https://arxiv.org/pdf/1603.09320.pdf

The last line is what triggered me to fill the issue.

https://github.com/ekzhu/datasketch/blob/c98c145417965d015103f2a1bfcf59658e90b470/datasketch/hnsw.py#L533-L537

The last line self._entrypoint = key can be done outside the for but see my previous comments.

ekzhu commented 11 months ago

Thanks. Discussion on the PR.

amirouche commented 11 months ago

Give there is low number of points at the highest level, randomizing at every query could help. Another kind index for the highest layer may be even more interesting in some use-cases.

I am more interested in bigger than memory indices.

Closing this issue.