eoriont opened this issue 10 months ago
Here is a link https://escholarship.org/content/qt1rp889r9/qt1rp889r9_noSplash_7071690a1d8a4ee71eb95432887d3a8e.pdf to a paper that contains various optimizations for the HNSW algorithm. We can look at this for inspiration.
Here is a list of potential optimizations to add on top of the base Hierarchical Navigable Small Worlds (HNSW) algorithm.
Parallel Computing: The HNSW algorithm can be optimized by implementing parallel computing, as shown in the thesis paper linked above. However, it would be hard to explain and discuss this in a mathematical manner.
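Just to illustrate the idea (not the thesis's actual graph-level parallelism): distance evaluations against a set of candidates are independent of each other, so they can be farmed out to a worker pool. This is a toy sketch in pure Python; function names are mine, and note that Python threads won't actually speed up CPU-bound math because of the GIL.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def parallel_nearest(query, points, workers=4):
    # Evaluate the distance to each candidate point in parallel,
    # then take the argmin. Real HNSW parallelism works at the
    # graph construction/search level; this only shows the shape.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        dists = list(pool.map(lambda p: euclidean(query, p), points))
    return points[dists.index(min(dists))]
```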
Distance Metric: We could improve the HNSW algorithm by using distance metrics other than just Euclidean distance. We could explore metrics like Manhattan distance, Cosine distance, Chebyshev distance, and Minkowski distance, and compare the runtimes between them. This is probably a good optimization to explore, because each of these metrics is easy to explain mathematically. Here is a blog post that gives a basic explanation of each of them.
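For reference, here's a quick pure-Python sketch of each metric (function names are mine). Note that Minkowski with p=1 is Manhattan, p=2 is Euclidean, and p→∞ approaches Chebyshev, which is a nice way to tie them together mathematically.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # largest coordinate-wise difference
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # generalizes Manhattan (p=1) and Euclidean (p=2)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between a and b
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)
```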
Here is another distance optimization. I highly doubt we would actually be able to implement it, because it appears to be very complicated and can't be added by just swapping out the distance function. We could, however, mention it under future extensions, as another potential optimization to explore. It works by approximating the distance itself, which is acceptable since HNSW is already an approximate algorithm. It is discussed in this paper
https://esteininger.medium.com/building-a-vector-search-engine-using-hnsw-and-cosine-similarity-753fb5268839 Here is another resource I used; make sure to add it to the references section.
Got partially through adding Cosine Distance to the Overleaf document (line 111). I need to brush up on the other distance metrics we thought about and have employed so far. Finished adding the references, though.
As for advantages and disadvantages, are there any obvious ones that stood out when running those trials? The only thing I managed to see was the O(log n) time complexity, which doesn't seem specific to Cosine distance. I'm also not sure how comparing vectors by angle would make the algorithm more accurate, though I haven't been able to dedicate much time to it.
Just to clarify: Cosine is Cosine and l2 is Euclidean, correct? If so, what is ip distance, and will we be using Manhattan distance?
Alright, ignore the previous comment. I got through explaining each one of them and their formulas, though I need to think about advantages and disadvantages. I got some help from this, so the author might have some useful stuff here too. https://medium.com/@junxie2/semantic-search-similarity-metrics-e215a0d65885
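To sum up the naming question: yes, in ANN libraries "l2" typically means Euclidean (often the squared form internally) and "ip" means inner-product distance, i.e. 1 minus the dot product. Here's a quick sketch (pure Python, my own function names) showing how the three relate; for unit-length vectors, ip equals cosine distance and squared l2 is exactly twice the cosine distance.

```python
import math

def l2_distance(a, b):
    # "l2" = Euclidean distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ip_distance(a, b):
    # "ip" = inner-product distance: 1 minus the dot product
    return 1 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)
```

So on normalized data the three orderings agree, which is why some libraries implement cosine search by normalizing vectors and using ip.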
https://github.com/erikbern/ann-benchmarks
Here is a list of KNN algorithms, all benchmarked for us. We could probably cite these benchmarks in the optimizations section.
https://publications.hse.ru/mirror/pubs/share/folder/x5p6h7thif/direct/128296059 (HNSW paper)
https://arxiv.org/pdf/1603.09320.pdf (another HNSW paper)
https://github.com/brtholomy/hnsw (HNSW explanation)
https://www.nature.com/articles/30918 (Small worlds)
https://ethen8181.github.io/machine-learning/deep_learning/multi_label/nsw.html (NSW tutorial)
https://fasttext.cc/docs/en/python-module.html#train_supervised-parameters (docs for ngram model used in nsw tut)
Quick question, where did you find out that Inner Product might have a closer vector than itself? Was it through just working with the HNSW implementation we have or was there a source to it?
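For what it's worth, this is easy to see with a toy example: under inner-product similarity, a vector is not necessarily its own nearest neighbor, because any longer vector pointing in the same direction scores a higher dot product. A minimal sketch (pure Python):

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

q = (1.0, 1.0)
longer = (2.0, 2.0)  # same direction as q, but twice the length

# q · q = 2, while q · longer = 4, so under inner-product
# similarity the query's own copy is NOT the top result
closer_than_self = inner_product(q, longer) > inner_product(q, q)
```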
https://www3.nd.edu/~dgalvin1/60610/60610_S09/60610graphnotation.pdf (graph notation)
Basically, Google some papers on HNSW optimizations, then copy the links and post them here with a short summary of each.