duckdb / duckdb_vss

Benchmark against FAISS & nmslib? #4

Open jaanli opened 2 months ago

jaanli commented 2 months ago

Such a benchmark would be super helpful for deciding which in-browser use cases this extension can support :)

https://github.com/nmslib/hnswlib

For example, I have a few databases ready to go:

- 20 years of census data: https://jaanli.github.io/american-community-survey/new-york-area/income-by-race
- 15 million hospital claims: https://onefact.github.io/synthetic-healthcare-data/
- All of NYC real estate: https://jaanli.github.io/new-york-real-estate/

And I really want to visualize the 30,000+ Mandarin characters by their phono-semantic specificity/etymological origins on a map.

All of these require high-dimensional similarity search, but at very different scales. So the UI/UX interactions (e.g. very early ones from 2017 here: https://jaan.io/food2vec-augmented-cooking-machine-intelligence/) will be constrained by the queries per second that this DuckDB extension supports.
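
To make the queries-per-second constraint concrete, here is a minimal timing sketch. It assumes a hypothetical `search_fn(query, k)` callable standing in for whichever backend is being measured (this extension, FAISS, or hnswlib). The `brute_force_search` baseline, the `measure_qps` helper, and the random data are illustrative only and are not part of any of those libraries.

```python
import random
import time


def brute_force_search(vectors, query, k):
    """Exact inner-product search; a stand-in for a real index (assumption)."""
    scored = [(sum(a * b for a, b in zip(vec, query)), idx)
              for idx, vec in enumerate(vectors)]
    scored.sort(reverse=True)  # highest inner product first
    return [idx for _, idx in scored[:k]]


def measure_qps(search_fn, queries, k, repeats=3):
    """Return the best observed queries-per-second over `repeats` runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for query in queries:
            search_fn(query, k)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(queries) / elapsed)
    return best


# Small synthetic dataset so the sketch runs quickly anywhere.
rng = random.Random(0)
dim = 16
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(20)]

qps = measure_qps(lambda q, k: brute_force_search(vectors, q, k), queries, k=10)
print(f"brute force: {qps:.1f} queries/second")
```

Swapping the lambda for a call into a real index would give comparable numbers across backends, as long as the same queries and `k` are used for each.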

Hope that makes sense, and happy to help! 🙏 super exciting that this is now feasible!!

jaanli commented 2 months ago

In case further motivation is needed, here are the types of algorithms I need to benchmark: https://github.com/google-deepmind/xtr - the FAISS parts are here: https://github.com/google-deepmind/xtr/blob/main/xtr_evaluation_on_beir_miracl.ipynb

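            # Excerpt from a method; `all_token_embeds`, `num_tokens`, and `self`
            # are defined elsewhere in the notebook.
            import faiss

            ds = 128            # embedding dimension
            num_clusters = 50   # number of IVF cells
            code_size = 64      # number of PQ sub-quantizers
            # Coarse quantizer that assigns vectors to IVF cells by inner product.
            quantizer = faiss.IndexFlatIP(ds)
            # OPQ rotation applied before product quantization.
            opq_matrix = faiss.OPQMatrix(ds, code_size)
            opq_matrix.niter = 10
            # IVF-PQ index using 4 bits per sub-quantizer code.
            sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 4, faiss.METRIC_INNER_PRODUCT)
            index = faiss.IndexPreTransform(opq_matrix, sub_index)
            # Train on the token embeddings, then add the same vectors to the index.
            index.train(all_token_embeds[:num_tokens])
            index.add(all_token_embeds[:num_tokens])

            class FaissSearcher(object):
                """Thin wrapper exposing a `search_batched` method over a FAISS index."""
                def __init__(self, index):
                    self.index = index

                def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
                    # FAISS returns (scores, ids); callers here expect (ids, scores).
                    scores, top_ids = self.index.search(query_embeds, final_num_neighbors)
                    return top_ids, scores

            self.searcher = FaissSearcher(index)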
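
A benchmark would also need to report accuracy next to speed, since an approximate index like the IVF-PQ one above trades recall for throughput. Below is a minimal sketch of a recall check. `ExactSearcher` and `recall_at_k` are hypothetical names, not part of xtr or FAISS. `ExactSearcher` mirrors the `search_batched` interface of `FaissSearcher` and serves as ground truth, but it returns plain Python lists rather than the NumPy arrays FAISS produces.

```python
class ExactSearcher:
    """Brute-force inner-product search returning (top_ids, scores) per query."""

    def __init__(self, vectors):
        self.vectors = vectors

    def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
        top_ids, scores = [], []
        for query in query_embeds:
            # Sort by descending score, breaking ties by ascending id.
            scored = sorted(
                ((sum(a * b for a, b in zip(vec, query)), idx)
                 for idx, vec in enumerate(self.vectors)),
                key=lambda pair: (-pair[0], pair[1]),
            )[:final_num_neighbors]
            top_ids.append([idx for _, idx in scored])
            scores.append([score for score, _ in scored])
        return top_ids, scores


def recall_at_k(approx_ids, exact_ids):
    """Fraction of true neighbors that were retrieved, averaged over queries."""
    per_query = [len(set(a) & set(e)) / len(e)
                 for a, e in zip(approx_ids, exact_ids)]
    return sum(per_query) / len(per_query)


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
queries = [[1.0, 0.1], [0.0, 1.0]]
exact_ids, _ = ExactSearcher(vectors).search_batched(queries, 2)

# Simulate an approximate index that missed one neighbor for the second query.
approx_ids = [exact_ids[0], [exact_ids[1][0], 3]]
print(recall_at_k(approx_ids, exact_ids))  # 0.75
```

In a real comparison, `approx_ids` would come from the index under test, for example from `FaissSearcher.search_batched` or from a query against this extension, with the ids converted to lists first.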