Benchmark against FAISS & nmslib?

duckdb / duckdb_vss

A benchmark against FAISS and nmslib would be super helpful for deciding which in-browser use cases this extension is fast enough for :)

For example, I have a few databases ready to go:

20 years of census data - https://jaanli.github.io/american-community-survey/new-york-area/income-by-race
15 million hospital claims - https://onefact.github.io/synthetic-healthcare-data/
All of NYC real estate - https://jaanli.github.io/new-york-real-estate/

And I really want to visualize the 30,000+ Mandarin characters by their phono-semantic specificity/etymological origins on a map.

All of these require high-dimensional similarity search, but at very different scales. So the UI/UX interactions (e.g. very early ones from 2017 here: https://jaan.io/food2vec-augmented-cooking-machine-intelligence/) will be constrained by the queries per second that this duckdb extension supports.
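To make the queries-per-second constraint concrete, here is a minimal sketch of how such a measurement could look. Everything in it is an assumption for illustration: `brute_force_search` and `measure_qps` are hypothetical helper names, the data is random, and the exact inner-product scan is only a stand-in for whatever index is being benchmarked (this extension, FAISS, or nmslib).

```python
import random
import time


def brute_force_search(vectors, query, k):
    """Exact top-k by inner product; a stand-in for a real index (assumption)."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    ]
    # Highest score first; ties broken by the lower row id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


def measure_qps(search_fn, queries, k, repeats=3):
    """Return the best queries-per-second observed over `repeats` passes."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            search_fn(q, k)
        # Guard against a zero reading on very coarse timers.
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, len(queries) / elapsed)
    return best


# Hypothetical toy data: 500 random 16-dimensional vectors, 20 queries.
rng = random.Random(0)
dim, n = 16, 500
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(20)]

qps = measure_qps(lambda q, k: brute_force_search(vectors, q, k), queries, k=10)
print(f"{qps:.0f} queries/second")
```

Swapping the lambda for a call into a real index would give comparable numbers across libraries, as long as the dataset, `k`, and hardware stay fixed.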

Hope that makes sense, and happy to help! 🙏 super exciting that this is now feasible!!

import faiss

# all_token_embeds and num_tokens are defined elsewhere in the original code (not shown).
ds = 128            # embedding dimensionality
num_clusters = 50   # number of IVF cells
code_size = 64      # number of PQ sub-quantizers

quantizer = faiss.IndexFlatIP(ds)
opq_matrix = faiss.OPQMatrix(ds, code_size)
opq_matrix.niter = 10
# The 4 is the number of bits per sub-quantizer code.
sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 4, faiss.METRIC_INNER_PRODUCT)
index = faiss.IndexPreTransform(opq_matrix, sub_index)
index.train(all_token_embeds[:num_tokens])
index.add(all_token_embeds[:num_tokens])

class FaissSearcher(object):
    def __init__(self, index):
        self.index = index

    def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
        scores, top_ids = self.index.search(query_embeds, final_num_neighbors)
        return top_ids, scores

# `self` refers to an enclosing object in the original code (not shown).
self.searcher = FaissSearcher(index)
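Because an OPQ + IVFPQ index like the one above is approximate, a speed comparison probably only means something alongside a recall number. Here is a rough sketch of recall@k, assuming you already have the approximate ids (for instance the `top_ids` returned by `search_batched`) and exact ids from some exhaustive search. `recall_at_k` is a hypothetical helper, and the id lists below are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k that the approximate search also returned."""
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        # Compare as sets: order within the top-k does not matter here.
        total += len(set(approx[:k]) & set(exact[:k])) / k
    return total / len(exact_ids)


# Made-up ids for two queries (illustrative only).
exact = [[3, 7, 1], [4, 0, 9]]
approx = [[3, 1, 8], [4, 0, 9]]
print(recall_at_k(approx, exact, k=3))
```

Reporting queries per second at a fixed recall level (say 0.9 or 0.95) would make the numbers comparable between this extension, FAISS, and nmslib.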

Benchmark against FAISS & nmslib? #4