Open jaanli opened 2 months ago
In case further motivation is needed, here are the types of algorithms I need to benchmark: https://github.com/google-deepmind/xtr - the FAISS parts are here: https://github.com/google-deepmind/xtr/blob/main/xtr_evaluation_on_beir_miracl.ipynb
import faiss

ds = 128  # embedding dimensionality
num_clusters = 50  # number of IVF cells
code_size = 64  # passed to OPQMatrix and IndexIVFPQ as the number of PQ sub-quantizers
quantizer = faiss.IndexFlatIP(ds)
opq_matrix = faiss.OPQMatrix(ds, code_size)
opq_matrix.niter = 10
# 4 bits per sub-quantizer code, inner-product metric
sub_index = faiss.IndexIVFPQ(quantizer, ds, num_clusters, code_size, 4, faiss.METRIC_INNER_PRODUCT)
index = faiss.IndexPreTransform(opq_matrix, sub_index)
index.train(all_token_embeds[:num_tokens])
index.add(all_token_embeds[:num_tokens])

class FaissSearcher(object):
    def __init__(self, index):
        self.index = index

    def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
        # faiss returns (scores, ids); swap the order for the caller
        scores, top_ids = self.index.search(query_embeds, final_num_neighbors)
        return top_ids, scores

# `self` here is the enclosing object in the notebook this was excerpted from
self.searcher = FaissSearcher(index)
Such a benchmark would be super helpful to decide which in-browser use cases are flexible enough :)
https://github.com/nmslib/hnswlib
For example, I have a few databases ready to go:
- 20 years of census data: https://jaanli.github.io/american-community-survey/new-york-area/income-by-race
- 15 million hospital claims: https://onefact.github.io/synthetic-healthcare-data/
- All of NYC real estate: https://jaanli.github.io/new-york-real-estate/
And I really want to visualize the 30,000+ Mandarin characters by their phono-semantic specificity/etymological origins on a map.
All of these require high-dimensional similarity search, but at very different scales. So the UI/UX interactions (e.g. very early ones from 2017 here: https://jaan.io/food2vec-augmented-cooking-machine-intelligence/) will be constrained by the queries per second supported by this DuckDB extension.
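To make the queries-per-second point concrete, here is a minimal sketch of how such a measurement could be taken against anything that exposes the `search_batched` interface from the snippet above. The names `BruteForceSearcher` and `measure_qps` are hypothetical, not part of FAISS, hnswlib, or the DuckDB extension, and the pure-Python brute-force searcher is only a stand-in so the example runs without extra dependencies.

```python
import random
import time


class BruteForceSearcher:
    """Hypothetical toy searcher with the same search_batched interface."""

    def __init__(self, vectors):
        self.vectors = vectors

    def search_batched(self, query_embeds, final_num_neighbors, **kwargs):
        top_ids, scores = [], []
        for q in query_embeds:
            # Exact inner-product score against every stored vector.
            scored = [
                (sum(a * b for a, b in zip(q, v)), i)
                for i, v in enumerate(self.vectors)
            ]
            # Highest score first; ties keep the lower id first.
            scored.sort(key=lambda pair: -pair[0])
            best = scored[:final_num_neighbors]
            top_ids.append([i for _, i in best])
            scores.append([s for s, _ in best])
        return top_ids, scores


def measure_qps(searcher, query_embeds, final_num_neighbors, repeats=3):
    """Return the best observed queries per second over `repeats` timed runs."""
    best_elapsed = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        searcher.search_batched(query_embeds, final_num_neighbors)
        best_elapsed = min(best_elapsed, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(query_embeds) / max(best_elapsed, 1e-9)


# Synthetic data, assumed sizes chosen only to keep the demo fast.
rng = random.Random(0)
dim = 8
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(20)]

qps = measure_qps(BruteForceSearcher(vectors), queries, final_num_neighbors=5)
print(f"{qps:.0f} queries/second")
```

Swapping `BruteForceSearcher` for the `FaissSearcher` wrapper (or an equivalent wrapper around another backend) should give comparable numbers per dataset size, which is roughly what would decide which interactions stay responsive.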
Hope that makes sense, and happy to help! 🙏 super exciting that this is now feasible!!