ashvardanian opened this 1 year ago
Thanks for adding this! Yeah I'm definitely open to other datasets – I believe there are a few issues/PRs with similar questions (reminds me I really need to go through this – sorry for being so behind)
As far as I know, ann-benchmarks works with a specific dataset format (i.e., HDF5), so conversion is required. By the way, multimodal CC3M (bigG CLIP image and text embeddings) and arXiv title-abstract (E5 embeddings) could be good options.
I also find those to be exciting datasets. I have used both of them when comparing the quality of cross-modal "semantic joins" with uni-modal embeddings in one of my recent posts. Converting to HDF5 shouldn't be an issue at all.
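As a rough illustration of what that conversion involves, here is a minimal sketch of producing the HDF5 layout ann-benchmarks expects (`train`, `test`, `neighbors`, `distances`, mirroring the published `glove-25-angular.hdf5`). The random data, sizes, and output filename are purely illustrative:

```python
# Sketch: write an embeddings dataset in the ann-benchmarks HDF5 layout.
# The dataset names mirror glove-25-angular.hdf5; the data is synthetic.
import h5py
import numpy as np

rng = np.random.default_rng(42)
train = rng.standard_normal((10_000, 25)).astype(np.float32)
test = rng.standard_normal((100, 25)).astype(np.float32)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Brute-force ground truth under angular (cosine) distance, top-100.
k = 100
sims = normalize(test) @ normalize(train).T
neighbors = np.argsort(-sims, axis=1)[:, :k]
distances = 1.0 - np.take_along_axis(sims, neighbors, axis=1)

with h5py.File("example-25-angular.hdf5", "w") as f:
    f.attrs["distance"] = "angular"
    f.create_dataset("train", data=train)
    f.create_dataset("test", data=test)
    f.create_dataset("neighbors", data=neighbors)
    f.create_dataset("distances", data=distances)
```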
@erikbern, can you please check the conflict with `algos.yaml`? I don't have permission to resolve the conflict in this UI; I can only try rebasing from the upstream.
@ashvardanian Please rebase your PR against master yourself. You are using an outdated directory structure. Your implementation is supposed to go into `ann_benchmarks/algorithms/usearch/`.
With regard to new datasets, it would be great if you could run the benchmark on these datasets and share plots. More diversity is appreciated, also with regard to the observed performance/quality-tradeoff. Thanks!
@maumueller sounds good! Let's split the work into two halves: first merge USearch, then the new datasets. As for USearch, I am considering a few options for the `config.yml` and am unsure of the best way to structure it.
```yaml
float:
  any:
    - base_args: ['@metric']
      constructor: USearch
      disabled: false
      docker_tag: ann-benchmarks-usearch
      module: ann_benchmarks.algorithms.usearch
      name: usearch
      run_groups:
        M-4:
          arg_groups: [{M: 4, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-8:
          arg_groups: [{M: 8, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-12:
          arg_groups: [{M: 12, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-16:
          arg_groups: [{M: 16, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-24:
          arg_groups: [{M: 24, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-36:
          arg_groups: [{M: 36, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-48:
          arg_groups: [{M: 48, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-64:
          arg_groups: [{M: 64, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
        M-96:
          arg_groups: [{M: 96, efConstruction: 500}]
          args: {}
          query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
```
The current variant is identical to HNSWlib. We, however, also support `f16` and `i8` internal downcasting. What would be the best way to expose such functionality?
If you want to run them through the benchmark, they should probably be exposed through a parameter in `args` (e.g., `[['f32', 'f16', 'i8']]`). Please make sure to include the information in the `__str__` method of your wrapper class to expose it in the evaluation.
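To make that concrete, here is a hypothetical sketch of a wrapper exposing the quantization level as a constructor argument and surfacing it via `__str__`. Class, method, and parameter names are illustrative, not the actual ann-benchmarks wrapper:

```python
# Hypothetical wrapper sketch: the dtype ('f32', 'f16', 'i8') arrives via
# args and is encoded in __str__ so each run is distinguishable in plots.
class USearch:
    def __init__(self, metric, M, ef_construction, dtype="f32"):
        self.metric = metric
        self.M = M
        self.ef_construction = ef_construction
        self.dtype = dtype  # internal downcasting level
        self.ef = None

    def set_query_arguments(self, ef):
        # Called once per entry in query_args.
        self.ef = ef

    def __str__(self):
        # Include every build/query parameter that affects results.
        return (f"USearch(M={self.M}, efConstruction={self.ef_construction}, "
                f"dtype={self.dtype}, ef={self.ef})")
```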
However, note that we enforce a rather strict time limit of 2 hours for building + querying. Afterwards, the container will just be killed, and you might not get the runs carried out that would have given you the best performance.
@ashvardanian What is the status on this issue?
Happy to merge this if you want to rebase!
Hey, @erikbern and @maumueller! Thank you for following up!
As of right now, I'm in the active development phase of USearch v3 and UCall v1, hoping to finish the first by the end of the month. Having USearch in ann-benchmarks would be great, but I'm unsure if I have time to rebase the current implementation before the v3 release.
That said, even the current implementation of HNSW looks highly competitive, especially beyond 50 million vectors, but it seems to make more sense for big-ann-benchmarks.
Hi @ashvardanian! This looks interesting, but it does indeed seem more suitable for the big-ann-benchmarks project. Of course, it would also be interesting at the million scale that we focus on here :-)
@maumueller, I was also thinking about big-ann-benchmarks, but this year they've shrunk the dataset size from 1 billion down to tens of millions... We have a new result coming up for 20 billion vectors, and I have entirely no clue as to where to publish it 😅
As for datasets, the arXiv titles and abstracts encoded with E5, suggested by @kimihailv, are indeed a great option! I've recently used them to build a scientific RAG, so they are pretty representative of the embeddings actually used in production. I guess we can use `glove-25-angular.hdf5` as a reference, right?
```python
>>> f = h5py.File('glove-25-angular.hdf5', 'r')
>>> f.keys()
<KeysViewHDF5 ['distances', 'neighbors', 'test', 'train']>
>>> f["train"].shape
(1183514, 25)
>>> f["distances"].shape
(10000, 100)
```
@ashvardanian: the results look good. IIUC the underlying implementation is HNSW; could you please compare the results with the nmslib/hnswlib and faiss/hnsw implementations?
Thanks for open-sourcing this benchmark, @erikbern! We have had this fork for a while, and it feels like a good time to contribute.
A couple of things to note:

- With `f8` downcasting, recall drops to zero on most datasets included in the benchmark. This depends on the properties of the inserted vectors, which differ from embeddings produced with most modern encoder Transformers.

Addressing the first issue, would it make sense to extend the list of datasets with other embeddings? We have a few such datasets for ANN benchmarks on our HuggingFace profile.
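To illustrate why recall under 8-bit downcasting depends on vector properties, here is a toy numpy sketch (not USearch's actual quantization): exact top-1 search survives symmetric int8 quantization when vectors are unit-normalized, but collapses when the values saturate the int8 range. All names and scales here are illustrative:

```python
# Toy demo: top-1 recall of brute-force search after int8 quantization,
# once with well-scaled unit-norm vectors, once with saturating values.
import numpy as np

def top1_recall(base, query, base_q, query_q):
    exact = np.argmax(query @ base.T, axis=1)
    # Cast to int32 before the matmul to avoid int8 overflow.
    quant = np.argmax(query_q.astype(np.int32) @ base_q.astype(np.int32).T, axis=1)
    return float(np.mean(exact == quant))

def quantize_i8(x, scale):
    # Symmetric quantization: map [-scale, scale] onto [-127, 127], clipping.
    return np.clip(np.round(x / scale * 127), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
base = rng.standard_normal((2_000, 64)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)
query = rng.standard_normal((200, 64)).astype(np.float32)
query /= np.linalg.norm(query, axis=1, keepdims=True)

# Unit-norm vectors: components fit comfortably in [-1, 1].
good = top1_recall(base, query, quantize_i8(base, 1.0), quantize_i8(query, 1.0))
# Mis-scaled vectors saturate to ±127, keeping only sign information.
bad = top1_recall(base, query,
                  quantize_i8(base * 1e4, 1.0), quantize_i8(query * 1e4, 1.0))
```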