erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

Add: USearch engine #451

Open ashvardanian opened 1 year ago

ashvardanian commented 1 year ago

Thanks for open-sourcing this benchmark, @erikbern! We have had this fork for a while, and it feels like a good time to contribute.

A couple of things to note:

  1. When you choose f8, recall drops to zero on most of the datasets included in this benchmark. Those datasets behave differently from the embeddings produced by most modern encoder Transformers, so the effect depends on the properties of the inserted vectors (see the sketch at the end of this comment).
  2. We are planning a 1.0 release soon, which may require changing a couple of characters in the following line:
return self._index.search(np.expand_dims(v, axis=0), k=n)[0][0]

Addressing the first issue, would it make sense to extend the list of datasets with other embeddings? We have a few such datasets for ANN benchmarks on our HuggingFace profile.
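For anyone curious how the quantization level is picked, here is a minimal sketch against the pre-1.0 USearch Python binding; the dimensionality, metric, and dtype below are illustrative assumptions, and keyword names may differ slightly between versions:

import numpy as np
from usearch.index import Index

# The dtype argument controls the internal representation of stored
# vectors: full-precision floats vs. downcast half-precision or
# 8-bit values, which is where the recall differences show up.
index = Index(ndim=256, metric='cos', dtype='f16')

vectors = np.random.rand(1000, 256).astype(np.float32)
index.add(np.arange(len(vectors)), vectors)

# Batched query, mirroring the wrapper line quoted above.
matches = index.search(np.expand_dims(vectors[0], axis=0), 10)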

erikbern commented 1 year ago

Thanks for adding this! Yeah, I'm definitely open to other datasets – I believe there are a few issues/PRs with similar questions (reminds me I really need to go through these – sorry for being so behind)

kimihailv commented 1 year ago

As far as I know, ann-benchmarks works with a specific dataset format (i.e. HDF5), so conversion is required. By the way, the multimodal CC3M dataset (CLIP ViT-bigG image and text embeddings) and the arXiv title-abstract dataset (E5 embeddings) could be good options.

ashvardanian commented 1 year ago

I also find those to be exciting datasets. I have used both of them when comparing the quality of cross-modal "semantic joins" with uni-modal embeddings in one of my recent posts. Converting to HDF5 shouldn't be an issue at all.

@erikbern, can you please check the conflict with algos.yaml? I don't have permission to resolve the conflict in this UI; I can only try rebasing from upstream.

maumueller commented 1 year ago

@ashvardanian Please rebase your PR against master yourself. You are using an outdated directory structure.

Your implementation is supposed to go into ann_benchmarks/algorithms/usearch/.

maumueller commented 1 year ago

With regard to new datasets, it would be great if you could run the benchmark on them and share plots. More diversity is appreciated, particularly with regard to the observed performance/quality trade-off. Thanks!

ashvardanian commented 1 year ago

@maumueller sounds good! Let's split the work into two halves: first merge USearch, then add the new datasets. As for USearch, I am considering a few options for the config.yml and am unsure of the best way to structure it.

float:
  any:
  - base_args: ['@metric']
    constructor: USearch
    disabled: false
    docker_tag: ann-benchmarks-usearch
    module: ann_benchmarks.algorithms.usearch
    name: usearch
    run_groups:
      M-12:
        arg_groups: [{M: 12, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-16:
        arg_groups: [{M: 16, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-24:
        arg_groups: [{M: 24, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-36:
        arg_groups: [{M: 36, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-4:
        arg_groups: [{M: 4, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-48:
        arg_groups: [{M: 48, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-64:
        arg_groups: [{M: 64, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-8:
        arg_groups: [{M: 8, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-96:
        arg_groups: [{M: 96, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]

The current variant is identical to the HNSWlib one. However, we also support f16 and i8 internal downcasting. What would be the best way to expose such functionality?

maumueller commented 1 year ago

If you want to run them through the benchmark, they should probably be exposed through a parameter in args. (E.g., [['f32', 'f16', 'i8']].) Please make sure to include the information in the __str__ method of your wrapper class to expose it in the evaluation.

However, note that we enforce a rather strict time limit of 2 hours for building + querying. After that, the container is simply killed, so the runs that would have given you the best performance might never be carried out.
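A rough sketch of what that could look like in the wrapper, not the actual PR code; the USearch constructor keywords (connectivity, expansion_add, expansion_search), the metric mapping, and the method names are assumptions about the API at the time, and in the real module the class would subclass the benchmark's BaseANN:

import numpy as np
from usearch.index import Index

class USearch:
    # In the real module this subclasses the benchmark's BaseANN.
    def __init__(self, metric, dtype, method_param):
        self._metric = metric
        self._dtype = dtype  # 'f32', 'f16', or 'i8', taken from args
        self._m = method_param["M"]
        self._ef_construction = method_param["efConstruction"]
        self._ef_search = 10
        self._index = None

    def fit(self, X):
        self._index = Index(
            ndim=X.shape[1],
            metric="cos" if self._metric == "angular" else "l2sq",
            dtype=self._dtype,
            connectivity=self._m,
            expansion_add=self._ef_construction,
        )
        self._index.add(np.arange(len(X)), X)

    def set_query_arguments(self, ef):
        self._ef_search = ef
        self._index.expansion_search = ef

    def query(self, v, n):
        return self._index.search(np.expand_dims(v, axis=0), n)[0].keys

    def __str__(self):
        # Expose the dtype (and other parameters) in the evaluation plots.
        return (f"USearch(M={self._m}, efC={self._ef_construction}, "
                f"efS={self._ef_search}, dtype={self._dtype})")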

maumueller commented 1 year ago

@ashvardanian What is the status on this issue?

erikbern commented 11 months ago

Happy to merge this if you want to rebase!

ashvardanian commented 11 months ago

Hey, @erikbern and @maumueller! Thank you for following up!

As of right now, I'm in the active development phase of USearch v3 and UCall v1, hoping to finish the first by the end of the month. Having USearch in the ann-benchmarks would be great, but I'm unsure if I have time to rebase the current implementation before the v3 release.

[Plots: search speed, recall, construction speed, construction memory]

That said, even the current HNSW implementation looks highly competitive, especially beyond 50 million vectors, but those results seem to make more sense for big-ann-benchmarks.

maumueller commented 11 months ago

Hi @ashvardanian! This looks interesting, but it indeed seems more suitable for the big-ann-benchmarks project. Of course, it would also be interesting at the million scale that we focus on here :-)

ashvardanian commented 11 months ago

@maumueller, I was also thinking about big-ann-benchmarks, but this year they've shrunk the dataset size from 1 billion down to tens of millions... We have a new result coming up for 20 billion vectors, and I have no idea where to publish it 😅

As for datasets, the arXiv titles and abstracts encoded with E5, suggested by @kimihailv, are indeed a great option! I've recently used them to build a scientific RAG, so they are pretty representative of the embeddings actually used in production. I guess we can use glove-25-angular.hdf5 as a reference for the format, right?

>>> import h5py
>>> f = h5py.File('glove-25-angular.hdf5', 'r')
>>> f.keys()
<KeysViewHDF5 ['distances', 'neighbors', 'test', 'train']>
>>> f["train"].shape
(1183514, 25)
>>> f["distances"].shape
(10000, 100)
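
For completeness, a minimal sketch of producing a file in the same layout for a new dataset; the brute-force ground truth via scikit-learn, the 'distance' attribute, and all names and sizes below are illustrative assumptions rather than the benchmark's own tooling:

import h5py
import numpy as np
from sklearn.neighbors import NearestNeighbors

def write_ann_benchmarks_hdf5(path, train, test, k=100, metric="angular"):
    # Brute-force ground truth for the test queries.
    nn = NearestNeighbors(
        n_neighbors=k,
        metric="cosine" if metric == "angular" else "euclidean",
        algorithm="brute",
    )
    nn.fit(train)
    distances, neighbors = nn.kneighbors(test)

    # Same keys as glove-25-angular.hdf5: train, test, neighbors, distances.
    with h5py.File(path, "w") as f:
        f.attrs["distance"] = metric
        f.create_dataset("train", data=train)
        f.create_dataset("test", data=test)
        f.create_dataset("neighbors", data=neighbors)
        f.create_dataset("distances", data=distances)

# Random vectors standing in for real embeddings, e.g. E5 outputs.
vectors = np.random.rand(10_000, 768).astype(np.float32)
write_ann_benchmarks_hdf5("example-768-angular.hdf5", vectors[:9_000], vectors[9_000:])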
patelprateek commented 11 months ago

@ashvardanian: the results look good. IIUC, the underlying implementation is HNSW; could you please compare the results with the nmslib/hnswlib and faiss HNSW implementations?