erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

Publish 1M embeddings benchmark #464

Open KShivendu opened 9 months ago

KShivendu commented 9 months ago

Hello again @erikbern, I hope you're doing well.

I was curious to know the progress on releasing the benchmarks on the 1M OpenAI embeddings dataset that I created for https://github.com/erikbern/ann-benchmarks/pull/434

We can use this issue to track the same. Let me know if I can help in any way :)

KShivendu commented 9 months ago

https://github.com/erikbern/ann-benchmarks/issues/460 will also fix this, so I'm closing this issue. Thanks :)

KShivendu commented 8 months ago

@erikbern @maumueller Is there anything I can contribute to get the 1M benchmark published sooner? It'd really help my friends to see a benchmark on a larger dataset. I'm happy to help with running the benchmarks and handling any errors as well.

erikbern commented 8 months ago

Hi – planning to rerun all benchmarks at some point soon.

That being said, is the OpenAI dataset significantly different from previous datasets? I'm somewhat hesitant to use too many similar datasets, since we already have a few of similar size.

KShivendu commented 8 months ago

It's different in some specific ways:

  1. The embeddings are 1536-dimensional; most other datasets use 384, 512, or 768 dimensions.
  2. At 1M records, it is among the largest datasets.
  3. OpenAI embeddings are among the most popular at the moment. Benchmarks for them would make ann-benchmarks.com even more useful for the community.

KShivendu commented 5 months ago

Hi @erikbern, I hope you're doing well.

I noticed that ann-benchmarks.com was last updated in Dec 2021 (over 2 years ago). A lot has changed since then. I'm pretty sure updating the website would be of great value to the community. I'm happy to spend some time running the benchmarks for you. Let me know if I can help :)