erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

Include number of vectors in the db for benchmarking #435

Closed · KShivendu closed this 1 year ago

KShivendu commented 1 year ago

Hi @erikbern, We recently used the ann-benchmarks repo to benchmark the performance of two popular open-source databases: Qdrant and Postgres (pgvector). It became quite popular on Twitter. We think the number of vectors in the DB is also an important factor when comparing performance. For example, have a look at our results.

I ran the benchmark against 100k, 200k, ..., and 1M vectors in the DB, but had to split the 1M dataset into separate datasets to make it work with the repo (See this), and then wrote a script to iterate through them and run them one by one.
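
For reference, here is a minimal sketch of the idea, assuming the standard ann-benchmarks HDF5 layout (`train`, `test`, `neighbors`, `distances` datasets plus a `distance` attribute). The file paths and dataset names are illustrative, not the exact code I used. Note that ground truth has to be recomputed per split, since the neighbors stored in the full file were computed against all 1M vectors:

```python
import subprocess

import h5py
import numpy as np
from sklearn.neighbors import NearestNeighbors

SOURCE = "data/dbpedia-entities-1M.hdf5"  # hypothetical path to the full dataset
K = 100  # number of ground-truth neighbors to store per query

with h5py.File(SOURCE, "r") as src:
    train = np.array(src["train"])
    test = np.array(src["test"])
    distance = src.attrs["distance"]

for n in range(100_000, 1_000_001, 100_000):
    name = f"dbpedia-{n // 1000}k-angular"  # hypothetical dataset name
    subset = train[:n]

    # Recompute exact neighbors for this subset via brute force
    # (cosine here, assuming an angular dataset).
    nn = NearestNeighbors(n_neighbors=K, algorithm="brute", metric="cosine")
    nn.fit(subset)
    dists, idxs = nn.kneighbors(test)

    with h5py.File(f"data/{name}.hdf5", "w") as out:
        out.attrs["distance"] = distance
        out.create_dataset("train", data=subset)
        out.create_dataset("test", data=test)
        out.create_dataset("neighbors", data=idxs)
        out.create_dataset("distances", data=dists)

    # Run the benchmark for this split; run.py picks the dataset
    # up from the data/ directory by name.
    subprocess.run(["python", "run.py", "--dataset", name], check=True)
```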

I feel these changes are a bit hacky, and they could fit more naturally into the codebase so that the setup is reproducible and easier to work with. I'd love to hear your thoughts if you think there's a cleaner way to implement this. Happy to raise a PR for whatever design we agree on :)

Thanks!

erikbern commented 1 year ago

We have a few different datasets with different numbers of vectors, so maybe that covers the variation to some extent? There are a lot of parameters that affect performance – e.g. dimensionality, how the data is distributed, etc.

KShivendu commented 1 year ago

Yeah, indeed. So should I create multiple splits from the same dbpedia-entities-1M dataset so that we have variations from 100k to 1M (each step being 100k)? I'll update the https://github.com/erikbern/ann-benchmarks/pull/434 PR accordingly.
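
Registration could mirror how the glove-*-angular variants share one builder. A rough sketch (the `dbpedia_entities` builder and the dataset names below are just illustrative):

```python
# Illustrative sketch for ann_benchmarks/datasets.py; `dbpedia_entities` is a
# hypothetical builder that would download the embeddings, truncate to the
# first n vectors, compute ground truth, and write the standard HDF5 file.
def dbpedia_entities(out_fn: str, n: int) -> None:
    ...

# In the real repo this would extend the existing DATASETS dict.
DATASETS = {}
DATASETS.update(
    {
        # `n=n` binds the loop variable at definition time, avoiding the
        # classic late-binding lambda bug.
        f"dbpedia-entities-{n // 1000}k-angular": (
            lambda out_fn, n=n: dbpedia_entities(out_fn, n)
        )
        for n in range(100_000, 1_000_001, 100_000)
    }
)
```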

KShivendu commented 1 year ago

Updated the #434 PR to use multiple splits, since the glove-angular datasets already follow the same pattern.

erikbern commented 1 year ago

Tbh, instead of varying this parameter, I would rather just see a larger diversity of datasets. I think that would create more "independent" data points, which seems useful.