erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities #434

Closed · KShivendu closed this 1 year ago

KShivendu commented 1 year ago

Add a new dataset with OpenAI embeddings for DBpedia entities.

This PR also introduces the HuggingFace `datasets` library, which can load any dataset hosted on HuggingFace. This makes it easier for anyone to fork the repo and benchmark against almost any public dataset :)
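As a rough sketch of the idea (the dataset name `KShivendu/dbpedia-entities-openai-1M`, the `openai` column, and the HDF5 layout below are assumptions for illustration, not necessarily the PR's final code), pulling the embeddings with `datasets` and writing them into an ann-benchmarks-style HDF5 file could look like this:

```python
# Minimal sketch: load OpenAI DBpedia embeddings from HuggingFace and write
# them into an HDF5 file shaped like the ones ann-benchmarks consumes.
# Dataset name and column names are assumed for illustration.
import h5py
import numpy as np
from datasets import load_dataset


def build_dbpedia_openai(out_path: str, test_size: int = 10_000) -> None:
    ds = load_dataset("KShivendu/dbpedia-entities-openai-1M", split="train")
    emb = np.asarray(ds["openai"], dtype=np.float32)  # ~1M x 1536

    train, test = emb[:-test_size], emb[-test_size:]
    with h5py.File(out_path, "w") as f:
        # OpenAI embeddings are typically compared with cosine similarity.
        f.attrs["distance"] = "angular"
        f.create_dataset("train", data=train)
        f.create_dataset("test", data=test)
        # ann-benchmarks also stores ground-truth "neighbors"/"distances",
        # computed with a brute-force pass; omitted here for brevity.
```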

erikbern commented 1 year ago

Nice! Are there bigger ones btw? We have a few datasets already that are around 1M vectors so it might be interesting to try something larger (like 3-10M)

KShivendu commented 1 year ago

> Are there bigger ones btw?

While working on this benchmark, we didn't find any public dataset with >=1536 dimensions, so I created one. We are planning to scale this up to 10M or even 100M vectors in the coming weeks/months, and I'll open PRs here when we do :)

In the meantime, please note that this 1M dbpedia-entities dataset takes a lot of compute/RAM/time to run because of its 1536 dimensions: roughly ~17GB of RAM with Qdrant and ~13GB with PGVector.
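For context, a back-of-the-envelope estimate puts the raw float32 vectors alone at roughly 5.7 GiB; the rest of the footprint comes from the index structures each engine builds on top:

```python
# Back-of-the-envelope memory estimate for the raw vectors alone
# (index overhead, e.g. HNSW graph links, comes on top of this).
vectors = 1_000_000
dims = 1536
bytes_per_float = 4  # float32
raw_gib = vectors * dims * bytes_per_float / 1024**3
print(f"{raw_gib:.1f} GiB of raw float32 vectors")  # ~5.7 GiB
```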

I'll also make the changes you suggested. Thanks for your quick response :D

erikbern commented 1 year ago

Nice, thanks! I'll run this and add the dataset to S3.