erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

feat: Add new dataset with OpenAI embeddings for 1M DBpedia entities #434

Closed · KShivendu closed this 1 year ago

KShivendu commented 1 year ago

Add a new dataset with OpenAI embeddings for DBpedia entities.

This PR also introduces the HuggingFace `datasets` library, which can load any dataset hosted on HuggingFace. This makes it easier for anyone to fork the repo and benchmark against almost any public dataset :)
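As a rough sketch of the idea (the dataset name `KShivendu/dbpedia-entities-openai-1M`, the `openai` column, and the HDF5 layout below are assumptions for illustration, not necessarily the PR's final code), pulling the embeddings with `datasets` and writing them into an ann-benchmarks-style HDF5 file could look like this:

```python
# Minimal sketch: load OpenAI DBpedia embeddings from HuggingFace and write
# them into an HDF5 file shaped like the ones ann-benchmarks consumes.
# Dataset name and column names are assumed for illustration.
import h5py
import numpy as np
from datasets import load_dataset


def build_dbpedia_openai(out_path: str, test_size: int = 10_000) -> None:
    ds = load_dataset("KShivendu/dbpedia-entities-openai-1M", split="train")
    emb = np.asarray(ds["openai"], dtype=np.float32)  # ~1M x 1536

    train, test = emb[:-test_size], emb[-test_size:]
    with h5py.File(out_path, "w") as f:
        # OpenAI embeddings are typically compared with cosine similarity.
        f.attrs["distance"] = "angular"
        f.create_dataset("train", data=train)
        f.create_dataset("test", data=test)
        # ann-benchmarks also stores ground-truth "neighbors"/"distances",
        # computed with a brute-force pass; omitted here for brevity.
```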

erikbern commented 1 year ago

Nice! Are there bigger ones btw? We have a few datasets already that are around 1M vectors so it might be interesting to try something larger (like 3-10M)

KShivendu commented 1 year ago

> Are there bigger ones btw?

While working on this benchmark, we didn't find any public dataset with >=1536 dimensions, so I created one. We are planning to scale this up to 10M or even 100M vectors in the coming weeks/months, and I'll open PRs here when we do :)

In the meantime, please note that this 1M dbpedia-entities dataset takes a lot of compute/RAM/time to run because of its 1536 dimensions: roughly ~17GB of RAM with Qdrant and ~13GB with PGVector.
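For context, a back-of-the-envelope estimate puts the raw float32 vectors alone at roughly 5.7 GiB; the rest of the footprint comes from the index structures each engine builds on top:

```python
# Back-of-the-envelope memory estimate for the raw vectors alone
# (index overhead, e.g. HNSW graph links, comes on top of this).
vectors = 1_000_000
dims = 1536
bytes_per_float = 4  # float32
raw_gib = vectors * dims * bytes_per_float / 1024**3
print(f"{raw_gib:.1f} GiB of raw float32 vectors")  # ~5.7 GiB
```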

I'll also make the changes you suggested. Thanks for your quick response :D

erikbern commented 1 year ago

Nice, thanks! I'll run this and add the dataset to S3.