KShivendu closed this pull request 1 year ago
Nice! Are there bigger ones btw? We have a few datasets already that are around 1M vectors so it might be interesting to try something larger (like 3-10M)
While working on this benchmark, we didn't find any dataset with >=1536 dimensions, so I created one. We are planning to scale this up to 10M or even 100M vectors in the upcoming weeks/months. I'll create PRs here when we do :)
In the meantime, please note that this 1M dbpedia-entities dataset will take a lot of compute/RAM/time to run because of its 1536 dimensions. One needs ~17GB of RAM to run it with Qdrant and ~13GB with PGVector.
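For context on where those numbers come from, here is a rough back-of-envelope for the raw float32 vectors alone; index structures (e.g. HNSW graph links) and runtime overhead account for the rest of the observed footprint:

```python
# Back-of-envelope memory estimate for the raw float32 vectors alone.
# The index and runtime overhead add several more GB on top, which is
# why the observed footprint (~13-17GB) is well above this number.
num_vectors = 1_000_000
dims = 1536
bytes_per_float = 4  # float32

raw_bytes = num_vectors * dims * bytes_per_float
print(f"Raw vectors: {raw_bytes / 1024**3:.1f} GiB")  # ~5.7 GiB
```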
Also, I'll make the changes you suggested. Thanks for your quick response :D
Nice, thanks! I'll run this and add the dataset to S3.
Add a new dataset with OpenAI embeddings for DBpedia entities.
This PR also introduces the HuggingFace datasets library, which can load any dataset hosted on HuggingFace. This makes it easier for anyone to fork this repo and benchmark against almost any public dataset :)
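As a minimal sketch of what that looks like with the `datasets` library (the dataset ID and column name below are assumptions for illustration; substitute any public HuggingFace dataset ID):

```python
# Minimal sketch: pull a dataset from the HuggingFace Hub with the
# `datasets` library. The dataset ID is assumed here for the 1M DBpedia
# OpenAI-embeddings dataset; swap in any other public dataset ID.
from datasets import load_dataset

ds = load_dataset("KShivendu/dbpedia-entities-openai-1M", split="train")

# Each row should carry the text and its 1536-dim OpenAI embedding
# (the "openai" column name is an assumption).
row = ds[0]
print(len(row["openai"]))  # -> 1536
```

For datasets at the 10M+ scale, passing `streaming=True` to `load_dataset` avoids downloading the full dataset up front.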