erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

Idea: Cohere Wikipedia Dataset #393

Open mmmaia opened 1 year ago

mmmaia commented 1 year ago

I believe Cohere's recently released Wikipedia Embedding Archives could be a good addition to the benchmark datasets.

The multilingual nature of the dataset is worth noting.


| Wikipedia | Number of vectors / embedded passages |
| --- | --- |
| English | 35 million |
| German | 15 million |
| French | 13 million |
| Spanish | 10 million |
| Italian | 8 million |
| Japanese | 5 million |
| Arabic | 3 million |
| Chinese (Simplified) | 2 million |
| Korean | 1 million |
| Simple English | 486 thousand |
| Hindi | 432 thousand |
| Total | 94 million |

erikbern commented 1 year ago

Good idea! Do you want to add it to https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/datasets.py (for English)? I'm about to run a new round of benchmarks so we could include that as one dataset.
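
For anyone picking this up, here is a rough sketch of the two steps a new dataset entry would probably need: holding out a set of query vectors, and computing exact nearest neighbors as ground truth. It is not the code in `datasets.py`. The function names `train_test_split` and `exact_neighbors` are hypothetical, the random vectors are a stand-in for the real embeddings, the dimension of 64 is arbitrary, and angular distance is an assumption about which metric suits these embeddings.

```python
import numpy as np


def train_test_split(vectors, test_size, seed=1):
    """Shuffle the rows and hold out `test_size` of them as query vectors."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(vectors))
    return vectors[order[test_size:]], vectors[order[:test_size]]


def exact_neighbors(train, test, count):
    """Brute-force top-`count` neighbors by angular (cosine) distance."""
    train_n = train / np.linalg.norm(train, axis=1, keepdims=True)
    test_n = test / np.linalg.norm(test, axis=1, keepdims=True)
    dist = 1.0 - test_n @ train_n.T  # shape: (n_test, n_train)
    idx = np.argsort(dist, axis=1)[:, :count]
    return idx, np.take_along_axis(dist, idx, axis=1)


# Synthetic stand-in for the embeddings (assumption: real data would be
# loaded from the published archive instead of generated here).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)

train, test = train_test_split(vectors, test_size=100)
neighbors, distances = exact_neighbors(train, test, count=10)
```

Brute force is only practical on a sample like this one. For the full 35 million English vectors, the ground truth would have to be computed in batches.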

mmmaia commented 1 year ago

I'm pretty new to this, so it would probably take me some time to get it working 😬

I may give it a try next week if nobody else does.

erikbern commented 1 year ago

OK, no rush, I can also take a look at it. But you're very welcome to look at it too if I don't have time!