Idea: Cohere Wikipedia Dataset

erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python

http://ann-benchmarks.com

MIT License

4.74k stars 718 forks

Open mmmaia opened 1 year ago

mmmaia commented 1 year ago

I believe Cohere's recently released Wikipedia Embedding Archives could be a good addition to the benchmark datasets.

The multilingual nature of the dataset is worth noting.

Wikipedia	Number of vectors / embedded passages
English	35 million
German	15 million
French	13 million
Spanish	10 million
Italian	8 million
Japanese	5 million
Arabic	3 million
Chinese (Simplified)	2 million
Korean	1 million
Simple English	486 thousand
Hindi	432 thousand
Total	94 million

erikbern commented 1 year ago

Good idea! Do you want to add it to https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/datasets.py (for English)? I'm about to run a new round of benchmarks so we could include that as one dataset.
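For anyone picking this up, here is a rough sketch of the preparation step, on the assumption that a benchmark dataset boils down to a train set, a held-out query set, and the exact nearest neighbors of each query. The function names (`build_benchmark_split`, `brute_force_neighbors`), the angular distance, and the synthetic vectors are all hypothetical stand-ins, not the actual code in `datasets.py`. It uses only the standard library for readability; a real run over millions of embeddings would need NumPy or similar.

```python
import heapq
import math
import random


def normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def angular_distance(a, b):
    """Return 1 - cosine similarity for two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def brute_force_neighbors(train, query, k):
    """Return the indices of the k exact nearest train vectors, closest first."""
    scored = ((angular_distance(row, query), i) for i, row in enumerate(train))
    return [i for _, i in heapq.nsmallest(k, scored)]


def build_benchmark_split(vectors, n_test, k, seed=0):
    """Hold out n_test queries and compute their exact neighbors in the rest."""
    rng = random.Random(seed)
    shuffled = [normalize(v) for v in vectors]
    rng.shuffle(shuffled)
    test, train = shuffled[:n_test], shuffled[n_test:]
    neighbors = [brute_force_neighbors(train, q, k) for q in test]
    return train, test, neighbors


# Tiny synthetic stand-in for the real embeddings (hypothetical data).
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
train, test, neighbors = build_benchmark_split(vectors, n_test=5, k=3)
print(len(train), len(test), neighbors[0])
```

The ground-truth neighbor lists are what an approximate index gets scored against, so they have to come from an exact search like this one rather than from any of the libraries being benchmarked.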

mmmaia commented 1 year ago

I'm pretty new to this, so it would probably take me some time to get it working 😬

I may give it a try next week if nobody else does.

erikbern commented 1 year ago

Ok no rush, I can also take a look at it. But you're very welcome to look at it too, if I don't have time to!