embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Add a Benchmark for Asian Languages #367

Open KennethEnevoldsen opened 2 months ago

KennethEnevoldsen commented 2 months ago

Linguistic Families and Proposed Languages:

East Asian Languages

South Asian Languages

Indic Languages:

Southeast Asian Languages

Central Asian Languages

West Asian (Middle Eastern) Languages

Note that this list does not claim to be comprehensive; feel free to add to it.

rasdani commented 2 months ago

I will take a stab at a Bengali benchmark together with a colleague of mine 👍

KennethEnevoldsen commented 2 months ago

Wonderful, @rasdani! Feel free to create an issue for this as well so that others can see that you are working on it.

gentaiscool commented 3 weeks ago

I created PRs for Indonesian languages (at least 10 additions from 2 corpora) and African languages. Once they are approved, I can add the languages to the list.