embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0
1.98k stars 276 forks

Add Voyage multilingual datasets #920

Open Muennighoff opened 5 months ago

Muennighoff commented 5 months ago

Lots of multilingual datasets are listed here: https://docs.google.com/spreadsheets/d/1qf0iYejG-9RgEEi13qB_SK_178-eNaeJDmSDNSj260A/edit?gid=1875159366#gid=1875159366 (from https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/). I imagine some of them are not in MTEB yet; it would be great to have them 🙌

KennethEnevoldsen commented 5 months ago

I know some of these are already covered, and some of them I can't seem to find (dan_news_summ_test). Do we have more references on these?

For convenience, here is a list (I have not checked every entry):

FRENCH

(@imenelydiaker can you have a look at these)

GERMAN

JAPANESE

KOREAN

SPANISH

OTHER

ENGLISH