embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Add a European Benchmark #361

Open KennethEnevoldsen opened 2 months ago

KennethEnevoldsen commented 2 months ago

Add a benchmark for the European languages. This issue gives an overview of the status.

The EU has 24 official languages:

Germanic Languages

Romance Languages

Slavic Languages

Baltic Languages

Finno-Ugric Languages

Other Indo-European Languages

Non-Indo-European Languages

In addition to this list, we might add (feel free to add languages that I might have missed):

Note: I haven't checked off languages that are only covered in bitext tasks or in translated tasks.

x-tabdeveloping commented 2 months ago

We might get reasonable coverage of Macedonian by including Bulgarian; they are pretty much as close as Bokmål and Danish. The same goes for Serbian and Croatian (maybe also Slovenian): up until the nineties, Serbo-Croatian was considered one language. Perhaps the reason to include both would be that Serbian is written in Cyrillic script (but then we can make sure the model understands that by having Bulgarian, Russian or Ukrainian).

x-tabdeveloping commented 2 months ago

We might also consider some minority languages and regional dialects for the sake of fairness and ethics. A handful of examples I can think of:

isaac-chung commented 2 months ago

Something from EURLEX like https://huggingface.co/datasets/coastalcph/multi_eurlex or https://huggingface.co/datasets/ddrg/super_eurlex is promising. Though for the classification task, MTEB doesn't seem to support multi-label classification, right?

KennethEnevoldsen commented 2 months ago

No, sadly it does not support it (would love a PR on it though).
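For context on what multi-label support would involve: each document carries a set of labels rather than a single one, so scoring compares label sets. The following is only a minimal sketch of micro-averaged F1 over label sets, not MTEB's API; the function name `micro_f1` and the toy labels are assumptions made up for illustration and are not taken from the EURLEX datasets.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over per-document label sets (hypothetical helper)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels that were predicted and are correct
        fp += len(p - g)  # labels that were predicted but are wrong
        fn += len(g - p)  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): each document can have several labels at once.
gold = [{"agriculture", "trade"}, {"finance"}, {"trade"}]
pred = [{"agriculture"}, {"finance", "trade"}, {"trade"}]
score = micro_f1(gold, pred)
```

In a real task, the predicted label sets would come from a classifier trained on top of the embeddings; that part is left out here.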

PierreMesure commented 2 months ago

Great job, Kenneth et al.! We might be able to contribute some Swedish data later this spring. We embed several types of Swedish administrative documents to enable semantic search in them, and we're planning to improve our evaluation pipeline this year. Will try to contribute back.

I'm not from Sápmi, but I think this is the right place to mention that Sami languages should be included at some point. I haven't heard of any Nordic LLM initiative regarding these languages, but I hope one emerges soon with a strong link to the people still using them.

KennethEnevoldsen commented 2 months ago

Thanks for the addition, @PierreMesure; if you know of any Sami datasets, feel free to add them or create an issue so that other contributors can add them.

Rysias commented 2 months ago

I think that with my recent PR, Albanian and Latvian have some representation, at least for a clustering task.