Add a European Benchmark

KennethEnevoldsen commented 2 months ago

Add a benchmark for the European languages. This issue gives an overview of the status.

The EU has 24 official languages:

Germanic Languages

[x] Danish - dan
[x] English - eng
[x] German - deu
[x] Dutch - nld
[x] Swedish - swe

Romance Languages

[x] French - fra
[x] Italian - ita
[x] Portuguese - por
[x] Spanish - spa
[x] Romanian - ron

Slavic Languages

[x] Bulgarian - bul
[x] Croatian - hrv
[x] Czech - ces
[x] Polish - pol
[x] Slovak - slk
[x] Slovenian - slv

Baltic Languages

[x] Latvian - lav
[x] Lithuanian - lit

Finno-Ugric Languages

[x] Estonian - est (2 tasks)
[x] Finnish - fin (covered with only 1 task outside of bitext and translations)
[x] Hungarian - hun (1 task)

Other Indo-European Languages

[x] Greek - ell
[x] Irish - gle (Celtic Language)

Non-Indo-European Languages

[x] Maltese - mlt (Semitic Language)

In addition to this list, we might add the following (feel free to add languages that I might have missed):

[x] Basque (although it is not an official EU language, it is officially recognized within Spain)
[x] Norwegian Nynorsk (official language of Norway, which is in the Schengen Area but not in the EU)
[x] Norwegian Bokmål (official language of Norway, which is in the Schengen Area but not in the EU)
[x] Icelandic (official language of Iceland, also in the Schengen Area but not in the EU)
[x] Albanian (official language of Albania, an EU candidate country whose citizens can travel visa-free to the Schengen Area)
[x] Serbian (official language of Serbia, an EU candidate country whose citizens can travel visa-free to the Schengen Area)
[x] Macedonian (official language of North Macedonia, an EU candidate country whose citizens can travel visa-free to the Schengen Area)
[x] Romani (recognized minority language in numerous European countries)

Note: I haven't checked off languages that are only covered in bitext tasks or in translated tasks.
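If it helps with tracking, the checklist of the 24 official languages could be kept as a small mapping from language family to ISO 639-3 codes. This is only a sketch: the codes are copied from the list above, but the dictionary layout and the `missing_languages` helper are illustrative names and not part of mteb.

```python
# ISO 639-3 codes copied from the checklist, grouped by language family.
# The structure and the helper below are hypothetical, not an mteb API.
EU_OFFICIAL_LANGUAGES = {
    "Germanic": ["dan", "eng", "deu", "nld", "swe"],
    "Romance": ["fra", "ita", "por", "spa", "ron"],
    "Slavic": ["bul", "hrv", "ces", "pol", "slk", "slv"],
    "Baltic": ["lav", "lit"],
    "Finno-Ugric": ["est", "fin", "hun"],
    "Greek": ["ell"],
    "Semitic": ["mlt"],
    "Celtic": ["gle"],
}


def missing_languages(covered):
    """Return the checklist codes that are absent from `covered`, sorted."""
    wanted = {code for codes in EU_OFFICIAL_LANGUAGES.values() for code in codes}
    return sorted(wanted - set(covered))
```

Calling `missing_languages` with the set of language codes that already have at least one non-bitext task would then give the remaining gaps.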

x-tabdeveloping commented 2 months ago

We might get reasonable coverage of Macedonian by including Bulgarian; they are pretty much as close as Bokmål and Danish. The same goes for Serbian and Croatian (maybe also Slovenian): up until the nineties, Serbo-Croatian was considered one language. Perhaps the reason to include both would be that Serbian is written in the Cyrillic script (but then we can make sure the model understands that by having Bulgarian, Russian or Ukrainian).

x-tabdeveloping commented 2 months ago

We might also consider some minority languages and regional dialects for the sake of fairness and ethics. A handful of examples I can think of:

Romani is a minority language in a lot of European countries, and Roma make up a sizeable portion of the population. I might be able to scramble together some resources, as the Hungarian Roma community is quite large.
Schwiizerdütsch, as it is very different from mainstream Hochdeutsch.
Sønderjysk, we can hopefully find some resources for that
Frisian, as about half a million people have it as their mother tongue

isaac-chung commented 2 months ago

Something from EURLEX like https://huggingface.co/datasets/coastalcph/multi_eurlex or https://huggingface.co/datasets/ddrg/super_eurlex is promising. Though for the classification task, MTEB doesn't seem to support multi-label classification, right?

KennethEnevoldsen commented 2 months ago

No, sadly it does not support it (would love a PR on it though).
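For anyone picking this up: in a multi-label setting each document carries a set of labels rather than a single one, so plain accuracy does not apply directly. A minimal sketch of one common metric, micro-averaged F1, is below. It is not MTEB code, and the function name and the example labels are made up for illustration.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over documents that each carry a set of labels."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical topic labels for two documents.
gold = [{"agriculture", "trade"}, {"energy"}]
pred = [{"agriculture"}, {"energy", "trade"}]
score = micro_f1(gold, pred)
```

A PR would also need to decide how the classifier on top of the embeddings produces several labels per document, which this sketch leaves open.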

PierreMesure commented 2 months ago

Great job, Kenneth et al.! We might be able to contribute some Swedish data later this spring. We embed several types of Swedish administrative documents to enable semantic search in them, and we're planning on improving our evaluation pipeline this year. Will try to contribute back.

I'm not from Sápmi, but I think this is the right place to mention that the Sami languages should be included at some point. I haven't heard of any Nordic LLM initiative regarding these languages, but I hope one emerges soon with a strong link to the people still using them.

KennethEnevoldsen commented 2 months ago

Thanks for the addition, @PierreMesure; if you know of any Sami datasets, feel free to add them or create an issue so that other contributors can add them.

Rysias commented 2 months ago

I think with my recent PR that Albanian and Latvian have some representation - at least for a clustering task

embeddings-benchmark / mteb