embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Some datasets for languages. #419

x-tabdeveloping closed 1 month ago

x-tabdeveloping commented 6 months ago

I'm gonna practice drums for the rest of the day and probably won't work tomorrow, but for those who are looking to contribute and get some of those juicy points, here is some low-hanging fruit in diverse languages:

Slovak:

Greek:

Maltese:

dokato commented 5 months ago

I'm gonna pick up kiviki/SlovakSum if no one is on it yet.

dokato commented 5 months ago

On the other hand, it seems like the summary task requires:

        human_summaries: list[str]
        machine_summaries: list[str]
        relevance: list[float] (the score of the machine generated summaries)

and kiviki/SlovakSum has neither machine_summaries nor relevance scores.
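A quick way to see the mismatch is to compare a dataset's columns against the three fields listed above. This is only a sketch: the helper name `missing_summary_fields` and the example column names `text` and `summary` are made up for illustration and are not taken from mteb or from kiviki/SlovakSum.

```python
# The three fields the summary task is said to require (from the comment above).
REQUIRED_SUMMARY_FIELDS = {"human_summaries", "machine_summaries", "relevance"}


def missing_summary_fields(column_names):
    """Return the required summary-task fields that a dataset does not provide."""
    return sorted(REQUIRED_SUMMARY_FIELDS - set(column_names))


# Hypothetical columns of a plain article/summary dataset (an assumption).
print(missing_summary_fields(["text", "summary"]))
```

If the returned list is non-empty, the dataset can't be dropped into the summary task as-is.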

x-tabdeveloping commented 5 months ago

@dokato Try formulating it as a retrieval task instead :))
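One way to read that suggestion: treat each summary as a query and its source article as the single relevant document. Below is a rough sketch under that assumption. The function name `summaries_to_retrieval`, the field names `text` and `summary`, and the `q0`/`d0` id scheme are all hypothetical, and the corpus/queries/relevance layout is only loosely modeled on what retrieval benchmarks commonly use, so the exact structure mteb expects should be checked against the repo.

```python
def summaries_to_retrieval(records, text_key="text", summary_key="summary"):
    """Turn (article, summary) pairs into a corpus, queries, and relevance map.

    Each summary becomes a query whose only relevant document is the
    article it was written for.
    """
    corpus, queries, relevant_docs = {}, {}, {}
    for i, record in enumerate(records):
        doc_id, query_id = f"d{i}", f"q{i}"
        corpus[doc_id] = {"text": record[text_key]}
        queries[query_id] = record[summary_key]
        relevant_docs[query_id] = {doc_id: 1}  # 1 = relevant
    return corpus, queries, relevant_docs


# Toy records standing in for real data (made up for illustration).
records = [
    {"text": "Long article about the weather.", "summary": "Weather report."},
    {"text": "Long article about football.", "summary": "Match recap."},
]
corpus, queries, relevant_docs = summaries_to_retrieval(records)
print(relevant_docs)
```

This needs no machine_summaries or relevance scores, since the pairing between article and summary supplies the relevance labels.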

wissam-sib commented 5 months ago

I can start working on the Maltese datasets if no one else is.

x-tabdeveloping commented 5 months ago

@wissam-sib Please verify that no one has added them yet or is working on a PR, otherwise feel free to go ahead :D

wissam-sib commented 5 months ago

News categories is being added, so I'm gonna go for the NLI one.

mariyahendriksen commented 4 months ago

I will take care of Greek medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek

KennethEnevoldsen commented 1 month ago

Will close this issue for now - I assume many of these are still relevant to add. If so, we should probably create separate PRs for these.

@mariyahendriksen do you still want to add the Greek medical QA?