Some datasets for languages.

x-tabdeveloping commented 6 months ago

I'm gonna practice drums for the rest of the day and probably won't work tomorrow, but for those who are looking to contribute and get some of those juicy points here is some low-hanging fruit in diverse languages:

Slovak:

~~Sentiment: https://huggingface.co/datasets/sepidmnorozy/Slovak_sentiment~~ (as a matter of fact she has loads of Sentiment classification datasets: https://huggingface.co/sepidmnorozy)
News Summarization: https://huggingface.co/datasets/kiviki/SlovakSum

Greek:

~~Legal code clustering: https://huggingface.co/datasets/AI-team-UoA/greek_legal_code~~
NLI: https://huggingface.co/datasets/Harsit/xnli2.0_greek
Medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek

Maltese:

News titles: https://huggingface.co/datasets/MLRS/maltese_news_headlines
News categories: https://huggingface.co/datasets/MLRS/maltese_news_categories

dokato commented 5 months ago

I'm gonna pick up kiviki/SlovakSum if no one is on it yet.

dokato commented 5 months ago

On the other hand it seems like the summary task requires:

        human_summaries: list[str]
        machine_summaries: list[str]
        relevance: list[float] (the score of the machine generated summaries)

and kiviki/SlovakSum has neither machine_summaries nor relevance scores.
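For reference, a minimal sketch of that check in plain Python. The three required field names come from the list above; the helper name `missing_summary_fields` and the example column names are hypothetical and are not taken from mteb or from the actual SlovakSum schema.

```python
# Fields the summary task is described as requiring (from the comment above).
REQUIRED_FIELDS = ("human_summaries", "machine_summaries", "relevance")


def missing_summary_fields(columns):
    """Return the required fields that are absent from `columns`, in order."""
    present = set(columns)
    return [field for field in REQUIRED_FIELDS if field not in present]


# Hypothetical column names for a dataset that only ships reference summaries.
example_columns = ["text", "human_summaries"]
print(missing_summary_fields(example_columns))
```

Any non-empty result means the dataset cannot be used for the summary task as is.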

x-tabdeveloping commented 5 months ago

@dokato Try formulating it as a retrieval task instead :))
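One way to read that suggestion, as a rough sketch: treat each summary as a query and its source article as the single relevant document. Everything below is an assumption made for illustration: the `text` and `summary` keys, the `to_retrieval` helper, the id scheme, and the three-dictionary layout are hypothetical and may not match what mteb actually expects.

```python
def to_retrieval(records, text_key="text", summary_key="summary"):
    """Turn (article, summary) pairs into corpus, queries, and relevance judgments.

    Each summary becomes a query whose only relevant document is the
    article it was written for.
    """
    corpus = {}         # doc id -> article text
    queries = {}        # query id -> summary text
    relevant_docs = {}  # query id -> {doc id: relevance score}
    for i, record in enumerate(records):
        doc_id = f"d{i}"
        query_id = f"q{i}"
        corpus[doc_id] = record[text_key]
        queries[query_id] = record[summary_key]
        relevant_docs[query_id] = {doc_id: 1}
    return corpus, queries, relevant_docs


# Placeholder records, not real SlovakSum data.
records = [
    {"text": "Article body A", "summary": "Summary of A"},
    {"text": "Article body B", "summary": "Summary of B"},
]
corpus, queries, relevant_docs = to_retrieval(records)
print(relevant_docs)
```

This sidesteps the missing machine_summaries and relevance fields, since the only supervision needed is the pairing between an article and its own summary.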

wissam-sib commented 5 months ago

I can start working on the Maltese datasets if no one else is.

x-tabdeveloping commented 5 months ago

@wissam-sib Please verify that no one has added them yet or is working on a PR, otherwise feel free to go ahead :D

wissam-sib commented 5 months ago

News categories is being added so I'm gonna go for the NLI one

mariyahendriksen commented 4 months ago

I will take care of Greek medical QA: https://huggingface.co/datasets/ilsp/medical_mcqa_greek

KennethEnevoldsen commented 1 month ago

Will close this issue for now. I assume many of these are still relevant to add; if so, we should probably create separate PRs for these.

@mariyahendriksen do you still want to add the Greek medical QA?

embeddings-benchmark / mteb