embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Question about Adding Datasets #802

Closed Ruqyai closed 5 months ago

Ruqyai commented 5 months ago

Question:

I don't know if I am allowed to add a dataset that is not my own work. In the first submission, I collected the data myself through web scraping.

However, when I browsed the tasks folder, I found that Arabic coverage is limited almost entirely to classification. There are several large Arabic datasets available, such as the one linked below. Can I add it now?

https://huggingface.co/datasets/Cohere/miracl-ar-corpus-22-12

Or others from the list: https://huggingface.co/datasets?language=language:ar&sort=trending

imenelydiaker commented 5 months ago

Hello,

Yes, you can add datasets that come from other sources and are not your own work.

As for this one, we already handle MIRACL for all languages; see PR #642.

Feel free to add any other dataset from the list, just make sure it is of good quality.

Ruqyai commented 5 months ago

Thanks a lot, @imenelydiaker.

KennethEnevoldsen commented 5 months ago

This issue seems resolved, so I will close it for now. Feel free to reopen it.