embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Adding False Friends Dataset to German MTEB #331

Open achibb opened 4 months ago

achibb commented 4 months ago

Hi all,

dataset: https://huggingface.co/datasets/aari1995/false_friends_en_de

associated paper

This dataset can be used to test German, and especially multilingual, models on their understanding of "false friends". False friends are words that sound the same or are spelled similarly but have different meanings in different languages.

Example false friend: German: boot (boat), English: boot (shoe)

The evaluation is the following:

input sentence: ich lebe auf einem boot (i live on a boat)
true_synonym: ich lebe auf einem schiff (i live on a ship)
false_friend: ich lebe auf einem schuh (i live on a shoe)

Multilingual models mostly perform worse than German models, and weak ones even score below chance, which suggests they work best for English and prefer English representations.
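One possible reading of the evaluation above, sketched in Python: a model gets a triplet right when the input sentence is closer to the true synonym than to the false friend, so chance level is 50%. This is only an illustration and not necessarily the method used in the paper or the PR. The cosine comparison, the `triplet_accuracy` helper, and the hand-made 2-d vectors standing in for a real embedding model are all assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

# (input sentence, true_synonym, false_friend), following the fields named above.
Triplet = Tuple[str, str, str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triplet_accuracy(
    triplets: List[Triplet],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Share of triplets where the true synonym scores strictly higher than the false friend."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = 0
    for sentence, synonym, false_friend in triplets:
        anchor = embed(sentence)
        if cosine(anchor, embed(synonym)) > cosine(anchor, embed(false_friend)):
            correct += 1
    return correct / len(triplets)


triplets = [
    ("ich lebe auf einem boot", "ich lebe auf einem schiff", "ich lebe auf einem schuh"),
]

# Hypothetical toy "models": hand-made vectors, not real embeddings.
# The first places "boot" near "schiff" (German reading),
# the second places it near "schuh" (English reading).
toy_german_aware = {
    "ich lebe auf einem boot": [1.0, 0.1],
    "ich lebe auf einem schiff": [0.9, 0.2],
    "ich lebe auf einem schuh": [0.1, 1.0],
}
toy_english_biased = {
    "ich lebe auf einem boot": [0.1, 1.0],
    "ich lebe auf einem schiff": [0.9, 0.2],
    "ich lebe auf einem schuh": [0.2, 0.9],
}

german_aware_acc = triplet_accuracy(triplets, toy_german_aware.__getitem__)
english_biased_acc = triplet_accuracy(triplets, toy_english_biased.__getitem__)
```

With a real model, `embed` would wrap that model's encoding call; the toy dictionaries only show how an English-biased representation ends up on the wrong side of the comparison.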

May I add it to the German MTEB tasks, and if so, what should I consider? Is it a reranking task?

Thanks and all the best, Aaron

KennethEnevoldsen commented 4 months ago

Hi @achibb we very much welcome contributions.

There is a guide to adding a dataset here. Generally, we want a new dataset to cover something not previously covered by the benchmark. This indeed sounds like such a case.

To me, it seems like a pair classification task where the pairs would be as follows:

(input sentence, true_synonym, label=positive)
(input sentence, false_friend, label=negative)
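As a rough sketch, the pairing above could be produced from the triplets like this. The helper name `triplets_to_pairs` and the 1/0 label encoding are assumptions for illustration; the column names and label format that MTEB expects should be taken from the dataset guide.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]   # (input sentence, true_synonym, false_friend)
Pair = Tuple[str, str, int]      # (sentence 1, sentence 2, label)


def triplets_to_pairs(triplets: List[Triplet]) -> List[Pair]:
    """Expand each triplet into one positive (1) and one negative (0) pair."""
    pairs: List[Pair] = []
    for sentence, synonym, false_friend in triplets:
        pairs.append((sentence, synonym, 1))       # true synonym -> positive
        pairs.append((sentence, false_friend, 0))  # false friend -> negative
    return pairs


example = [
    ("ich lebe auf einem boot", "ich lebe auf einem schiff", "ich lebe auf einem schuh"),
]
pairs = triplets_to_pairs(example)
```

Each triplet yields exactly two pairs, so the resulting pair set is balanced between positive and negative labels.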

KranthiGV commented 4 months ago

I'm interested and would like to offer any help with this. Happy to collaborate or take on any tasks needed. Let me know!

KennethEnevoldsen commented 4 months ago

Wonderful @KranthiGV! I believe there are some potentially good additions in the Aya and OpenAssistant datasets (large multilingual). If you start working on them, please create an issue so that others can see that you are working on it. Alternatively, if you are less interested in dataset creation and more interested in improving the codebase, we also have a few tasks there.

@achibb do you have the time to add this dataset?

achibb commented 4 months ago

Yes, it is planned for today or next week at the latest. Is that fine, or does it interfere with a release?


achibb commented 4 months ago

The PR adding false friends as a pair classification task:

https://github.com/embeddings-benchmark/mteb/pull/349

Tests so far look aligned with the research, even though the method is slightly modified. What do you think? Is there anything else I need to do? :) I am thinking about building such tests for other languages as well.

KennethEnevoldsen commented 4 months ago

Thanks @achibb - Will respond to the above in the PR

achibb commented 4 months ago

Nice, thanks @KennethEnevoldsen. I adjusted it, except for two points where I could not find the metadata being used in any file I looked at. If I get some feedback I will work it in, but I would also be fine going with the current state. I am also excited to see what @Muennighoff thinks about including it in run_mteb_german.py and the benchmarks, as I see this as quite a nice opportunity for adversarial testing :) Enjoy the weekend, guys!

KennethEnevoldsen commented 4 months ago

Yeah, the metadata in the old datasets is not great (we are working on it!), but the documentation for adding a new dataset has some examples.

Love the adversarial testing part! Would love to see more datasets like this one.

achibb commented 4 months ago

Yeah, no worries :).

Yeah, totally. I am also trying to contribute and am currently forming a group for multilingual work at hf.co/multilingual

Anyone who wants to and has time is welcome to join. I would love to see more languages getting false friends tests.

KennethEnevoldsen commented 4 months ago

Ahh, that is great. Is it targeted at adversarial-style tests?