embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

[MMTEB] About `Mewsli-X`, `XQuAD-R` and contributing scores #347

Closed izhx closed 5 months ago

izhx commented 5 months ago

I have code for the mMarco integration. Would MMTEB be happy to accept this machine-translated data?

I can help contribute scores for some models from the following work, and I can also help with new tests.

Are Multilingual Autoregressive Language Models Good Universal Embedders?

izhx commented 5 months ago

Would you consider adding multilingual code embedding tasks, such as CodeSearchNet?

izhx commented 5 months ago

I also have code for Mewsli-X entity-linking-retrieval and XQuAD-R of LAReQA. They are part of XTREME-R.

They aim to retrieve answers from a multilingual candidate pool, which is different from the monolingual tests of Miracl etc.

KennethEnevoldsen commented 5 months ago

We believe that there are plenty of good datasets that are not machine-translated. To avoid artifacts, we don't accept machine-translated datasets unless they have been validated.

Re Mewsli-X, it depends on the format. Assuming the task is to take an entity mention in context and retrieve the correct description of that entity, I believe it is reasonable.

izhx commented 5 months ago

> We believe that there are plenty of good datasets that are not machine-translated. To avoid artifacts, we don't accept machine-translated datasets unless they have been validated.

I agree with that. Do we need to clean up the existing machine-translated data, such as STSBenchmarkMultilingualSTS and MMarcoRetrieval (from C-MTEB)?

izhx commented 5 months ago

Mewsli-X

Yes, the task is to retrieve the correct entity description (a text description from Wikipedia) given the mention sentence (from WikiNews). We could evaluate it monolingually or cross-lingually (with some minor code patches to share the embeddings of the multilingual pool).
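As a rough illustration of the two evaluation modes, here is a minimal sketch in plain Python. It assumes that "monolingual" means ranking only the descriptions written in the mention's language, and "cross-lingual" means ranking the whole shared pool. The `retrieve` helper, the toy vectors, and the entity IDs are all hypothetical and are not part of mteb or the Mewsli-X release.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(mention_vec, pool, lang=None, k=1):
    """Return the ids of the top-k entity descriptions for a mention.

    pool: list of (entity_id, language, embedding) tuples, embedded once
          and shared across all query languages.
    lang: if given, only descriptions in that language are ranked
          (monolingual); if None, the whole pool is ranked (cross-lingual).
    """
    candidates = [e for e in pool if lang is None or e[1] == lang]
    ranked = sorted(candidates, key=lambda e: cosine(mention_vec, e[2]), reverse=True)
    return [entity_id for entity_id, _, _ in ranked[:k]]


# Toy shared pool; the vectors stand in for real description embeddings.
pool = [
    ("Q1-en", "en", [1.0, 0.0]),
    ("Q2-de", "de", [0.0, 1.0]),
    ("Q3-de", "de", [0.9, 0.1]),
]
mention = [1.0, 0.0]  # hypothetical embedding of a German mention sentence

cross_lingual = retrieve(mention, pool)            # ranks all languages
monolingual = retrieve(mention, pool, lang="de")   # ranks German descriptions only
```

Because the pool is embedded once, the only difference between the two modes in this sketch is the candidate filter.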

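Statistics (from its README)

| Split | Mentions | Total | ar | de | en | es | fa | ja | pl | ro | ta | tr | uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dev | overall | 2,991 | 318 | 326 | 316 | 311 | 72 | 310 | 304 | 145 | 312 | 262 | 315 |
| dev | cross-lingual | 2,285 | 275 | 210 | 214 | 231 | 68 | 177 | 202 | 127 | 306 | 226 | 249 |
| test | overall | 14,624 | 1,501 | 1,551 | 1,490 | 1,552 | 458 | 1,519 | 1,562 | 672 | 1,567 | 1,215 | 1,537 |
| test | cross-lingual | 10,967 | 1,313 | 1,023 | 1,009 | 1,082 | 416 | 834 | 1,014 | 601 | 1,510 | 1,004 | 1,161 |

Corpus statistics (for the above languages), by description language

| | Total | ar | de | en | es | fa | ja | pl | ro | ta | tr | uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Entities | 550,218 | 28,220 | 73,076 | 257,008 | 41,808 | 16,895 | 50,817 | 37,691 | 8,236 | 4,864 | 8,326 | 23,277 |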
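
For a sense of scale, the share of cross-lingual mentions can be computed directly from the totals above. This is only a small arithmetic sketch; the dictionary layout and the `cross_lingual_share` helper are illustrative, and the numbers are copied from the statistics in this comment.

```python
# Total mention counts per split, copied from the statistics above.
mentions = {
    "dev": {"overall": 2991, "cross_lingual": 2285},
    "test": {"overall": 14624, "cross_lingual": 10967},
}


def cross_lingual_share(split):
    """Fraction of mentions in a split that are marked cross-lingual."""
    counts = mentions[split]
    return counts["cross_lingual"] / counts["overall"]


for split in mentions:
    print(f"{split}: {cross_lingual_share(split):.1%} of mentions are cross-lingual")
```

In both splits roughly three quarters of the mentions are cross-lingual.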

XNLI

It seems we don't yet have an abstract class for NLI tasks. XNLI is a popular and reliable multilingual NLI dataset. Perhaps it's worth including.

KennethEnevoldsen commented 5 months ago

> I agree with that. Do we need to clean up the existing machine-translated data, such as STSBenchmarkMultilingualSTS and MMarcoRetrieval (from C-MTEB)?

It would be great if you could add an annotation to any dataset that is machine-translated. We might then remove those datasets at the end of MMTEB (if we decide we want to change the Chinese benchmark).
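
One way such an annotation could be used later is sketched below. The `machine_translated` flag, the record layout, and the `SomeNativeTask` entry are hypothetical and do not reflect mteb's actual task metadata. The two flagged names are the ones described as machine-translated earlier in this thread.

```python
# Hypothetical task records; only the flag matters for this sketch.
tasks = [
    {"name": "STSBenchmarkMultilingualSTS", "machine_translated": True},
    {"name": "MMarcoRetrieval", "machine_translated": True},
    {"name": "SomeNativeTask", "machine_translated": False},  # placeholder entry
]


def split_by_annotation(records):
    """Separate task names into (kept, flagged) lists based on the flag."""
    kept = [r["name"] for r in records if not r["machine_translated"]]
    flagged = [r["name"] for r in records if r["machine_translated"]]
    return kept, flagged


kept, flagged = split_by_annotation(tasks)
print("candidates for removal:", flagged)
```

With the flag recorded per task, deciding what to drop at the end becomes a simple filter instead of a manual audit.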

KennethEnevoldsen commented 5 months ago

Thanks for the wonderful stats @izhx. What would a cross-lingual mention mean in this case? Is it a text where the entities have multiple corresponding entries in different languages? Would the task then be to retrieve all relevant entries regardless of language? That seems odd to me. I would imagine that retrieving in the language of the text is the most valid setup (but I might be missing something).

orionw commented 5 months ago

I was also wondering about cross-lingual content, as it's a big topic for researchers around me. However, there are so many possible cross-lingual directions, and most datasets are typically English->XX or XX->English only.

It would definitely be nice to include, but it might be tricky to get decent coverage. Might be worth a broader discussion or just leaving them out entirely.

KennethEnevoldsen commented 5 months ago

I would love a broader discussion of that field. It would help us make the right decisions. Feel free to start a discussion thread.

orionw commented 5 months ago

Started in https://github.com/embeddings-benchmark/mteb/discussions/362