[MMTEB] About `Mewsli-X`, `XQuAD-R` and contributing scores

izhx commented 5 months ago

I have code for the mMarco integration. Would MMTEB accept this kind of machine-translated data?

I can contribute scores for some models from the following work, and I can also help with new tests.

Are Multilingual Autoregressive Language Models Good Universal Embedders?

izhx commented 5 months ago

~~Would you consider adding multilingual code embedding tasks, such as CodeSearchNet?~~

izhx commented 5 months ago

I also have code for Mewsli-X entity-linking-retrieval and XQuAD-R of LAReQA. They are part of XTREME-R.

They aim to retrieve answers from a multilingual candidate pool, which is different from the monolingual tests of Miracl etc.

KennethEnevoldsen commented 5 months ago

We believe that there are plenty of good datasets that are not machine-translated. To avoid artifacts, we don't accept machine-translated datasets unless they have been validated.

Re Mewsli-X, it depends on the format. Assuming the task is to use an entity mention in context to retrieve the correct description of that entity, I believe it is reasonable.

izhx commented 5 months ago

> We believe that there are plenty of good datasets that are not machine-translated. To avoid artifacts, we don't accept machine-translated datasets unless they have been validated.

I agree with that. Do we need to clean up the existing MT data, such as STSBenchmarkMultilingualSTS and MMarcoRetrieval (from C-MTEB)?

izhx commented 5 months ago

Mewsli-X

Yes, the task is to retrieve the correct entity description (a text description from Wikipedia) given the mention sentence (from WikiNews). We could evaluate it monolingually or cross-lingually (with some minor code patches to share the embeddings of the multilingual pool).
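To make the two evaluation modes concrete, here is a minimal sketch, not the actual mteb or Mewsli-X code. It assumes the entity descriptions of all languages are embedded once into a single shared pool; the function name `top1_accuracy`, its parameters, and the toy vectors are all hypothetical. The cross-lingual setting ranks the whole pool, while the monolingual setting masks out candidates whose language differs from the mention's.

```python
import numpy as np


def top1_accuracy(mention_vecs, mention_langs, gold_ids,
                  entity_vecs, entity_langs, same_language_only=False):
    """Top-1 retrieval accuracy over a shared multilingual entity pool (hypothetical helper)."""
    # Cosine similarity between every mention and every entity description.
    m = mention_vecs / np.linalg.norm(mention_vecs, axis=1, keepdims=True)
    e = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    scores = m @ e.T
    if same_language_only:
        # Monolingual setting: hide candidates written in another language.
        other_lang = np.asarray(mention_langs)[:, None] != np.asarray(entity_langs)[None, :]
        scores = np.where(other_lang, -np.inf, scores)
    predictions = scores.argmax(axis=1)
    return float((predictions == np.asarray(gold_ids)).mean())


# Toy data (made up): three entity descriptions, two German mentions.
entity_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
entity_langs = ["en", "de", "en"]
mention_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
mention_langs = ["de", "de"]
gold_ids = [0, 1]  # the first mention's gold description is in English

cross = top1_accuracy(mention_vecs, mention_langs, gold_ids, entity_vecs, entity_langs)
mono = top1_accuracy(mention_vecs, mention_langs, gold_ids, entity_vecs, entity_langs,
                     same_language_only=True)
print(cross, mono)
```

In this toy setup the first mention can only be resolved in the cross-lingual setting, because its gold description is in a different language. A real monolingual evaluation would probably have to drop or separately report such mentions.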

Statistics (from its README)

| Split | Mentions | Total | ar | de | en | es | fa | ja | pl | ro | ta | tr | uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `dev` | overall | 2,991 | 318 | 326 | 316 | 311 | 72 | 310 | 304 | 145 | 312 | 262 | 315 |
| `dev` | cross-lingual | 2,285 | 275 | 210 | 214 | 231 | 68 | 177 | 202 | 127 | 306 | 226 | 249 |
| `test` | overall | 14,624 | 1,501 | 1,551 | 1,490 | 1,552 | 458 | 1,519 | 1,562 | 672 | 1,567 | 1,215 | 1,537 |
| `test` | cross-lingual | 10,967 | 1,313 | 1,023 | 1,009 | 1,082 | 416 | 834 | 1,014 | 601 | 1,510 | 1,004 | 1,161 |

Corpus statistics (for the above languages)

| Description language | Total | `ar` | `de` | `en` | `es` | `fa` | `ja` | `pl` | `ro` | `ta` | `tr` | `uk` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Entities | 550,218 | 28,220 | 73,076 | 257,008 | 41,808 | 16,895 | 50,817 | 37,691 | 8,236 | 4,864 | 8,326 | 23,277 |
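As a quick reading aid for the statistics above, the share of `test` mentions that the README counts as cross-lingual can be computed per language. This is only a sketch over the numbers quoted in this comment; the variable names are made up for illustration.

```python
# Test-split mention counts copied from the statistics above.
test_overall = {"ar": 1501, "de": 1551, "en": 1490, "es": 1552, "fa": 458, "ja": 1519,
                "pl": 1562, "ro": 672, "ta": 1567, "tr": 1215, "uk": 1537}
test_cross = {"ar": 1313, "de": 1023, "en": 1009, "es": 1082, "fa": 416, "ja": 834,
              "pl": 1014, "ro": 601, "ta": 1510, "tr": 1004, "uk": 1161}

# Fraction of mentions per language that are labelled cross-lingual.
cross_share = {lang: test_cross[lang] / test_overall[lang] for lang in test_overall}
total_share = sum(test_cross.values()) / sum(test_overall.values())

for lang, share in sorted(cross_share.items(), key=lambda item: item[1]):
    print(f"{lang}: {share:.1%}")
print(f"total: {total_share:.1%}")
```

Overall, about three quarters of the `test` mentions (10,967 of 14,624) fall into the cross-lingual category, so a purely monolingual evaluation would leave most of the data unused.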

XNLI

It seems we don't yet have an abstract class for NLI tasks. XNLI is a popular and reliable multilingual NLI dataset. Perhaps it's worth including.

KennethEnevoldsen commented 5 months ago

> I agree with that. Do we need to clean up the existing MT data, such as STSBenchmarkMultilingualSTS and MMarcoRetrieval (from C-MTEB)?

It would be great if you could add an annotation wherever a dataset is machine-translated; we might then remove those datasets at the end of MMTEB (if we decide we want to change the Chinese benchmark).

KennethEnevoldsen commented 5 months ago

Thanks for the wonderful stats @izhx. What would a cross-lingual mention mean in this case? Is it a text where the entities have multiple corresponding entries in different languages? Would the task then be to retrieve all relevant entries regardless of language? That seems odd to me; I would imagine that retrieving in the language of the text is the most valid setup (but I might be missing something).

orionw commented 5 months ago

I was also wondering about cross-lingual content, as it's a big topic for researchers around me. However, there are so many possible cross-lingual directions, and most are typically English->XX or XX->English only.

It would definitely be nice to include, but it might be tricky to get decent coverage. Might be worth a broader discussion or just leaving them out entirely.

KennethEnevoldsen commented 5 months ago

I would love a broader discussion of that field. It would help us make the right decisions. Feel free to start a discussion thread.

orionw commented 5 months ago

Started in https://github.com/embeddings-benchmark/mteb/discussions/362

embeddings-benchmark / mteb

[MMTEB] About `Mewsli-X`, `XQuAD-R` and contributing scores #347
