Closed izhx closed 5 months ago
Would you consider adding multilingual code embedding tasks, such as CodeSearchNet?
We believe that there are plenty of good datasets that are not machine-translated. To avoid artifacts, we don't accept machine-translated datasets unless they have been validated.
Re Mewsli-X, it depends on the format. Assuming the task is to retrieve the correct description of an entity given a mention of that entity in context, I believe it is reasonable.
> We believe that there are plenty of good datasets that are not machine-translated. To avoid artifacts, we don't accept machine-translated datasets unless they have been validated.

I agree with that. Do we need to clean up the existing machine-translated data, such as STSBenchmarkMultilingualSTS and MMarcoRetrieval (from cmteb)?
Yes, the task is to retrieve the correct entity description (a text description from Wikipedia) given the mention sentence (from WikiNews). We could evaluate it monolingually or cross-lingually (with some minor code patches to share the embeddings of the multilingual pool).
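To illustrate the difference between the two evaluation modes, here is a minimal sketch, assuming toy 2-d embeddings and a hypothetical `retrieve` helper (neither is taken from Mewsli-X or the MTEB codebase). In the monolingual mode each mention only sees entity descriptions in its own language; in the cross-lingual mode all descriptions share one candidate pool.

```python
import numpy as np

def retrieve(mention_vecs, entity_vecs, candidate_mask=None):
    """Return, per mention, the index of the entity with the highest cosine similarity.

    candidate_mask (optional) is a boolean [n_mentions, n_entities] array;
    entities where it is False are excluded for that mention.
    """
    m = mention_vecs / np.linalg.norm(mention_vecs, axis=1, keepdims=True)
    e = entity_vecs / np.linalg.norm(entity_vecs, axis=1, keepdims=True)
    scores = m @ e.T
    if candidate_mask is not None:
        scores = np.where(candidate_mask, scores, -np.inf)
    return scores.argmax(axis=1)

# Toy, made-up embeddings: three entity descriptions and two mentions.
entity_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
entity_langs = np.array(["en", "en", "de"])
mention_vecs = np.array([[1.0, 0.0], [0.1, 0.9]])
mention_langs = np.array(["de", "en"])

# Monolingual: restrict each mention to descriptions in its own language.
mono_mask = mention_langs[:, None] == entity_langs[None, :]
mono_pred = retrieve(mention_vecs, entity_vecs, mono_mask)

# Cross-lingual: every mention is scored against the shared multilingual pool.
cross_pred = retrieve(mention_vecs, entity_vecs)

print("monolingual:", mono_pred.tolist())
print("cross-lingual:", cross_pred.tolist())
```

In this toy setup the German mention retrieves a different entity once English descriptions enter the pool, which is the kind of effect a shared pool would expose.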
Statistics (from its readme)

| | Total | ar | de | en | es | fa | ja | pl | ro | ta | tr | uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **dev** | | | | | | | | | | | | |
| Mentions (overall) | 2,991 | 318 | 326 | 316 | 311 | 72 | 310 | 304 | 145 | 312 | 262 | 315 |
| Mentions (cross-lingual) | 2,285 | 275 | 210 | 214 | 231 | 68 | 177 | 202 | 127 | 306 | 226 | 249 |
| **test** | | | | | | | | | | | | |
| Mentions (overall) | 14,624 | 1,501 | 1,551 | 1,490 | 1,552 | 458 | 1,519 | 1,562 | 672 | 1,567 | 1,215 | 1,537 |
| Mentions (cross-lingual) | 10,967 | 1,313 | 1,023 | 1,009 | 1,082 | 416 | 834 | 1,014 | 601 | 1,510 | 1,004 | 1,161 |
Corpus Statistics (for the above languages)

| Description language | total | ar | de | en | es | fa | ja | pl | ro | ta | tr | uk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Entities | 550,218 | 28,220 | 73,076 | 257,008 | 41,808 | 16,895 | 50,817 | 37,691 | 8,236 | 4,864 | 8,326 | 23,277 |
It seems we don't yet have an abstract class for NLI tasks. XNLI is a popular and reliable multilingual NLI dataset. Perhaps it's worth including.
> I agree with that. Do we need to clean up the existing machine-translated data, such as STSBenchmarkMultilingualSTS and MMarcoRetrieval (from cmteb)?
It would be great if you could add an annotation wherever a dataset is machine-translated. We might then remove those datasets at the end of MMTEB (if we decide we want to change the Chinese benchmark).
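As a rough illustration of that workflow, the sketch below tags tasks with a machine-translation flag and filters on it later. The `TaskInfo` class, the `machine_translated` field, and `HypotheticalNativeTask` are all assumptions made up for this example, not the actual MTEB metadata schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskInfo:
    """Hypothetical task record; not the real MTEB metadata class."""
    name: str
    machine_translated: bool

tasks = [
    # The two datasets named in this thread as machine-translated.
    TaskInfo("STSBenchmarkMultilingualSTS", machine_translated=True),
    TaskInfo("MMarcoRetrieval", machine_translated=True),
    # Placeholder for a dataset written natively in the target language.
    TaskInfo("HypotheticalNativeTask", machine_translated=False),
]

def drop_machine_translated(task_list):
    """Keep only tasks that are not flagged as machine-translated."""
    return [t for t in task_list if not t.machine_translated]

kept = [t.name for t in drop_machine_translated(tasks)]
print(kept)
```

Keeping the flag in the metadata means the datasets can stay in the benchmark for now and be dropped in one pass if that is what gets decided.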
Thanks for the wonderful stats @izhx. What would a cross-lingual mention mean in this case? Is it a text where the entities have multiple corresponding entries in different languages? Would the task then be to retrieve all relevant entries regardless of language? That seems odd to me; I would imagine that retrieving in the language of the text is the most valid setup (but I might be missing something).
I was also wondering about cross-lingual content, as it's a big topic for researchers around me. However, there are so many cross-lingual directions, and most datasets are typically English->XX or XX->English only.
It would definitely be nice to include, but it might be tricky to get decent coverage. Might be worth a broader discussion or just leaving them out entirely.
I would love a broader discussion of that field. Helps us make the right decisions. Feel free to start a discussion thread.
I have code for the mMarco integration. Do you think MMTEB would be happy to include this machine-translated data?
I can help contribute scores for some models. They are from the following work, and I can also help with new tests.
Are Multilingual Autoregressive Language Models Good Universal Embedders?