embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Average scores on multilingual tasks? #117

Closed. turian closed this issue 1 week ago

turian commented 1 year ago

The "Overall MTEB English leaderboard" has an averages column, which is very useful.

However, the leaderboards with many languages (Bitext Mining and Classification) have no average column, and with such large tables it is very difficult to assess multilingual performance.

Would it be possible to include a micro- and macro-average over the tasks? The macro-average would first average all tasks within a particular language, and then average over languages. The micro-average would average all tasks evenly, regardless of language.
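For reference, a minimal sketch of the two averages, assuming a hypothetical long-format table with model/task/language/score columns (the column names and numbers are illustrative, not MTEB's actual result schema):

```python
import pandas as pd

# Hypothetical long-format result table; names and scores are made up for illustration.
scores = pd.DataFrame(
    {
        "model": ["m1", "m1", "m1", "m2", "m2", "m2"],
        "task": ["AmazonReviews", "MassiveIntent", "BUCC", "AmazonReviews", "MassiveIntent", "BUCC"],
        "language": ["de", "de", "fr", "de", "de", "fr"],
        "score": [55.0, 60.0, 90.0, 50.0, 58.0, 85.0],
    }
)

# Macro-average: mean over tasks within each language, then mean over languages.
macro = (
    scores.groupby(["model", "language"])["score"].mean()
    .groupby(level="model").mean()
)

# Micro-average: every (task, language) score counts equally.
micro = scores.groupby("model")["score"].mean()

print(macro, micro, sep="\n")
```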

Muennighoff commented 1 year ago

Good point, have been thinking about how to improve that.

  1. For the macro-average, the problem is that some languages only have very few tasks, so averaging over languages will be quite distorted, I think. It is also ambiguous which language cross-lingual datasets should be attributed to.

  2. For the micro-average, IIUC you would first average all scores for each task (regardless of language) & then average the tasks. This seems reasonable & would give a multilingual average in some sense. However, most models only have English scores, which limits the usefulness, as we can't compute averages for the many models that are missing scores. To display it, one could partition the Overall tab into an English & a Multilingual tab. Also, need to think about whether or not to include the German clustering tab in that average 🧐

  3. An alternative is to gradually develop Overall tabs for individual languages; however, currently the limited number of languages with multiple tasks makes this not super useful, I think. Only for German might it be worth it, as there are Classification, Clustering & STS scores for it.

  4. Another alternative is to add an Average & Rank column for each task in their respective tabs (see the sketch below). This way one could inspect the Bitext Mining Average, German Clustering Average, Multilingual STS Average, etc. This is the most complex one from a coding perspective, but my favorite solution (together with 3. as more multilingual datasets are added).

What do you think?
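For option 4, a minimal sketch of per-tab Average & Rank columns, assuming each tab is already a models x tasks score DataFrame (the variable `tab` and all values are hypothetical):

```python
import pandas as pd

# Hypothetical per-tab leaderboard: rows are models, columns are that tab's tasks.
tab = pd.DataFrame(
    {"TaskA": [70.1, 65.3, 68.0], "TaskB": [60.2, 68.9, 66.5]},
    index=["model-1", "model-2", "model-3"],
)

tab["Average"] = tab.mean(axis=1)                   # mean over the tab's tasks
tab["Rank"] = tab["Average"].rank(ascending=False)  # 1 = best average within this tab
print(tab.sort_values("Rank"))
```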

turian commented 1 year ago

@Muennighoff yeah so I've been playing with this dataset a little too. I do think exposing some overall numbers is very important, because people want to make quick judgments of which baseline to start with, and only later refine the model choice as their evaluation methodology on their particular problem improves.

At the very least, having a grand score for Multilingual Classification and Multilingual Bitext Mining would be great.

There are two key issues as I see it: 1) missing data, and 2) how to weight / average across tasks.

BTW I'm using task to mean dataset here, not the task type (Classification / Bitext Mining).

Missing data

I think this is the more important of the two questions.

For the Classification leaderboard, there are 15 models and 117 tasks; 19% of the cells are NaN (missing).

For the Bitext Mining leaderboard, there are 65 models and 131 tasks; 64% of the cells are NaN (missing).

The general pattern for these leaderboards is that some models are evaluated on all the tasks, while others are evaluated on only a subset (but the incomplete model rows almost always skip the same tasks).

The simplest solution would be to pick one reference model; all models that don't have scores for all of the reference model's tasks would then be grayed out and ignored. Similarly, all tasks not scored by the reference model would be grayed out.
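A sketch of both the missing-cell count and the reference-model idea, assuming the leaderboard has been pulled into a hypothetical models x tasks DataFrame called `board` (names and numbers are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical models x tasks table with NaN where a model was not evaluated.
board = pd.DataFrame(
    {"TaskA": [70.0, np.nan, 64.0], "TaskB": [61.0, 59.0, np.nan], "TaskC": [80.0, np.nan, 77.0]},
    index=["model-1", "model-2", "model-3"],
)

# Overall share of missing cells (how numbers like the 19% / 64% above could be computed).
nan_fraction = board.isna().to_numpy().mean()
print(f"{nan_fraction:.0%} of cells are NaN")

# Reference-model filtering: keep only the reference model's tasks, and only
# the models that have scores for all of those tasks.
reference = "model-1"  # hypothetical choice of reference model
kept_tasks = board.columns[board.loc[reference].notna()]
kept_models = board.index[board[kept_tasks].notna().all(axis=1)]
print(board.loc[kept_models, kept_tasks])
```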

Another approach would be to impute the missing values and display them in a different color. I could explain this in more detail if you are interested.
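The imputation scheme isn't spelled out here. Purely as a placeholder, and not necessarily what was intended, a naive per-task mean imputation plus a mask for the color-coding could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical models x tasks table with missing cells.
board = pd.DataFrame(
    {"TaskA": [70.0, np.nan, 64.0], "TaskB": [61.0, 59.0, np.nan]},
    index=["model-1", "model-2", "model-3"],
)

imputed_mask = board.isna()          # remember which cells were filled in
filled = board.fillna(board.mean())  # naive per-task (column-mean) imputation
# `imputed_mask` could drive the "different color" display mentioned above.
print(filled)
```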

Task weighting

What's nice, and makes your job easier, is that within a particular task type all scores are of the same kind. This gets much more painful if you have, say, some Classification tasks scored with AUC and others with F1, etc.

Muennighoff commented 1 year ago

Have added average scores for each task tab! Let me know if you think it should be done differently!

KennethEnevoldsen commented 7 months ago

It would also be great to see a multilingual tab, both overall and for each category. I think a naive mean across languages (per task) and then across tasks is a great starting point. This definitely leads to overweighting certain languages (e.g., the Scandinavian languages), but they are already close to English, and the ranking (of multilingual models) generally seems to match quite well (comparing the Scandinavian Embedding Benchmark to MTEB), so it will likely have the expected effect of down-weighting English-only models.
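A sketch of that naive mean, assuming the same kind of hypothetical long-format model/task/language/score table as above (scores are made up):

```python
import pandas as pd

# Hypothetical long-format results; columns are illustrative, not MTEB's actual schema.
scores = pd.DataFrame(
    {
        "model": ["m1"] * 4,
        "task": ["STS22", "STS22", "MassiveIntent", "MassiveIntent"],
        "language": ["da", "sv", "da", "en"],
        "score": [62.0, 60.0, 70.0, 74.0],
    }
)

# Mean across languages within each task, then mean across tasks.
per_task = scores.groupby(["model", "task"])["score"].mean()
multilingual_avg = per_task.groupby(level="model").mean()
print(multilingual_avg)
```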

Muennighoff commented 7 months ago

I see so one for each language + one that averages across all languages for generic multilingual models. I think that makes sense, good point.

I would wait for a few more language tabs to come in before adding that - I think there will be French and German Overall tabs soon. It'd also be great to have an Overall tab for the Scandinavian languages (I think we need to reach 20-30 datasets for each for that though - currently especially retrieval is missing).

KennethEnevoldsen commented 7 months ago

Yeah, retrieval is definitely the hardest to find reasonable datasets for (at least if we avoid translations of existing datasets). I have at least one over at SEB (which hasn't been added to MTEB just yet, but that will come). We also have a few collaborations with Danish industry partners that might result in usable datasets.

There might also be other good sources I am missing that could be turned into new datasets. We experimented a bit with creating a dataset by using LLMs to generate questions for a given context and then re-retrieving that context.

shivrajjadhav733 commented 5 months ago

https://github.com/embeddings-benchmark/mteb/issues/240

Issue with leaderboard

KennethEnevoldsen commented 2 months ago

We are currently working on this in the latest release and have some ongoing discussions in #839. Will close this issue here to avoid duplicates.

Related to #752, #839