Paper segment: Task selection

KennethEnevoldsen commented 5 months ago

The goal of this segment is to create meaningful benchmark subsets with a minimal set of tasks.

I believe the steps are as follows:

1) construct an experimental subset. If people agree I can construct one from the Scandinavian Embedding Benchmark. I imagine that this is both small enough but also we assume there is cross-lingual transfer between languages (so some expected redundancy). 2) experiment with methods for subsampling (please provide suggestions):

Predicting Performance for Natural Language Processing Tasks
big-bench approach 3) Deciding on the best approach

If people agree with 1) we can start running the models on tasks = mteb.get_tasks(langauges = ["dan", "nno", "nob", "swe"]). Alternatively, we can also do it on MTEB (english) where we also expect some redundancy (e.g. between arxiv P2P and S2S). The cross-lingual transfer might be more relevant given the multilingual focus.

KennethEnevoldsen commented 5 months ago

So @vaibhavad had a look at the first paper and did a bit of a write-up (mix of their and my own ideas). Let me know what you think.

Following a similar approach as xia et al. (2020), we seek to estimate the performance of an unseen task $U_{ti}$ performance on another set of tasks $S{t_j}, t_j \in T \setminus \{t_i \}$. We will, in our case, leave the target task $t_i$ arbitrary, as it could be any downstream task of the model. Note that keeping the task arbitrary and estimating the expected performance on an unseen can be equated to estimating a generalization factor of the model (Chollet, ~2020). ...

Under this formulation, we then consider task selection as a feature reduction/redundancy problem, a well-formulated problem with machine learning (cite missing), where we seek to remove features that either share the same information or aren’t predictive of the target variable. We treat the performance of existing tasks as features. Note the approach used by Xia et al. (2020) follows a ‘forward feature selection’ approach.

This approach leaves a whole slew of known approaches e.g. feature specific approaches e.g. filtering based on Mutual information, feature correlation, or variance within a feature. We don't even need the model in this case for anything other than the argument.

However, we might want to make the model anyway as it might help us in the aggregation step -- predictive performance on task U_i is really what we are interested in when evaluating -- here I don't believe that following Xia et al. is very promising. Their approach seems to avoid the notion that if task t1 is predictive of the model performance on task t2, then I would also imagine it more likely to be predictive of t3.

here I would probably suggest a hierarchical (bayesian) model. I have some written notes on this but I want to get your opinion about the general idea before moving on.

vaibhavad commented 5 months ago

Adding another related paper

Choosing Transfer Languages for Cross-Lingual Learning

vaibhavad commented 5 months ago

@KennethEnevoldsen,

I read the first paper too and I agree with the assessment:

here I don't believe that following Xia et al. is very promising

Xia et al's approach is tested in a scenario with many assumptions, most of which are not true in our MMTEB scenario. It assumes access to past training experimental records, training data, model details, all of which are not necessarily available in our setting.

predictive performance on task U_i is really what we are interested in when evaluating

Isn't the goal to find a good representative subset? I believe task performance prediction is kind of an implicit goal. We want to model something similar to that mentioned in Section 5 of Xia et al (What Datasets Should We Test On?), but using different predictors than those used in the paper.

Regarding evaluation: It would be best to define out criteria early on. I believe that model rank correlation with the full set of tasks is a good evaluation metric to evaluate different subsets of tasks. Let me know if you have any more suggestions.

KennethEnevoldsen commented 5 months ago

Regarding evaluation: It would be best to define out criteria early on. I believe that model rank correlation with the full set of tasks is a good evaluation metric to evaluate different subsets of tasks. Let me know if you have any more suggestions.

I think that is a great place to start. For those tasks where we have multiple scores (e.g. classification and clustering), we can use significant rank (if two models very similar but one task places then 1,2 and another 2,1, but without meaningful difference we don't care)

predictive performance on task U_i is really what we are interested in when evaluating

Isn't the goal to find a good representative subset? I believe task performance prediction is kind of an implicit goal. We want to model something similar to that mentioned in Section 5 of Xia et al (What Datasets Should We Test On?), but using different predictors than those used in the paper.

I believe so, but a good representative subset should be able to predict the performance on an unseen task otherwise we believe that the benchmark is not representative (or that the task is poorly specified or otherwise bad).

vaibhavad commented 5 months ago

So I ran some experiments, I'll detail down the setup and the results. I'll love to get some input on it. All the code and results are in mteb/task-metric-selection fork.

The goal is to find tasks (and as a by-product models), for which the performance can be predicted. As a case study, I'm doing it on current MTEB version of 56 tasks.

First, I scraped the MTEB leaderboard, and filtered models for which we have scores for all 56 tasks. This resulted in 120 models. Then, I formulate the predictive problem as follows - given the scores of a set of models on all 56 tasks, and given the scores of a new model on 55 tasks, can we predict the performance on the left-out task.

For each task $t$ and model $M$ pair, I trained a linear regression on the scores of other remaining models on remaining 55 tasks, and predicted the performance of $M$ on $t$, using other task scores as predictors. Then I calculated the MSE with the actual score of $M$ on $t$.

This is repeated for every task $t$ and model $M$ pair, hence 56 X 120 runs. Once we have MSE for model and task pair, the average MSE per task can give an indication on how easy it is to predict that task performance. A similar ranking can be devised for the models as well.

Here are the task and model rankings, according at average MSE scores	Task	Avg. MSE
STS14	0.981216
AskUbuntuDupQuestions	1.21632
TwitterURLCorpus	1.26244
BiorxivClusteringP2P	1.45046
StackOverflowDupQuestions	1.53067
STS16	1.62157
STS13	1.73542
ArxivClusteringP2P	1.91204
SummEval	2.00655
MedrxivClusteringP2P	2.03356

Model	Avg. MSE
mixedbread-ai/mxbai-embed-large-v1	0.685644
w601sxs/b1ade-embed	1.0392
nomic-embed-text-v1.5-128	1.08858
avsolatorio/GIST-small-Embedding-v0	1.11598
thenlper/gte-base	1.11654
nomic-ai/nomic-embed-text-v1	1.24907
corto-ai/nomic-embed-text-v1	1.24907
nomic-embed-text-v1.5-512	1.39268
avsolatorio/GIST-large-Embedding-v0	1.53398
Muennighoff/SGPT-2.7B-weightedmean-msmarco-specb-bitfit	1.56351

I'll further use this scraped leaderboard to implement Borda’s count. for #839 and see how much the rankings change

KennethEnevoldsen commented 5 months ago

Looking great @vaibhavad. A few points (partly based on a discussion with @x-tabdeveloping yesterday).

For each task $t$ and model $M$ pair, I trained a linear regression on the scores of other remaining models on remaining 55 tasks, and predicted the performance of $M$ on $t$, using other task scores as predictors. Then I calculated the MSE with the actual score of $M$ on $t$.

We assume here that the betas for each task are independent (a reasonable, but probably naive assumption). We could model it in a hierarchical fashion such that if task $t_A$ is important for predicting $t_B$ it is probably also important for predicting task $t_C$
Using MSE seems problematic as some tasks might have more noise/variance (thus increasing MSE). We might instead be interested in predicting the rank (maybe spearman's between predicted values and observed is better along with $R^2$)

Regarding which model is the most predictable, I am actually quite happy about this. I believe it could be another interesting metric (it is essentially a robustness/reliability metric - I could imagine e.g. prompting models will perform poorly on this).

KennethEnevoldsen commented 5 months ago

So exactly the same analysis but using a different comparison metrics (MSE, MSE scaled, Spearman and Pearson). Also did one using XGBoost.

Task prediction from Linear Model

``` MSE (ascending=True) 0 STS14 0.981216 AskUbuntuDupQuestions 1.216323 TwitterURLCorpus 1.262440 BiorxivClusteringP2P 1.450457 StackOverflowDupQuestions 1.530674 STS16 1.621571 STS13 1.735421 ArxivClusteringP2P 1.912039 SummEval 2.006546 MedrxivClusteringP2P 2.033557 ------ MSE with zscore (ascending=True) 0 STS14 0.028116 STS13 0.057351 StackOverflowDupQuestions 0.068180 NQ 0.068667 AskUbuntuDupQuestions 0.072197 ArxivClusteringP2P 0.075123 STS16 0.079436 BiorxivClusteringP2P 0.081859 AmazonPolarityClassification 0.084486 MassiveIntentClassification (en) 0.086144 ------ Spearman (ascending=False) 0 STS14 0.976071 BiorxivClusteringS2S 0.975782 ImdbClassification 0.970131 MassiveIntentClassification (en) 0.967380 ArxivClusteringS2S 0.958237 StackExchangeClustering 0.956647 AskUbuntuDupQuestions 0.955223 ArxivClusteringP2P 0.952062 AmazonReviewsClassification (en) 0.951333 RedditClustering 0.950468 ------ Pearson (ascending=False) 0 STS14 0.985942 STS13 0.971325 StackOverflowDupQuestions 0.965910 NQ 0.965666 AskUbuntuDupQuestions 0.963902 ArxivClusteringP2P 0.962439 STS16 0.960282 BiorxivClusteringP2P 0.959071 AmazonPolarityClassification 0.957757 MassiveIntentClassification (en) 0.956928 ------ ```

Task prediction from XGBoost

``` MSE (ascending=True) 0 MindSmallReranking 0.937350 SummEval 0.991341 MedrxivClusteringS2S 1.602247 MedrxivClusteringP2P 1.670227 BiorxivClusteringP2P 1.853750 ArxivClusteringP2P 1.873055 BiorxivClusteringS2S 2.086671 MTOPDomainClassification (en) 2.106947 AskUbuntuDupQuestions 2.114875 MassiveIntentClassification (en) 2.247917 ------ MSE with zscore (ascending=True) 0 NQ 0.047519 AmazonPolarityClassification 0.049662 FiQA2018 0.060705 ImdbClassification 0.062859 MassiveIntentClassification (en) 0.069679 AmazonReviewsClassification (en) 0.072001 BiorxivClusteringS2S 0.073636 ArxivClusteringP2P 0.076457 RedditClustering 0.081129 TwentyNewsgroupsClustering 0.082345 ------ Spearman (ascending=False) 0 AmazonPolarityClassification 0.981132 MassiveIntentClassification (en) 0.974150 NFCorpus 0.970216 ImdbClassification 0.967747 AmazonReviewsClassification (en) 0.966336 STS14 0.965624 RedditClustering 0.964495 TwentyNewsgroupsClustering 0.964438 FiQA2018 0.960536 BiorxivClusteringS2S 0.960529 ------ Pearson (ascending=False) 0 NQ 0.976241 AmazonPolarityClassification 0.975169 FiQA2018 0.969648 ImdbClassification 0.968571 MassiveIntentClassification (en) 0.965161 AmazonReviewsClassification (en) 0.963999 BiorxivClusteringS2S 0.963182 ArxivClusteringP2P 0.961772 RedditClustering 0.959435 TwentyNewsgroupsClustering 0.958828 ------ ```

A more general point is of course that we might have STS14 being very corraleted with STS13 and thus might not want to remove both.

KennethEnevoldsen commented 5 months ago

To get some more intuition we have:

Task variance

![70ed8853-0e0e-488c-9df2-361d06897cc6](https://github.com/embeddings-benchmark/mteb/assets/23721977/cbf8b737-2308-4ac4-9f26-514afb3941a5)

Pearson correlation matrix

![391644b9-b6c2-46db-b4f3-0c1a51ac0ef4](https://github.com/embeddings-benchmark/mteb/assets/23721977/18702c29-57e9-4e0d-a885-f2afe35898a9)

Spearman correlation matrix

![21782f39-fd68-4ea9-bd83-d767d0ba97e1](https://github.com/embeddings-benchmark/mteb/assets/23721977/23821740-1165-4a05-8077-f6cedb47d586)

The work is also available on this branch.

KennethEnevoldsen commented 5 months ago

@vaibhavad I did some additional analysis + reading:

1) I generally found that using a beta regression (I just created one in PyMC) generally seems to result in better predictions (0.90-0.92 vs 0.96-0.97) as compared to a linear model. However, this was only for a single task (WIP in the branch above)

2) Did a plot where I plot the scores on each task according to model rank on MTEB to examine potential issues. I believe it gives a decent overview:

plot

I have notably highlighted potential issues with some of the tasks (lack of correlation with benchmark or little variance)

3) I read through BigBench. It seems like their task selection for the lite version is predominantly heuristics based:

In selecting BBL [BIG-bench Lite], core contributors went through a selection process of the tasks based on task keyword coverage and inclusion of certain task types such as code, non-English capabilities and measuring bias.

However they do have some selection criteria, some of which we might apply as well:

Specificity: Tasks should aim to cleanly capture some specific capability of language models.
Difficulty: Tasks must not be fully solvable by existing language models.
Novelty: Tasks should fill a gap in coverage by BIG-bench

(some have been implicit and explicit when submitting a dataset)

vaibhavad commented 5 months ago

@KennethEnevoldsen,

Thanks for sharing the plots, it is very helpful! I think overall now we have a neat set of ideas, and a combination of all of these can be used for the final task selection.

First part of selection set can be manual, to ensure representation of languages and task type, similar to Big-Bench. This can be a small set of whitelisted tasks, which cannot be removed by successive automatic filtering steps.
Next, we can recursively take out tasks for which it is easy to predict the performance. For this let's say we lock down a metric (out of Pearson, Spearman, etc), then we can choose the best model (Linear, XGBoost, PyMC etc). The baselines for justifying this approach can be random search and top-k most correlated tasks (when selecting k tasks). Using Linear Regression with Pearson metric recursively on current MTEB led to these 10 tasks being removed, with final correlation being ~0.95 (Can be replicated by running the script here)
```
STS14
BiorxivClusteringS2S
ImdbClassification
MassiveIntentClassification (en)
ArxivClusteringS2S
RedditClustering
NQ
AmazonReviewsClassification (en)
AskUbuntuDupQuestions
MedrxivClusteringP2P
```
The list looks balanced in terms of task types.
The last part is what I am unclear about. This involves tasks with low variance, like SummEval. It is unlikely that this task will be removed in the above step, as the low variance makes it hard to predict. Furthermore, as we discussed, low variance can mean either noise or the task being difficult. I guess we can inspect the lowest 10% of tasks manually to decide whether to keep them or not.

Do you think we should add anything else in the selection criteria? If not, we can start discussing the specifics.

KennethEnevoldsen commented 5 months ago

Generally agree

we could introduce a step 1.5) where we find highly correlated clusters and manually select the "best" one (smaller, open license, higher quality, better documented) and add it to the "keep" list.

For step 3) I really just want to "in a subjective fashion" differentiate between noisy tasks or difficult tasks. We can only differentiate between these two manually atm. (though I think there is an interesting modelling avenue here).

vaibhavad commented 5 months ago

Some more details on Step 2.

I think it is important to differentiate between two different correlations.

a) The correlation for a specific task between the rankings produced by the predictive model and the actual rankings of the task. This is what we are using to find most predictive task.

b) The correlation between the rankings from the overall score of remaining tasks and the overall scores of the entire benchmark. This is what we actually care about - finding the smaller subset of tasks that highly correlates with the entire set of tasks. This may be loosely correlated with 1)

As we recursively apply the filtering technique based on 2 to remove 50 out of 56 tasks, this is how the two correlations change (Using linear regression and spearman for both a)and b) )

The tasks removed in order are:

STS14
BiorxivClusteringS2S
ImdbClassification
MassiveIntentClassification (en)
ArxivClusteringS2S
RedditClustering
NQ
AmazonReviewsClassification (en)
AskUbuntuDupQuestions
MedrxivClusteringP2P
StackOverflowDupQuestions
NFCorpus
BiorxivClusteringP2P
MassiveScenarioClassification (en)
STS16
TwitterSemEval2015
SciFact
DBPedia
STSBenchmark
Banking77Classification
SciDocsRR
RedditClusteringP2P
STS15
TwentyNewsgroupsClustering
FiQA2018
FEVER
STS13
HotpotQA
MedrxivClusteringS2S
ArguAna
TRECCOVID
ArxivClusteringP2P
MTOPDomainClassification (en)
SICK-R
CQADupstackRetrieval
ClimateFEVER
TweetSentimentExtractionClassification
StackExchangeClusteringP2P
EmotionClassification
TwitterURLCorpus
QuoraRetrieval
SCIDOCS
AmazonPolarityClassification
StackExchangeClustering
BIOSSES
MSMARCO
AmazonCounterfactualClassification (en)
MindSmallReranking
SprintDuplicateQuestions
STS17 (en-en)

KennethEnevoldsen commented 5 months ago

Ahh yes totally agree!

Regarding MTEB specifically. We might want to replace the older versions of the task (e.g. ArxivClusteringP2P > ArxivClusteringP2P.v2), while still correlating with the original scores.

So 1) replace, 2) correlate, 3) reduce, 3) repeat from 2

Otherwise I believe this section is generally at a point where we will need the results to continue. WDYT?

vaibhavad commented 5 months ago

I agree, we can pick this up again once we have the results from the larger benchmark.

bwanglzu commented 4 months ago

hi @KennethEnevoldsen thanks for your initiative on doing it!

at Jina AI we want to do the similar: work a mini version of selected tasks, including HotpotQA/Fever/ClimateFever/NQ/MSMARCO and MIRACL to ensure all evaluation on all languages can be finished in a reasonable amount of time (let's say 15 minutes per task per lang).

we don't have a problem with other tasks such as clustering/classification/sts/reranking since they are relatively fast and not that crucial for our in house evaluation.

currently we want to try different sampling strategy from the original set and want to use a range of dense/colbert models to validate the correlation.

@tomaarsen @robro612

KennethEnevoldsen commented 4 months ago

Hi @bwanglzu sounds like what we are currently doing with #836 (cc @orionw and @vaibhavad for the latest update). We will have an update on that quite soon, but the approach samples hard negatives from 3 models and we have already shown for 3 sample tasks that enough hard negatives from 1 model provide an almost perfect correlation with the original scores. However, we are more than happy to implement more naive sampling strategies. E.g. for Clustering tasks, we show that using a random sample of 4% of the original data gives a close-to-perfect reconstruction of model ranks. We plan to apply this method to the largest datasets within MMTEB.

@orionw and @vaibhavad do correct me if something is missing.

bwanglzu commented 4 months ago

perfect! thanks @KennethEnevoldsen ! Just wondering do you have a communication channel like slack/discord or whatever we can jump in and discuss (maybe contribute)?

KennethEnevoldsen commented 4 months ago

You and @robro612 should have gotten an invite

gentaiscool commented 2 months ago

Hi @KennethEnevoldsen I am also happy to be added to the communication channel you used for discussion if you don't mind.

KennethEnevoldsen commented 2 months ago

@gentaiscool I sadly can't add you, but I believe @Muennighoff can. If this is about specific concerns feel free to reach out to me by mail as well: kenneth.enevoldsen@cas.au.dk

KennethEnevoldsen commented 2 months ago

Will close this for now - the section in the paper is written and notebooks are available under scripts

embeddings-benchmark / mteb

Paper segment: Task selection #837