So @vaibhavad had a look at the first paper and did a bit of a write-up (mix of their and my own ideas). Let me know what you think.
Following a similar approach to Xia et al. (2020), we seek to estimate the performance on an unseen task $U_{t_i}$ from the performance on a set of seen tasks $S_{t_j}$, $t_j \in T \setminus \{t_i\}$. We will, in our case, leave the target task $t_i$ arbitrary, as it could be any downstream task of the model. Note that keeping the task arbitrary and estimating the expected performance on an unseen task can be equated to estimating a generalization factor of the model (Chollet, ~2020). ...
Under this formulation, we then consider task selection as a feature reduction/redundancy problem, a well-studied problem in machine learning (cite missing), where we seek to remove features that either share the same information or are not predictive of the target variable. We treat the performance on existing tasks as features. Note that the approach used by Xia et al. (2020) follows a 'forward feature selection' approach.
This framing opens up a whole slew of known approaches, e.g. feature-specific filters based on mutual information, feature correlation, or the variance within a feature. In this case we don't even need a model for anything other than the argument.
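To make the filter idea concrete, here is a minimal sketch (assuming a hypothetical `mteb_scores.csv` with one row per model and one column per task; both thresholds are arbitrary placeholders, not tuned values):

```python
import pandas as pd

# Hypothetical table of leaderboard results: rows = models, columns = tasks.
scores = pd.read_csv("mteb_scores.csv", index_col=0)

# 1) Drop near-constant tasks: they barely change the ranking of models.
scores = scores.loc[:, scores.std() > 1.0]  # placeholder variance threshold

# 2) Greedily drop one task from every highly correlated pair.
corr = scores.corr(method="spearman").abs()
to_drop = set()
for i, t1 in enumerate(corr.columns):
    if t1 in to_drop:
        continue
    for t2 in corr.columns[i + 1:]:
        if t2 not in to_drop and corr.loc[t1, t2] > 0.95:  # placeholder correlation threshold
            to_drop.add(t2)  # keep t1, drop its near-duplicate

reduced = scores.drop(columns=sorted(to_drop))
print(f"Kept {reduced.shape[1]} of {corr.shape[0]} tasks")
```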
However, we might want to build the model anyway, as it might help us in the aggregation step -- predictive performance on task U_i is really what we are interested in when evaluating. Here I don't believe that following Xia et al. is very promising: their approach seems to ignore the notion that if task t1 is predictive of the model performance on task t2, then I would also expect it to be more likely to be predictive of t3.
Here I would probably suggest a hierarchical (Bayesian) model. I have some written notes on this, but I want to get your opinion on the general idea before moving on.
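For context, the rough shape I have in mind is something like the sketch below (a minimal PyMC sketch on simulated stand-in data; the additive model-plus-task structure and all priors are assumptions, not a worked-out proposal):

```python
import numpy as np
import pymc as pm

# Simulated stand-in data in long format: one observed main score per (model, task) pair.
n_models, n_tasks = 120, 56
rng = np.random.default_rng(42)
model_idx = np.repeat(np.arange(n_models), n_tasks)
task_idx = np.tile(np.arange(n_tasks), n_models)
y = rng.beta(8, 4, size=n_models * n_tasks)  # scores rescaled to (0, 1)

with pm.Model() as hierarchical:
    # Shared intercept plus partially pooled model "ability" and task "difficulty" effects.
    mu = pm.Normal("mu", 0.0, 1.0)
    sigma_model = pm.HalfNormal("sigma_model", 1.0)
    sigma_task = pm.HalfNormal("sigma_task", 1.0)
    ability = pm.Normal("ability", 0.0, sigma_model, shape=n_models)
    difficulty = pm.Normal("difficulty", 0.0, sigma_task, shape=n_tasks)
    sigma_obs = pm.HalfNormal("sigma_obs", 0.1)

    # Logistic link keeps the expected score between 0 and 1.
    theta = pm.math.invlogit(mu + ability[model_idx] + difficulty[task_idx])
    pm.Normal("obs", mu=theta, sigma=sigma_obs, observed=y)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```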
Adding another related paper
@KennethEnevoldsen,
I read the first paper too and I agree with the assessment:
here I don't believe that following Xia et al. is very promising
Xia et al.'s approach is tested in a scenario with many assumptions, most of which do not hold in our MMTEB setting: it assumes access to past experimental training records, training data, and model details, none of which are necessarily available in our setting.
predictive performance on task U_i is really what we are interested in when evaluating
Isn't the goal to find a good representative subset? I believe task performance prediction is kind of an implicit goal. We want to model something similar to that mentioned in Section 5 of Xia et al. (What Datasets Should We Test On?), but using different predictors than those used in the paper.
Regarding evaluation: It would be best to define our criteria early on. I believe that model rank correlation with the full set of tasks is a good metric to evaluate different subsets of tasks. Let me know if you have any more suggestions.
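For concreteness, the metric could be computed roughly as below (a sketch assuming a hypothetical `mteb_scores.csv` of model-by-task scores and a placeholder candidate subset):

```python
import pandas as pd
from scipy.stats import spearmanr

scores = pd.read_csv("mteb_scores.csv", index_col=0)  # rows = models, columns = tasks
candidate_subset = ["STS12", "SciFact", "Banking77Classification"]  # placeholder subset

full_score = scores.mean(axis=1)                      # average over all tasks
subset_score = scores[candidate_subset].mean(axis=1)  # average over the candidate subset

rho, _ = spearmanr(full_score, subset_score)
print(f"Model rank correlation with the full benchmark: {rho:.3f}")
```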
I think that is a great place to start. For those tasks where we have multiple scores (e.g. classification and clustering), we can use significant rank: if two models are very similar and one task ranks them 1, 2 while another ranks them 2, 1, but without a meaningful difference, we don't care.
predictive performance on task U_i is really what we are interested in when evaluating
Isn't the goal to find a good representative subset? I believe task performance prediction is kind of an implicit goal. We want to model something similar to that mentioned in Section 5 of Xia et al (What Datasets Should We Test On?), but using different predictors than those used in the paper.
I believe so, but a good representative subset should be able to predict the performance on an unseen task; otherwise, we believe that the benchmark is not representative (or that the task is poorly specified or otherwise bad).
So I ran some experiments; I'll detail the setup and the results below, and I'd love to get some input on them. All the code and results are in the mteb/task-metric-selection fork.
The goal is to find tasks (and, as a by-product, models) for which the performance can be predicted. As a case study, I'm doing it on the current MTEB version with 56 tasks.
First, I scraped the MTEB leaderboard and filtered for models for which we have scores on all 56 tasks. This resulted in 120 models. Then, I formulate the predictive problem as follows: given the scores of a set of models on all 56 tasks, and given the scores of a new model on 55 tasks, can we predict the performance on the left-out task?
For each task $t$ and model $M$ pair, I trained a linear regression on the scores of the remaining models on the remaining 55 tasks, and predicted the performance of $M$ on $t$, using the other task scores as predictors. Then I calculated the MSE with the actual score of $M$ on $t$.
This is repeated for every task $t$ and model $M$ pair, hence 56 × 120 runs. Once we have the MSE for every model-task pair, the average MSE per task gives an indication of how easy it is to predict that task's performance. A similar ranking can be devised for the models as well.
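For concreteness, a minimal sketch of this double leave-one-out setup (the CSV name is a hypothetical stand-in for the scraped leaderboard; the actual code is in the fork linked above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

scores = pd.read_csv("mteb_leaderboard_scores.csv", index_col=0)  # 120 models x 56 tasks

squared_errors = pd.DataFrame(index=scores.index, columns=scores.columns, dtype=float)

for task in scores.columns:
    X_all = scores.drop(columns=[task])  # the other 55 tasks act as predictors
    y_all = scores[task]
    for model in scores.index:
        # Fit on every other model, then predict the held-out model's score on `task`.
        reg = LinearRegression().fit(X_all.drop(index=model), y_all.drop(index=model))
        pred = reg.predict(X_all.loc[[model]])[0]
        squared_errors.loc[model, task] = (pred - y_all.loc[model]) ** 2

print(squared_errors.mean(axis=0).sort_values().head(10))  # most predictable tasks
print(squared_errors.mean(axis=1).sort_values().head(10))  # most predictable models
```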
Here are the task and model rankings, according to average MSE scores.

| Task | Avg. MSE |
|---|---|
| STS14 | 0.981216 |
| AskUbuntuDupQuestions | 1.21632 |
| TwitterURLCorpus | 1.26244 |
| BiorxivClusteringP2P | 1.45046 |
| StackOverflowDupQuestions | 1.53067 |
| STS16 | 1.62157 |
| STS13 | 1.73542 |
| ArxivClusteringP2P | 1.91204 |
| SummEval | 2.00655 |
| MedrxivClusteringP2P | 2.03356 |
| Model | Avg. MSE |
|---|---|
| mixedbread-ai/mxbai-embed-large-v1 | 0.685644 |
| w601sxs/b1ade-embed | 1.0392 |
| nomic-embed-text-v1.5-128 | 1.08858 |
| avsolatorio/GIST-small-Embedding-v0 | 1.11598 |
| thenlper/gte-base | 1.11654 |
| nomic-ai/nomic-embed-text-v1 | 1.24907 |
| corto-ai/nomic-embed-text-v1 | 1.24907 |
| nomic-embed-text-v1.5-512 | 1.39268 |
| avsolatorio/GIST-large-Embedding-v0 | 1.53398 |
| Muennighoff/SGPT-2.7B-weightedmean-msmarco-specb-bitfit | 1.56351 |
I'll further use this scraped leaderboard to implement Borda count for #839 and see how much the rankings change.
Looking great @vaibhavad. A few points (partly based on a discussion with @x-tabdeveloping yesterday).
For each task $t$ and model $M$ pair, I trained a linear regression on the scores of the remaining models on the remaining 55 tasks, and predicted the performance of $M$ on $t$, using the other task scores as predictors. Then I calculated the MSE with the actual score of $M$ on $t$.
Regarding which model is the most predictable, I am actually quite happy about this. I believe it could be another interesting metric (it is essentially a robustness/reliability metric - I could imagine e.g. prompting models will perform poorly on this).
So, exactly the same analysis but using different comparison metrics (MSE, scaled MSE, Spearman, and Pearson). I also did one run using XGBoost.
A more general point is of course that STS14 might be very correlated with STS13, and thus we might not want to remove both.
To get some more intuition we have:
The work is also available on this branch.
@vaibhavad I did some additional analysis + reading:
1) I generally found that using a beta regression (I just created one in PyMC) seems to result in better predictions (0.90-0.92 vs 0.96-0.97) compared to a linear model. However, this was only for a single task (WIP in the branch above). A minimal sketch of what I mean is included after this list.
2) I made a plot of the scores on each task against model rank on MTEB to examine potential issues. I believe it gives a decent overview:
I have notably highlighted potential issues with some of the tasks (lack of correlation with benchmark or little variance)
3) I read through BigBench. It seems like their task selection for the lite version is predominantly heuristics based:
In selecting BBL [BIG-bench Lite], core contributors went through a selection process of the tasks based on task keyword coverage and inclusion of certain task types such as code, non-English capabilities and measuring bias.
However, they do have some selection criteria, some of which we might apply as well:
(some of these have already been applied, implicitly or explicitly, when submitting a dataset)
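Returning to point 1), here is a minimal sketch of the kind of beta regression I mean, on simulated stand-in data for a single held-out task (the priors and logit link are assumptions; the actual model is a WIP in the branch above):

```python
import numpy as np
import pymc as pm

# Stand-in data for one leave-one-out split: scores rescaled to (0, 1).
rng = np.random.default_rng(0)
X = rng.beta(8, 4, size=(119, 55))  # 119 training models x 55 predictor tasks
y = rng.beta(8, 4, size=119)        # their scores on the held-out task

with pm.Model() as beta_reg:
    intercept = pm.Normal("intercept", 0.0, 1.0)
    coefs = pm.Normal("coefs", 0.0, 0.5, shape=X.shape[1])
    kappa = pm.HalfNormal("kappa", 50.0)  # concentration of the Beta likelihood

    # Logit link keeps the predicted mean score inside (0, 1).
    mu = pm.math.invlogit(intercept + pm.math.dot(X, coefs))
    pm.Beta("obs", alpha=mu * kappa, beta=(1 - mu) * kappa, observed=y)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```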
@KennethEnevoldsen,
Thanks for sharing the plots, they are very helpful! I think we now have a neat set of ideas overall, and a combination of all of these can be used for the final task selection.
The first part of the selection can be manual, to ensure representation of languages and task types, similar to Big-Bench. This can be a small set of whitelisted tasks, which cannot be removed by the successive automatic filtering steps.
Next, we can recursively take out tasks for which it is easy to predict the performance. For this, let's say we lock down a metric (out of Pearson, Spearman, etc.); then we can choose the best model (linear, XGBoost, PyMC, etc.). The baselines for justifying this approach can be random search and the top-k most correlated tasks (when selecting k tasks). Using linear regression with the Pearson metric recursively on the current MTEB led to the 10 tasks below being removed, with a final correlation of ~0.95 (can be replicated by running the script here; a sketch of the recursive loop follows the list):
STS14
BiorxivClusteringS2S
ImdbClassification
MassiveIntentClassification (en)
ArxivClusteringS2S
RedditClustering
NQ
AmazonReviewsClassification (en)
AskUbuntuDupQuestions
MedrxivClusteringP2P
The list looks balanced in terms of task types.
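As referenced above the list, a minimal sketch of the recursive elimination loop (the CSV name is a hypothetical stand-in for the scraped leaderboard, and 5-fold cross-validation stands in for whichever split the actual script uses):

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

scores = pd.read_csv("mteb_leaderboard_scores.csv", index_col=0)  # rows = models, cols = tasks
remaining = list(scores.columns)
removed = []

for _ in range(10):  # drop the 10 most predictable tasks
    predictability = {}
    for task in remaining:
        predictors = [t for t in remaining if t != task]
        # Out-of-fold predictions of this task's scores from the other remaining tasks.
        preds = cross_val_predict(LinearRegression(), scores[predictors], scores[task], cv=5)
        predictability[task] = pearsonr(preds, scores[task])[0]
    most_predictable = max(predictability, key=predictability.get)
    remaining.remove(most_predictable)
    removed.append(most_predictable)

print("Removed in order:", removed)
# Sanity check: model ranks from the reduced set vs. the full benchmark.
rho, _ = spearmanr(scores[remaining].mean(axis=1), scores.mean(axis=1))
print(f"Rank correlation with the full benchmark: {rho:.3f}")
```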
The last part is what I am unclear about. This involves tasks with low variance, like SummEval. It is unlikely that this task will be removed in the above step, as the low variance makes it hard to predict. Furthermore, as we discussed, low variance can mean either noise or the task being difficult. I guess we can inspect the lowest 10% of tasks manually to decide whether to keep them or not.
Do you think we should add anything else to the selection criteria? If not, we can start discussing the specifics.
Generally agree
We could introduce a step 1.5) where we find highly correlated clusters of tasks and manually select the "best" one (smaller, open license, higher quality, better documented) and add it to the "keep" list.
For step 3), I really just want to differentiate, in a subjective fashion, between noisy tasks and difficult tasks. We can only differentiate between these two manually atm (though I think there is an interesting modelling avenue here).
Some more details on Step 2.
I think it is important to differentiate between two different correlations.
a) The correlation, for a specific task, between the rankings produced by the predictive model and the actual rankings on that task. This is what we are using to find the most predictable task.
b) The correlation between the rankings from the overall score of the remaining tasks and the overall scores of the entire benchmark. This is what we actually care about: finding a smaller subset of tasks that correlates highly with the entire set of tasks. This may be only loosely correlated with a).
As we recursively apply the filtering technique from step 2 to remove 50 out of 56 tasks, this is how the two correlations change (using linear regression and Spearman for both a) and b)):
The tasks removed in order are:
STS14
BiorxivClusteringS2S
ImdbClassification
MassiveIntentClassification (en)
ArxivClusteringS2S
RedditClustering
NQ
AmazonReviewsClassification (en)
AskUbuntuDupQuestions
MedrxivClusteringP2P
StackOverflowDupQuestions
NFCorpus
BiorxivClusteringP2P
MassiveScenarioClassification (en)
STS16
TwitterSemEval2015
SciFact
DBPedia
STSBenchmark
Banking77Classification
SciDocsRR
RedditClusteringP2P
STS15
TwentyNewsgroupsClustering
FiQA2018
FEVER
STS13
HotpotQA
MedrxivClusteringS2S
ArguAna
TRECCOVID
ArxivClusteringP2P
MTOPDomainClassification (en)
SICK-R
CQADupstackRetrieval
ClimateFEVER
TweetSentimentExtractionClassification
StackExchangeClusteringP2P
EmotionClassification
TwitterURLCorpus
QuoraRetrieval
SCIDOCS
AmazonPolarityClassification
StackExchangeClustering
BIOSSES
MSMARCO
AmazonCounterfactualClassification (en)
MindSmallReranking
SprintDuplicateQuestions
STS17 (en-en)
Ahh yes totally agree!
Regarding MTEB specifically: we might want to replace the older versions of tasks (e.g. ArxivClusteringP2P → ArxivClusteringP2P.v2), while still correlating with the original scores.
So: 1) replace, 2) correlate, 3) reduce, 4) repeat from 2.
Otherwise I believe this section is generally at a point where we will need the results to continue. WDYT?
I agree, we can pick this up again once we have the results from the larger benchmark.
Hi @KennethEnevoldsen, thanks for your initiative on this!
At Jina AI we want to do something similar: work on a mini version of selected tasks, including HotpotQA/FEVER/ClimateFEVER/NQ/MSMARCO and MIRACL, to ensure that evaluation on all languages can be finished in a reasonable amount of time (let's say 15 minutes per task per language).
We don't have a problem with the other tasks such as clustering/classification/STS/reranking, since they are relatively fast and not that crucial for our in-house evaluation.
Currently we want to try different sampling strategies on the original set and use a range of dense/ColBERT models to validate the correlation.
@tomaarsen @robro612
Hi @bwanglzu, this sounds like what we are currently doing in #836 (cc @orionw and @vaibhavad for the latest update). We will have an update on that quite soon, but the approach samples hard negatives from 3 models, and we have already shown for 3 sample tasks that enough hard negatives from 1 model provide an almost perfect correlation with the original scores. However, we are more than happy to implement more naive sampling strategies. E.g. for clustering tasks, we show that using a random sample of 4% of the original data gives a close-to-perfect reconstruction of model ranks. We plan to apply this method to the largest datasets within MMTEB.
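To illustrate what a naive strategy could look like, here is a rough sketch of such a random subsample for a clustering-style dataset (the function and its arguments are purely illustrative, not the actual downsampling code from #836; the 4% default matches the number above):

```python
import random

def downsample(sentences: list[str], labels: list[str], fraction: float = 0.04, seed: int = 42):
    """Keep a random `fraction` of the documents while preserving the sentence-label pairing."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(sentences) * fraction))
    idx = rng.sample(range(len(sentences)), k=n_keep)
    return [sentences[i] for i in idx], [labels[i] for i in idx]

# Example: small_sents, small_labels = downsample(all_sents, all_labels)
```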
@orionw and @vaibhavad do correct me if something is missing.
Perfect! Thanks @KennethEnevoldsen! Just wondering, do you have a communication channel like Slack/Discord that we can jump into to discuss (and maybe contribute)?
You and @robro612 should have gotten an invite
Hi @KennethEnevoldsen, I am also happy to be added to the communication channel you use for discussion, if you don't mind.
@gentaiscool I sadly can't add you, but I believe @Muennighoff can. If this is about specific concerns feel free to reach out to me by mail as well: kenneth.enevoldsen@cas.au.dk
Will close this for now - the section in the paper is written and notebooks are available under scripts
The goal of this segment is to create meaningful benchmark subsets with a minimal set of tasks.
I believe the steps are as follows:
1) Construct an experimental subset. If people agree, I can construct one from the Scandinavian Embedding Benchmark. I imagine that this is both small enough and that we can assume there is cross-lingual transfer between the languages (so some expected redundancy).
2) Experiment with methods for subsampling (please provide suggestions):
If people agree with 1), we can start running the models on `tasks = mteb.get_tasks(languages=["dan", "nno", "nob", "swe"])`. Alternatively, we can also do it on MTEB (English), where we also expect some redundancy (e.g. between Arxiv P2P and S2S). The cross-lingual transfer might be more relevant given the multilingual focus.
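For reference, a run over that subset could look roughly like the sketch below (the model name is only an example, and the exact API may differ between mteb versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

tasks = mteb.get_tasks(languages=["dan", "nno", "nob", "swe"])
model = SentenceTransformer("intfloat/multilingual-e5-small")  # example model

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-small")
```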