Open KennethEnevoldsen opened 2 months ago
The paper referenced above provides a nice starting point for an evaluation metric that can substitute averaging. The proposed method uses instance-level and task-level rankings rather than scores to compute final system-level scores and ranks.
I am not sure how practically feasible it will be to store every instance-level performance for each task. Using rankings on tasks does have its limitations, such as not being sensitive to the difficulty of a task, or small differences in performance being ignored (as discussed in #752).
I am trying to think of a solution that resolves these issues while still being better than taking the mean. I have a half-baked idea, and it would be helpful to get some feedback on it.
Essentially, I think we should be using the distribution of model scores on a task to evaluate a new model. The distribution of scores is also how we ourselves judge the difficulty of the task. If the distribution is peaked at a low score, it means it is a difficult task.
So the evaluation that I have in mind is something like this: first, we'll have a set of reference models. For any task, we'll have the distribution of their scores. We then estimate the parameters of that distribution (this may require some assumptions, e.g. that the underlying distribution is Gaussian). The score of a new candidate model is then its percentile on that distribution. That way, a model that makes a breakthrough on a difficult task will be appropriately rewarded (a high percentile). This method retains the benefits of comparing against systems instead of averaging scores, while also quantifying task difficulty.
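As a minimal sketch of the idea (the function name and reference scores are made up, and it assumes the Gaussian assumption mentioned above), the percentile score could be computed like this:

```python
import math
import statistics

def percentile_score(candidate_score, reference_scores):
    """Score a candidate model as its percentile on the reference
    models' score distribution, assumed to be roughly Gaussian."""
    mu = statistics.mean(reference_scores)
    sigma = statistics.stdev(reference_scores)
    # Standard normal CDF evaluated at the candidate's z-score.
    z = (candidate_score - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# A "hard" task: reference models cluster around a low score.
hard_task = [0.21, 0.25, 0.22, 0.24, 0.23, 0.26]
# A breakthrough on the hard task lands near the 100th percentile.
print(round(percentile_score(0.40, hard_task), 3))  # -> 1.0
```

A small absolute improvement on a task where the reference distribution is tightly peaked translates into a large percentile gain, which is exactly the difficulty-awareness described above.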
As I said before, this is still a half-baked idea. Intuitively, I feel it retains the nice theoretical properties of the Borda count (proposed in the paper), but we may have to prove that formally and empirically.
One drawback of this evaluation is the selection of reference models. For benchmarking purposes, we'll have to keep it fixed, however, the benchmark will evolve over time and the set of reference models may not be a good representative set over time.
Curious to hear your thoughts on this.
> I am not sure how practically feasible it will be to store every instance-level performance for each task.
We do not have instance-level ranks, but for some tasks we have repeated runs (typically 10, used to calculate the std and CI). I don't think it is feasible, at least for the current iteration of the benchmark.
> I am trying to think of a solution that resolves these issues while still being better than taking the mean. I have a half-baked idea, and it would be helpful to get some feedback on it.
I have an idea as well, and it might be worth discussing these ideas jointly.
> If the distribution is peaked at a low score, it means it is a difficult task.
Or that there is a lot of noise (performance can't get any better); I am unsure how to differentiate between the two.
> first we'll have a set of reference models
We have discussed something like this for the ScandEval NLU benchmark. However, choosing a reference set is quite hard.
I believe our three options are:
Another approach that I will also add to the table is modeling it as a generalization factor (a latent factor, similar to IQ). This also allows for some hypothesis testing, e.g. do we believe that there is one underlying "language understanding" factor, or do we believe that a model has multiple, e.g. for language groups or for specific tasks?
Also worth mentioning is that there is no reason why we should require only one metric. We should just have a default in the dashboard.
After implementing the Borda count as a ranking mechanism, here is the change in rank for the top 20 models on the current leaderboard. The script is here.
Model | Overall score | Borda score | Original Rank | Borda Rank | Change in Rank |
---|---|---|---|---|---|
nvidia/NV-Embed-v1 | 69.3186 | 873 | 1 | 3 | -2 |
voyage-large-2-instruct | 68.2793 | 874 | 2 | 4 | -2 |
Linq-AI-Research/Linq-Embed-Mistral | 68.1745 | 570 | 3 | 1 | 2 |
Salesforce/SFR-Embedding-Mistral | 67.557 | 652 | 4 | 2 | 2 |
gte-Qwen1.5-7B-instruct | 67.3437 | 1147 | 5 | 8 | -3 |
Alibaba-NLP/gte-Qwen1.5-7B-instruct | 67.3436 | 1148 | 6 | 9 | -3 |
voyage-lite-02-instruct | 67.127 | 1153 | 7 | 10 | -3 |
GritLM/GritLM-7B | 66.7634 | 1042 | 8 | 6 | 2 |
intfloat/e5-mistral-7b-instruct | 66.6334 | 908 | 9 | 5 | 4 |
google-gecko.text-embedding-preview-0409 | 66.3136 | 1082 | 10 | 7 | 3 |
GritLM/GritLM-8x7B | 65.6568 | 1365 | 11 | 12 | -1 |
Alibaba-NLP/gte-large-en-v1.5 | 65.3905 | 1783 | 12 | 25 | -13 |
LLM2Vec-Meta-Llama-3-supervised | 65.0057 | 1686 | 13 | 21 | -8 |
LLM2Vec-Mistral-supervised | 64.8018 | 1679 | 14 | 20 | -6 |
jspringer/echo-mistral-7b-instruct-lasttoken | 64.6837 | 1723 | 15 | 23 | -8 |
mixedbread-ai/mxbai-embed-large-v1 | 64.683 | 1334 | 16 | 11 | 5 |
WhereIsAI/UAE-Large-V1 | 64.6357 | 1399 | 17 | 13 | 4 |
text-embedding-3-large | 64.5896 | 1877 | 18 | 28 | -10 |
voyage-lite-01-instruct | 64.4916 | 1795 | 19 | 26 | -7 |
Cohere/Cohere-embed-english-v3.0 | 64.4743 | 1635 | 20 | 17 | 3 |
There is some shuffling in the top 10, but as a set, the same 10 models remain in the top 10. The shift in ranks is much more prominent in models beyond the top 10.
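For reference, the aggregation behind the table can be sketched as follows (the actual script may differ in details such as tie handling; the data here is made up). Each task "votes" by ranking the models, and a model's Borda score is the sum of its per-task ranks, so lower totals rank higher:

```python
def borda_ranks(scores):
    """scores: {model: [score on task 1, score on task 2, ...]}.
    For each task, rank models (1 = best); a model's Borda score is
    the sum of its per-task ranks, so lower totals rank higher."""
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    totals = {m: 0 for m in models}
    for t in range(n_tasks):
        ordered = sorted(models, key=lambda m: scores[m][t], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return sorted(models, key=lambda m: totals[m]), totals

ranking, totals = borda_ranks({
    "A": [0.9, 0.4, 0.8],   # wins 2 of 3 tasks
    "B": [0.8, 0.9, 0.7],   # consistently first or second
    "C": [0.1, 0.3, 0.2],   # last on every task
})
print(ranking)  # -> ['A', 'B', 'C']
```

Note that only the per-task orderings matter, which is why a model's Borda rank can diverge from its mean-score rank whenever its score margins are very uneven across tasks.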
@vaibhavad can you also add a column with the actual scores?
@sivareddyg - I updated the comment above with actual scores
Should we add the Borda count as well? (I want to see how well it gives a notion of closeness.)
Another point is that I don't believe this metric considers task correlation. In the context of voting that is fine (it is what we want), but in the context of model development we don't want to bias our model ranking toward medical tasks just because we include both MedrxivClusteringS2S and MedrxivClusteringP2P.
Allow me a silly example, which I believe is adequate here: if we want to estimate a person's height (the model's ranking), measuring their right leg (task A) is a good first step. However, adding the second leg (task B) shouldn't add much information to our estimate of the height (rank). Measuring the torso (task C), though, should add more. Thus, assuming equal weight in votes seems problematic in our case, as some of the votes supply the same information.
It would be another thing if we believed our distribution of tasks represented the real-world use cases (which I don't believe is the case).
Why does this become important? When we, e.g., in https://github.com/embeddings-benchmark/mteb/issues/837, filter out correlated tasks (implicitly or explicitly), we believe that we don't lose too much information, but that might change the rank meaningfully (we can test this).
A simple solution is, of course, filtering tasks before we do the Borda count. However, it does annoy me that the metric is sensitive to adding correlated tasks (which should really only increase the certainty of our estimate, not make it poorer).
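A minimal sketch of such pre-filtering (the threshold, the greedy keep-first strategy, and the example scores are all assumptions on my part) could look like:

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length score vectors."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def filter_correlated_tasks(task_scores, threshold=0.95):
    """task_scores: {task: [score of model 1, model 2, ...]}.
    Greedily keep a task only if its score vector is not too
    correlated with any already-kept task."""
    kept = []
    for task, scores in task_scores.items():
        if all(abs(pearson(scores, task_scores[k])) < threshold for k in kept):
            kept.append(task)
    return kept

tasks = {
    "MedrxivClusteringS2S": [0.30, 0.50, 0.70, 0.60],
    "MedrxivClusteringP2P": [0.31, 0.52, 0.69, 0.61],  # near-duplicate
    "Banking77":            [0.80, 0.40, 0.55, 0.90],
}
print(filter_correlated_tasks(tasks))
# -> ['MedrxivClusteringS2S', 'Banking77']
```

This removes the duplicated "vote" before aggregation, but, as noted above, it is a workaround; it would be nicer if the aggregation metric itself were insensitive to adding correlated tasks.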
I might be missing something here, do let me know if that is the case.
Here is a proposed alternative: modeling it as a latent generalization factor.
Where for a given model $m$ and task $t$ we model the score as:

$S_{m,t} \sim \mathrm{Beta}(\alpha, \beta)$

where $\alpha$ and $\beta$ are parametrised as in beta regression, $\alpha = \sigma(g_m) \cdot \phi_t$ and $\beta = (1 - \sigma(g_m)) \cdot \phi_t$, so that the expected score is $\sigma(g_m)$ and $\phi_t$ acts as a task-specific precision.
Here $g_m$ is the g factor of model $m$ (note that this is quite similar to beta regression and to IQ models for humans). Note that this model can be expanded to, e.g., give a model separate g factors for specific domains or task types.
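As a minimal sketch of this model (assuming the standard beta-regression parameterization with mean $\sigma(g_m)$ and precision $\phi_t$; the function names and example values are hypothetical), the likelihood of an observed score could be evaluated as:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def beta_log_pdf(s, alpha, beta):
    """Log-density of Beta(alpha, beta) at a score s in (0, 1)."""
    log_norm = (math.lgamma(alpha + beta)
                - math.lgamma(alpha) - math.lgamma(beta))
    return log_norm + (alpha - 1) * math.log(s) + (beta - 1) * math.log(1 - s)

def score_log_likelihood(s, g_m, phi_t):
    """Log-likelihood of model m scoring s on task t, with the mean
    set by the model's g factor and the spread by the task's precision."""
    mu = sigmoid(g_m)          # expected score, driven by the g factor
    alpha = mu * phi_t
    beta = (1 - mu) * phi_t
    return beta_log_pdf(s, alpha, beta)

# A model with a higher g factor makes a high score more likely.
print(score_log_likelihood(0.8, g_m=1.0, phi_t=20.0) >
      score_log_likelihood(0.8, g_m=-1.0, phi_t=20.0))  # -> True
```

Fitting $g_m$ and $\phi_t$ by maximizing this likelihood over all observed (model, task) scores then yields both a point estimate and, in a Bayesian treatment, the posterior distributions over models mentioned below.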
Comparing the correlation we get:
This further gives us the option to compare models as distributions (estimates of uncertainty)
I tried it using the task reduction as well:
6 best, 14 best, 14 random
Surprisingly, 14 random tasks give a higher (Pearson) correlation with the original score. This is probably because several of the tasks are both the easiest to predict and the ones that correlate well with other tasks.
Correlation of the Borda count with mean averaging
That does look fairly reasonable as well.
I actually see that we are approaching this from two directions:

1) Social choice theory / election theory: predominantly concerned with ranking or selecting candidates. Many choices here sacrifice some important property (e.g., see Arrow's impossibility theorem). However, some of those concerns are less relevant in our "voting system": e.g., we do not believe that any of our "voters" have agency, so they can't attempt to "cheat" the system. Thus we might go through the existing choices and select the most appropriate one.
2) Psychology / intelligence literature: where the aim is to measure a latent factor that determines model generalization/quality etc.

(1) is generally geared toward selecting the model most preferred across all tasks, while (2) seeks to estimate a model's generalization capability. Luckily, at the moment the two approaches seem to agree in general, though they rely on quite different assumptions. For example, (2) can determine whether a task is relevant for gaining more information about the latent factor, while in (1) tasks (voters) are seen as equals (following democratic ideals).
I think this is a very reasonable thing to bring up during the writing of the section.
The goal of this section is to find a meaningful approach to aggregate scores across tasks.
Related to #837. Already discussed in #752.
I believe the task is as follows:
1) Get a meaningful sample to test on (see #837).
2) Find a reasonable set of approaches to test (and discuss pros and cons beforehand) - feel free to add more here: