embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

SummarizationEvaluator mean score issue #1156

Closed. YanshekWoo closed this issue 2 months ago

YanshekWoo commented 2 months ago

Code: https://github.com/embeddings-benchmark/mteb/blob/3a3f9cf9688a6dff860788dc5fa4bf9942b8b512/mteb/evaluation/evaluators/SummarizationEvaluator.py#L160

I have noticed that the final score of SummEval is calculated by applying np.mean to spearman_scores.

However, it seems that the items in spearman_scores are SignificanceResult objects, which contain both statistic and pvalue. An example of spearman_scores is as follows:

[SignificanceResult(statistic=0.36804249483125123, pvalue=0.16074763326319716), SignificanceResult(statistic=0.26212077736283135, pvalue=0.3267287471324084), ...]

When np.mean is applied, it computes the mean over all the statistic and pvalue values, instead of the mean of statistic alone.

I am not sure whether I am misunderstanding the code or the evaluation method. Please confirm and check this issue.
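For illustration, here is a minimal, self-contained sketch of this behavior (the data is random and only the return type matters; it assumes scipy >= 1.9, where spearmanr returns SignificanceResult objects):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# each "example" contributes 16 (human rating, model score) pairs, as in SummEval
results = [spearmanr(rng.random(16), rng.random(16)) for _ in range(100)]

np.mean(results)  # coerces each result to a (statistic, pvalue) pair and averages both fields
np.mean([r.statistic for r in results])  # the intended mean of the statistics only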

KennethEnevoldsen commented 2 months ago

Oh that seems like an issue. Thanks for making us aware of it!

I can confirm here that it is indeed a problem:

np.mean([p.statistic for p in pearson_scores])  # 0.3045311382736237
np.mean(pearson_scores)  # 0.2878194120961996

That is really not the behavior I would expect.

@Muennighoff, this seems to be quite an old error. I am afraid it might have been present in the MTEB paper as well. This means that changing it will effectively outdate the results of MTEB (using the fixed version):

from __future__ import annotations

import mteb

tasks = mteb.get_tasks(tasks=["SummEval"])
tasks[0].load_data()

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
eval = mteb.MTEB(tasks=tasks)
res = eval.run(model, overwrite_results=True)
# expected score: 30.81
res[0].get_score() # 0.259711030822813

A solution would be to create a legacy version of the dataset and then create a v2 of the dataset as well (SummEvalv2). That way, we would avoid outdating old results, but the old results would still contain the error. As it stands, SummEval has fairly low descriptive power, which might change given this fix.
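
For completeness, a minimal sketch of the kind of change such a fix would involve, assuming the evaluator keeps lists of raw scipy results (the helper name is illustrative; attribute names follow scipy >= 1.9):

import numpy as np

def mean_statistic(results):
    # average only the correlation statistics, ignoring the attached p-values
    return float(np.mean([r.statistic for r in results]))

# e.g. mean_statistic(spearman_scores) instead of np.mean(spearman_scores)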

YanshekWoo commented 2 months ago

Thank you for your response.

I believe there might be a more effective way to evaluate the results. Currently, SummEval includes only 100 independent examples, and each example's correlation is computed from just 16 pairs. This often yields correlations that are not statistically significant (high p-values). Consequently, SummEval has a substantial impact on the final average scores for English.

Would it be better to calculate the Spearman correlation across all pairs (16 * 100)? Additionally, incorporating more datasets and examples could further enhance the evaluation process.
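For concreteness, a rough sketch of the two aggregation options (the names and data are hypothetical stand-ins for SummEval's 100 examples with 16 summaries each; attribute names follow scipy >= 1.9):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human_scores = [rng.random(16) for _ in range(100)]
model_scores = [rng.random(16) for _ in range(100)]

# current approach: one correlation per example (16 pairs each), then averaged
per_example = np.mean(
    [spearmanr(h, m).statistic for h, m in zip(human_scores, model_scores)]
)

# suggested alternative: a single correlation over all 16 * 100 pairs
pooled = spearmanr(
    np.concatenate(human_scores), np.concatenate(model_scores)
).statistic

One caveat of the pooled variant is that it also compares ratings across different source documents, not just rankings within a single example.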

KennethEnevoldsen commented 2 months ago

> Consequently, SummEval has a substantial impact on the final average scores for English.

I am not quite sure I see that. Removing SummEval results in a very similar ranking, since almost all models obtain a score of ~29-31 on SummEval, while the MTEB average is taken across 56 datasets.

We have a version of MTEB coming out soon which includes MTEB lite. This version is notably faster. It might be a good place to also remove SummEval given the errors.

> Would it be better to calculate the Spearman correlation across all pairs (16 * 100)? Additionally, incorporating more datasets and examples could further enhance the evaluation process.

I think the first step is to fix the bug, but additional datasets would naturally be an improvement.

YanshekWoo commented 2 months ago

> We have a version of MTEB coming out soon which includes MTEB lite. This version is notably faster. It might be a good place to also remove SummEval given the errors.

Thanks for your reply and contribution to the community. Looking forward to your new work!

KennethEnevoldsen commented 2 months ago

Thanks @YanshekWoo - I will just keep the issue open as a reminder

Muennighoff commented 2 months ago

Thanks so much for finding this bug! Indeed, it seems to invalidate the SummEval scores in the MTEB paper, as the bug was present from the beginning (https://github.com/embeddings-benchmark/mteb/blob/20c22a919ae07314e3f93f2ddd808e87b0c7dbff/mteb/evaluation/evaluators/SummarizationEvaluator.py; the additional np.array does not change things), and the behavior was the same in previous versions of scipy (https://docs.scipy.org/doc/scipy-1.9.0/reference/generated/scipy.stats.pearsonr.html).

It will probably have little impact on the actual averages, as @KennethEnevoldsen explained, but I am looking forward to seeing whether scores are finally better on SummEval!