Closed YanshekWoo closed 2 months ago
Oh that seems like an issue. Thanks for making us aware of it!
I can confirm here that it is indeed a problem:
```python
>>> np.mean([p.statistic for p in pearson_scores])
0.3045311382736237
>>> np.mean(pearson_scores)
0.2878194120961996
```
That is really not the behavior I would expect.
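For context on why the two calls differ: in recent SciPy versions, `pearsonr` returns a result object that behaves like a `(statistic, pvalue)` tuple, so `np.mean` over a list of them averages the statistics and the p-values together. A minimal sketch with synthetic data (the scores here are random placeholders, not actual SummEval values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Five hypothetical per-document correlations over 16 score pairs each.
pearson_scores = [
    stats.pearsonr(rng.normal(size=16), rng.normal(size=16)) for _ in range(5)
]

# Each result is tuple-like, so the list converts to a (5, 2) array:
# column 0 holds the statistics, column 1 the p-values.
assert np.asarray(pearson_scores).shape == (5, 2)

# Intended: mean of the correlation statistics only.
correct = np.mean([p.statistic for p in pearson_scores])

# Buggy: mean over *all* statistics and p-values at once.
buggy = np.mean(pearson_scores)
```

Since p-values for weakly correlated data cluster well above zero, the buggy mean is pulled toward them, which matches the discrepancy shown above.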
@Muennighoff, this seems to be quite an old error. I am afraid it might have been present in the MTEB paper as well. This means that changing it will effectively outdate the results of MTEB (using the fixed version):
```python
from __future__ import annotations

import mteb

tasks = mteb.get_tasks(tasks=["SummEval"])
tasks[0].load_data()
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
eval = mteb.MTEB(tasks=tasks)
res = eval.run(model, overwrite_results=True)

# expected score: 30.81
res[0].get_score()  # 0.259711030822813
```
A solution would be to create a legacy version of the dataset and then create a v2 of the dataset as well (SummEvalv2). That way, we would avoid outdating old results, but the old results would still contain the error. As it stands, SummEval has fairly low descriptive power, which might change given this fix.
Thank you for your response.
I believe there might be a more effective way to evaluate the results. Currently, SummEval includes only 100 independent examples, with each example's score derived from 16 pairs. This often leads to results that are not statistically significant (high p-values). Consequently, SummEval has a substantial impact on the final average scores for English.
Would it be better to calculate the Spearman correlation across all pairs (16 * 100)? Additionally, incorporating more datasets and examples could further enhance the evaluation process.
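The two aggregation schemes can be compared on synthetic data. This is only a sketch of the proposal, with random placeholder scores rather than real SummEval annotations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical SummEval-like setup: 100 examples, 16 (human, model) score pairs each.
human = rng.normal(size=(100, 16))
model = human + rng.normal(scale=2.0, size=(100, 16))

# Current approach: one Spearman correlation per example, then averaged.
per_example = np.mean(
    [stats.spearmanr(h, m).statistic for h, m in zip(human, model)]
)

# Proposed alternative: a single Spearman correlation over all 16 * 100 pairs.
pooled = stats.spearmanr(human.ravel(), model.ravel()).statistic
```

With only 16 points per correlation, the per-example estimates are noisy; pooling trades that noise for sensitivity to scale differences across examples, so the two numbers answer slightly different questions.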
> Consequently, SummEval has a substantial impact on the final average scores for English.
I am not quite sure that I see that? Removing SummEval results in a very similar ranking, as almost all models obtain a SummEval score of ~29-31 on MTEB, which is averaged across 56 datasets.
We have a version of MTEB coming out soon which includes MTEB lite. This version is notably faster. It might be a good place to also remove SummEval given the errors.
> Would it be better to calculate the Spearman correlation across all pairs (16 * 100)? Additionally, incorporating more datasets and examples could further enhance the evaluation process.
I think the first step is to fix the bug, but additional datasets would naturally be an improvement.
> We have a version of MTEB coming out soon which includes MTEB lite. This version is notably faster. It might be a good place to also remove SummEval given the errors.
Thanks for your reply and contribution to the community. Looking forward to your new work!
Thanks @YanshekWoo - I will just keep the issue open as a reminder
Thanks so much for finding this bug! Indeed it seems to invalidate SummEval scores in the MTEB paper as the bug was present from the beginning (https://github.com/embeddings-benchmark/mteb/blob/20c22a919ae07314e3f93f2ddd808e87b0c7dbff/mteb/evaluation/evaluators/SummarizationEvaluator.py ; the additional np.array does not change things) and in previous versions of scipy this was the same (https://docs.scipy.org/doc/scipy-1.9.0/reference/generated/scipy.stats.pearsonr.html).
It probably will have little impact on the actual averages as @KennethEnevoldsen explained but looking forward to see if scores are now finally better on SummEval!
Code: https://github.com/embeddings-benchmark/mteb/blob/3a3f9cf9688a6dff860788dc5fa4bf9942b8b512/mteb/evaluation/evaluators/SummarizationEvaluator.py#L160
I have noticed that the final score of SummEval is calculated as the `np.mean` of `spearman_scores`. However, it seems that the items in `spearman_scores` are `SignificanceResult` objects, which contain both a `statistic` and a `pvalue`. When applying `np.mean`, it calculates the mean over all the `statistic` and `pvalue` values, instead of the mean of `statistic` alone. I don't know if there is some misunderstanding of the code or the evaluation method. Please confirm and check this issue.
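For illustration, a list shaped like `spearman_scores` can be reproduced with synthetic data; the values below are random placeholders, not output from the actual evaluator:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three hypothetical documents with 16 (human, model) score pairs each.
spearman_scores = [
    stats.spearmanr(rng.normal(size=16), rng.normal(size=16)) for _ in range(3)
]
# Each entry prints as SignificanceResult(statistic=..., pvalue=...).

# Buggy aggregation: each result behaves like a (statistic, pvalue) tuple,
# so np.mean averages the two fields together.
buggy = np.mean(spearman_scores)

# Fixed aggregation: average only the correlation statistic.
fixed = np.mean([s.statistic for s in spearman_scores])
```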