You need to read this: https://huggingface.co/blog/evaluating-mmlu-leaderboard. Obviously, only the original implementation of MMLU is the valid one.
Hi, thanks for raising this as an issue, and apologies for any trouble this may have caused!
As mentioned by @vince62s, we ultimately believe it's crucial to have parity between our implementations and the official implementation released by the authors, which is what #497 implements. Though prompting and evaluation methods can be finicky and tend to affect results, in the end we must default to what the benchmark creators implemented wherever possible, in lieu of better options.
The Open LLM Leaderboard team has been made aware of the update and is switching its scoring for all models over to the new implementation, ensuring that all of their models reflect it and can serve as a reference. The current system for dealing with updates to tasks is that we recommend reporting the task's version number: post-PR #497 is task version 1 of MMLU and pre-PR #497 is task version 0, though this isn't perfect and may not always be reported.
We're working on releasing a new version of the library soon (in the big-refactor branch) that will allow for more configurability and transparency in the prompting setup, and we intend to provide further recommendations on how to report results from those configured tasks transparently and clearly, to avoid situations where different papers evaluate on different benchmark variants. We also hope to explore averaging across prompts and other methods that might be less sensitive to slight prompt modifications in the future.
Hope this helps explain!
@vince62s Thanks for pointing me to this blog; it's nice to see the community making an effort to unify this. But I disagree with the claim that "only the original implementation of MMLU is the valid one". The blog actually treats how to evaluate MMLU as an open question, and in the end it is humans who make the final judgment. All three evaluation methods have their own merits, though obviously things are simpler if everyone goes with the original implementation.
In fact, I would vote for the HELM MMLU evaluation. First, its results are not significantly different from the original implementation's. Second, the HELM MMLU evaluation does not depend on the LLM's log probs, so commercial LLMs like ChatGPT-3.5-turbo and GPT-4 can also be evaluated and compared with open-source LLMs.
@haileyschoelkopf Would it be possible to also have something like hendrycksTestHELM in the repo, so that we can evaluate not only open-source LLMs but also commercial ones? Even for open-source LLMs, hendrycksTestHELM could do the evaluation without loading the model, as long as an inference API is provided.
I would expect to see a general LLM Leaderboard in the future :), not just an Open LLM Leaderboard. Since arc_challenge, hellaswag, hendrycksTest, and truthfulqa_mc are all multiple-choice tasks, a HELM-style version of them could be standardized, so this is definitely possible.
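For concreteness, here is a rough sketch of what such a generation-style evaluation could look like (placeholder names, not the actual HELM or harness code). It only needs a prompt-in/text-out callable, so the same loop works for a local model, an inference API, or a commercial API:

```python
import re

def score_mmlu_generation(examples, generate):
    """Score multiple-choice questions by generating text and parsing the letter.

    `examples`: iterable of dicts with "question", "choices" (4 strings), and
    "answer" (0-3). `generate`: any prompt -> text callable (local model,
    inference API, commercial API, ...). Few-shot exemplars could be prepended
    to the prompt in the same format.
    """
    correct = total = 0
    for ex in examples:
        options = "\n".join(f"{letter}. {choice}" for letter, choice in zip("ABCD", ex["choices"]))
        prompt = f"{ex['question']}\n{options}\nAnswer:"
        completion = generate(prompt)
        match = re.search(r"\b([ABCD])\b", completion)  # first letter mentioned, if any
        predicted = match.group(1) if match else None
        correct += int(predicted == "ABCD"[ex["answer"]])
        total += 1
    return correct / max(total, 1)
```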
Yes, I think supporting the hendrycks_test_* tasks and the HELM generation-style version under helm_hendrycks_test_* seems like a good call for the main branch for the time being! @taoari if you are interested in contributing this variant, that would be greatly helpful, else I'll make time to do so very shortly!
We'll certainly support multiple variants in the upcoming version release, but will likely still mark the original implementation as the "default".
@haileyschoelkopf Please go ahead with the implementation. I tried using rf._model_generate to get the LLM-generated results but ran into an error. I currently only have a general idea of the framework design, not the details, so I think it would be much faster if you implemented it.
@haileyschoelkopf Just found that the evaluation can be done using rf.greedy_until. I've opened PR #620; please have a look.
@haileyschoelkopf please read this: https://github.com/hendrycks/test/pull/13. I think lm-evaluation-harness is also affected by what I outlined there for BPE-based models. In a nutshell: when we perform 5-shot, we put a space between "Answer:" and the letter of the answer, but when "testing" with the actual question we put "Answer:" and the model will generate a space plus the letter. For SentencePiece models this is not an issue because logits["A"] is the same as logits[" A"]. However, BPE models encode the space, so we are gathering the logits of "A", "B", "C", "D" instead of " A", " B", " C", " D". Even if it apparently makes no difference to the "total" score because of the random nature of MCQA, it is absolutely different.
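To make this concrete, here is a quick check anyone can run (the two tokenizer names are just examples of a BPE model and a SentencePiece model; any such pair shows the same pattern):

```python
from transformers import AutoTokenizer

# Compare how a BPE tokenizer (GPT-NeoX style) and a SentencePiece tokenizer
# (Llama style) encode the answer letter on its own vs. as it actually appears
# after "Answer:" in the few-shot examples.
for name in ["EleutherAI/gpt-neox-20b", "huggyllama/llama-7b"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name)
    print("  'A'          ->", tok.encode("A", add_special_tokens=False))
    print("  ' A'         ->", tok.encode(" A", add_special_tokens=False))
    print("  'Answer: A'  ->", tok.encode("Answer: A", add_special_tokens=False))
# For the BPE tokenizer, the standalone "A" and the token that follows "Answer:"
# are different ids, so gathering logits for "A".."D" is not the same as
# gathering logits for the tokens the model actually saw in the demonstrations.
```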
Thanks for sharing, I had not seen that!
It seems, based on the script provided by the MMLU authors (https://github.com/hendrycks/test/blob/master/evaluate.py), that their original GPT-3 eval code also suffers from this issue. So the options available to us are:

1. " A" in context, check logits of " A"
2. "A" in context, check logits of "A"

I guess if we're seeking to match the original code for MMLU, we should do 1. But this doesn't seem as meaningful as actually evaluating with a consistent setting.
Will try to run comparisons on this soon!
When things are done wrong, I think they need to be corrected. Again, this impacts only BPE-based models; Llama, for instance, is not impacted (it is SentencePiece-based). The numbers are not way off, but I ran a comparison, and the number of test items for which softmax("A", "B", "C", "D") is not equal to softmax(" A", " B", " C", " D") is very significant.
sigh this is yet another reason to be against religious adherence to the “official” implementation of things.
@vince62s something to put on your radar is that we (EleutherAI in general, but especially Hailey and I) have faced a lot of criticism recently for not using the official implementation from day 1. We originally decided to implement tasks in ways we felt were sensible and consistent, but as this library has grown in use and profile outside EleutherAI (it just passed 100 citations, per Google Scholar), there has been a lot of attention to the differences. Some people have gone as far as to publicly accuse us of deliberately committing fraud by seeking to implement evaluations that make our models look better.
Unfortunately, correcting people’s misunderstandings about our code base is an extremely time intensive and exhausting task that has brought me quite a lot of frustration and anguish. There are no good options from my point of view right now.
I fully understand your frustration. I have spent the last week sorting things out, and my 50 cents: making things right will 1) not change the Llama score, which has been the biggest flaw so far, 2) not change any SentencePiece-based models (Llama, OpenLLaMA), and 3) correct things for BPE models (Falcon, RedPajama, MPT, XGen) without really altering the overall score (because of the randomness explained above). My feeling is that as time goes on and models become more powerful, scoring the wrong way MAY affect how they are judged in the future.
Also, I have only looked at MMLU, but my guess is that this could also impact other MCQA benchmarks (I have no clue about those).
At least the discussion is on..... :)
Just wanted to add a bit more mess to this :) The Flan-T5 paper reports MMLU 52.4; I recomputed it as 49.3, and my number matches the one in the "Flacuna" paper here: https://arxiv.org/pdf/2307.02053.pdf
The reason is that the Flan-T5 paper uses different prompts, as per appendix D:
D.1 MMLU: For five-shot MMLU, we use the "dev" set as few-shot exemplars. Here, we report the "validation" set performance of individual tasks in MMLU (see https://www.tensorflow.org/datasets/community_catalog/huggingface/hendrycks_test). All MMLU results in this paper are on the "validation" set except Table 4 and Table 1, which are test set results. The prompts that we use for MMLU can be found at https://github.com/jasonwei20/flan-2. The prompts for STEM datasets are from Lewkowycz et al. (2022), with minor formatting fixes.
As you can see, we are still in a grey area. (Reminds me of when, 5 years ago, there were 4 or 5 different BLEU scores for NMT, until sacreBLEU aligned everyone.)
I think I've found the reason for the abnormally low few-shot MMLU scores for GPT-NeoX-20B using my fork of the Hendrycks code, thanks to @vince62s. The reason is that, depending on the tokenizer, few-shot samples with endings like "Answer: A" show the model different most-probable tokens: for some tokenizers, the standalone "A" is equal to the " A" seen in context, and for some it's not.
I hope that the illustration explains it better.
So in the case of Llama, we show examples ending with token 319 and then use the logprob of token 319 to check whether the answer is A (because we don't add a space at the end of the sample). In the NeoX case, we show examples ending with token 329 but then use the logprob of token 34.
To summarize: the original approach of one forward pass and selecting by logprob is not tokenizer-agnostic, and should be abandoned in favor of comparing the 4 fully tokenized samples.
The current MMLU implementation of lm-eval-harness remains correct as it does not rely on this logprob selection.
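For reference, a minimal sketch of the tokenizer-agnostic scoring I mean (placeholder model name, not the harness's actual code): tokenize each full "prompt + answer" string and compare the summed log-probabilities of the answer tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Summed log-prob of the tokens that `answer` adds on top of `prompt`,
    computed on the tokenization of the *full* string. Assumes the prompt's
    tokenization is a prefix of the full tokenization, which holds for
    prompts ending in ":"."""
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..n-1
    targets = full_ids[0, 1:]
    answer_positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(logprobs[i, targets[i]].item() for i in answer_positions)

prompt = "...\nAnswer:"  # the full few-shot MMLU prompt goes here
scores = {c: answer_logprob(prompt, " " + c) for c in "ABCD"}
print(max(scores, key=scores.get))
```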
One final note: @ollmer is correct that the MMLU scores are in line with the original implementation, EXCEPT for slight differences because the harness strips the leading spaces of each prompt (and there are plenty) while the OG code leaves them in. Not a big deal except for some tasks (e.g. formal_logic on MPT-7B: 28.6 vs 19.8, go figure).
Besides this, once we make sure the dtype is the same, it's fine.
Last but not least: the harness is really slow because it runs the full forward pass 4 times, once per A/B/C/D, even though the prompt (which can be very long) is the same for all of them except the last token. In the end it is ... 4 times slower, so if the folks at HF process all their models with it, it will take forever.
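Done with the right ids, a one-pass variant would avoid this, since the four continuations differ only in their final token. A rough sketch (placeholder model name; the key point is that the gathered ids must be the tokens that actually follow "Answer:" in the tokenized sequence, as discussed above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "...\nAnswer:"  # the full few-shot MMLU prompt goes here
prompt_ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    last_logits = model(prompt_ids).logits[0, -1]  # next-token distribution

# One id per letter: the last id of "Answer: X", i.e. the token that really
# follows the prompt (for BPE tokenizers this is " X", not "X").
letter_ids = [tok.encode(f"Answer: {c}", add_special_tokens=False)[-1] for c in "ABCD"]
letter_logprobs = torch.log_softmax(last_logits, dim=-1)[letter_ids]
print("ABCD"[letter_logprobs.argmax().item()])
```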
PR #497 (https://github.com/EleutherAI/lm-evaluation-harness/pull/497) breaks the evaluation of MMLU; it would be great if this commit could be reverted.
There are 3 major leaderboards for LLM evaluation:
All three have an MMLU evaluation, but even for the same model, the MMLU scores differ.
The Open LLM Leaderboard is probably the most complete and important reference for LLM performance. Since evaluating LLMs is computationally expensive, it is generally not possible for ordinary users to evaluate an LLM themselves, so it would be great not to break compatibility with it, in order to keep results comparable with the Leaderboard. Otherwise, the MMLU evaluation simply becomes unusable, since the results cannot be compared with any of these three.
Besides, I believe many authors are also using the pre-PR #497 version to evaluate and publish results in papers. As can be seen here, some of the published results match the pre-PR #497 version, but not the post-PR #497 one. So, this is another reason not to break compatibility.