Closed: clefourrier closed this issue 9 months ago
Some LLMs like Dolphin 2.1 are only scoring around 7.5 on DROP despite being much better at answering the questions than other LLMs with much higher DROP scores.
It's possible that verbose responses that take the additional step of explaining how the answer was derived are being incorrectly judged as false. I say this because I'm picking up on a pattern that "less intelligent" LLMs that give short naked answers are getting higher DROP scores than "more intelligent" LLMs that are actually answering more questions correctly but are giving more verbose explanatory answers.
For what it's worth, on DROP I noticed some similar trends (see the `float label + answer`, `answer number + space`, and `long output` buttons on the left here: https://hub.zenoml.com/project/cabreraalex/DROP%20OpenLLM%20Leaderboard). Not sure if this is a feature or a bug... ideally short and sweet would be better :)

EDIT: Oh this is fun, I added Falcon-180B, which had a terrible DROP score, and found similar trends: its answers never match `[0-9].[0-9]`, so it's wrong on every float number question.

Thank you for opening this issue!
In general, we follow the following priority list for addressing concerns about prompting and other eval details:
These are guidelines and not rules, and can be overruled in special circumstances.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.
Unfortunately, we haven't always done a great job of documenting our implementation decisions, so the first steps to take are:
If anyone would like to help out with this investigation, please feel free to tackle any of the next steps and post the results here.
I couldn't sleep, so I decided to look into this. Note that according to https://github.com/EleutherAI/lm-evaluation-harness/pull/213, the current DROP implementation is allegedly the official one (perhaps with an issue regarding multiple spans).
Our WinoGrande implementation appears to be unchanged from its original implementation per its file. Looking at Brown et al. (2020), they have the following to say about their evaluation protocol: This description appears to be consistent with our implementation, so I think it's safe to say it's the source. The question remains whether there is widespread agreement about using another method. Unfortunately the official method likely is inapplicable as it was not developed for causal language models.
With regards to the DROP benchmark, I fully agree with Stella's point above. Further, since the benchmark was not developed for causal language models, I don't think it's appropriate to include it in the Open LLM leaderboard, for two reasons: (1) the benchmark in its official implementation is flawed for causal language models, as evidenced by the inconsistent performance of the top-ranking pretrained LLMs (see attached photo); (2) as one of the 7 benchmarks considered for the leaderboard, it carries a lot of weight and can skew the average by quite a lot. It's not great when the average ranking closely tracks the ranking on an unsuitable benchmark.
I want to voice support for temporarily removing this benchmark until the community agrees on a more suitable approach.
I agree that DROP needs to be weighted differently. Perhaps using something like z-score transformation.
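To illustrate the z-score idea, here is a minimal sketch (not leaderboard code; the data layout and function name are assumptions for illustration) that standardizes each benchmark's scores across models before averaging, so a benchmark with an unusual score range cannot dominate the mean:

```python
# Illustrative sketch: average per-benchmark z-scores instead of raw scores,
# so one benchmark's scale (e.g. DROP's compressed range) can't dominate.
from statistics import mean, stdev

def zscore_average(scores_by_benchmark):
    """scores_by_benchmark: {benchmark: {model: raw_score}}.
    Returns {model: mean of that model's per-benchmark z-scores}."""
    z = {}
    for bench, by_model in scores_by_benchmark.items():
        vals = list(by_model.values())
        mu, sigma = mean(vals), stdev(vals)
        for model, score in by_model.items():
            z.setdefault(model, []).append((score - mu) / sigma)
    return {model: mean(zs) for model, zs in z.items()}
```

A benchmark where everyone scores between 5 and 7 then contributes as much spread to the average as one where scores range 30 to 70, instead of being drowned out (or, conversely, reshuffling the ranking).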
Bear in mind that I'm just a user who doesn't program beyond simple scripts or have a deep understanding of how LLMs work. But since multiple choice tests bottom out at ~25 with random guesses, and any 1b parameter model or larger that uses a simple web scrape and standard training methods, such as Falcon 1b, inevitably scores an average of 35+ on multiple choice tests (e.g. Falcon 1b scored 35.07, 63.56, 25.28, 35.96 on the previous four), the real-world range of multiple choice LLM tests is 35-100, with a 5 point gain on any of them resulting in a clearly evident boost in LLM performance.
So dropping a 0-100 test like DROP into the mix and applying a simple average with the other tests is almost randomly shuffling LLM rankings in a way that doesn't represent their overall knowledge, creativity or problem solving abilities. To make matters worse, when I shorten the passages and drop them in 0-shot the "smarter" LLMs that scored <10 on DROP are outperforming many that scored >40. So it really isn't measuring skills, such as comprehension, extraction and processing with things like math and reason, until after an LLM can precisely parse a very large expanse of text and output an answer in the exact format expected by the evaluator.
In general I agree with @albertqjiang and @Phil209, though I think it would be better to have a discussion about what task(s) should be included in the benchmark on the HF page. This issue regards whether the tasks are implemented correctly. It appears that the answer is yes for Winogrande and probably for DROP, though I haven't had time to read through the code and test it yet.
Another funny thought is that, although we treat causal decoders as first-class citizens in this library we do now support encoder-decoder models and decoder-only models with a finetuned head on them. Perhaps we should re-introduce the original Winogrande for people who wish to study those types of models.
@StellaAthena Thank you for your thoughtful explanation of the Harness process for eval additions 🤗
I took some time to compare the DROP implementations (Harness vs AllenAI) side by side and can confirm that they are the same. Sadly the original repo has been archived so I can't ask them there why they made these normalization choices.
@clefourrier is there anything else to discuss here then, or can I close this issue?
Hi!
We're happy to share that we've extended the set of Eleuther AI Harness tasks we cover in the Open LLM leaderboard (the full update and communications will be public in a few hours).
We've added DROP, GSM8K and WinoGrande. Evaluating them on the 2000+ models already on the leaderboard took the equivalent of one year of GPU time, making it quite a large effort (maybe one of the largest single runs of the harness in the wild?).
Anyway, we wanted to thank Eleuther and the community around the harness very much for making this work and library available to everyone; it is such a great and useful resource for all :hugs:
By diving into all these results, we observed that a few of these new tasks were implemented in a way that was not exactly what we were expecting, so we are turning back to the wisdom of the Harness community to discuss what would be best to do.
DROP
DROP is a generative benchmark, with exact match and F1 score computed on the bag of words of the normalized generation and the normalized gold reference. However, this normalization will in some cases ignore a correct numerical answer when it is directly followed by a whitespace character other than a simple space. Let's look at an example.
Let's take the generation `10\n\nPassage: The 2011 census recorded a population of 1,001,360, an increase of 10`, where the gold is `10`.

First, the sentence is tokenized on the separators space and `-` (a split on `" |-"`). There is no such separator in `10\n\nPassage:`, which is therefore considered one token; on the other hand, the ending `10` becomes its own token. Then, punctuation is removed, which leaves `10\n\nPassage`, and numbers are homogenized (every string that can be cast to float is considered a number, cast to float, then re-converted to string). At this step, `10\n\nPassage` stays as such, whereas the ending `10` becomes `10.0`. A lot of other normalization steps ensue (removing articles, removing other whitespace, ...) and our original example becomes `10 passage 2011.0 census recorded population of 1001360.0 increase of 10.0`.

However, the score is not computed on the string, but on the bag of words extracted from it, here `{'recorded', 'population', 'increase', 'passage', 'census', 'of', '2011.0', '1001360.0', '10', '10.0'}`, which is compared with the bag of words of the gold, normalized in the same manner, hence going from `10` to `{'10.0'}`. The initial `10`, before the `\n\nPassage`, having stayed `10`, will not match `10.0`.

TLDR: If a number is followed by any kind of whitespace other than a simple space, it will not pass through the number normalization (it is never cast to float), and hence will never match the gold if the gold is also a number (which does get cast to float through normalization). This makes the evaluation stricter than we expected with respect to what models can predict.
Is this something you think needs changing, or is it the intended behavior? :eyes:
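To make the failure mode concrete, here is a simplified sketch of the normalization described above (it is not the actual AllenAI/harness code: it only tokenizes, strips punctuation, and canonicalizes numbers, omitting the later article- and whitespace-removal steps, so the fused token stays fused, but the key point survives: a number followed by `\n` never reaches the float cast):

```python
# Simplified, illustrative sketch of the DROP-style answer normalization
# discussed above -- NOT the real evaluator code.
import re
import string

def _normalize_number(token: str) -> str:
    # Any token that parses as a float is canonicalized, e.g. "10" -> "10.0".
    try:
        return str(float(token))
    except ValueError:
        return token

def normalize(text: str) -> set:
    # Tokenize on spaces and hyphens only (split on the regex " |-"),
    # strip punctuation, canonicalize numeric tokens, return a bag of words.
    tokens = re.split(" |-", text.lower())
    tokens = ["".join(ch for ch in t if ch not in string.punctuation) for t in tokens]
    return {_normalize_number(t) for t in tokens if t}

gold = normalize("10")  # -> {'10.0'}
pred = normalize("10\n\nPassage: The 2011 census recorded a population "
                 "of 1,001,360, an increase of 10")
# pred contains the fused token '10\n\npassage' (float() fails on it),
# so the leading 10 can never match the gold '10.0'.
```

Run on the example from the issue, the leading `10` survives only inside the non-numeric token `10\n\npassage`, while the gold becomes `{'10.0'}`: exact match is impossible no matter how correct the answer is.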
WinoGrande
WinoGrande is a loglikelihood benchmark. It uses a context and different choices associated with this context. For each of those choices (2 in the case of WinoGrande), we compute the loglikelihood independently and check which one is most likely to be generated by the model. Among those choices there is a correct answer: if the correct choice has the highest chance of being generated, then the model passes the test.
Here is an example of context / choices from the WinoGrande dataset.
The goal here is to determine which of the two choices has the highest chance of being generated by the model where the `_` is located. One way to do this with loglikelihood evals is to split the example into a context and a choice, compute the loglikelihood of each choice, and pick the highest one. We would have expected it to be:
However, the LM Eval Harness applies a different split. We were wondering why this split selection was chosen? :eyes:

When comparing the two methods, we observed that scores are a bit higher with the Harness split, but the rankings seem preserved.
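As an illustration of the kind of split choice in question, here is a hypothetical sketch. Everything in it (the sentence, the choice strings, the split functions, and the assumed `loglikelihood(context, continuation)` scorer) is made up for illustration and is not the harness API; we sketch two plausible strategies: score `choice + suffix` given the prefix, versus score the shared suffix given the prefix with the choice filled in.

```python
# Hypothetical sketch of two ways to split a WinoGrande item for
# loglikelihood scoring; names and data are illustrative only.

SENTENCE = "The trophy doesn't fit in the suitcase because _ is too big."
CHOICES = ["the trophy", "the suitcase"]
PREFIX, SUFFIX = SENTENCE.split("_")

def split_a(choice):
    # Strategy A: condition on the text before the blank and score the
    # choice together with the rest of the sentence.
    return PREFIX, choice + SUFFIX

def split_b(choice):
    # Strategy B: substitute the choice into the blank, condition on
    # everything up to and including it, and score the shared suffix.
    return PREFIX + choice, SUFFIX

def pick(split, loglikelihood):
    # `loglikelihood(context, continuation)` is an assumed model scorer,
    # e.g. the sum of token logprobs of `continuation` given `context`.
    return max(CHOICES, key=lambda c: loglikelihood(*split(c)))
```

Note the structural difference: under strategy B both choices share the same continuation and only the conditioning context differs, whereas under strategy A the scored continuation itself differs per choice, which plausibly explains why the two methods give slightly different absolute scores while preserving rankings.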
Conclusion
That's it for our questions and observations!
Thank you very much for reading this issue, and again thank you for all the work which goes into this cool library :hugs:
If you feel like any of this is not expected behavior, we'll be delighted to give you a hand with a fix if needed; and if you feel the behavior is expected, that's also very good to know.