EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Some questions on the DROP and WinoGrande Harness implementations #978

Closed clefourrier closed 9 months ago

clefourrier commented 10 months ago

Hi!

We're happy to share with you that we've extended the tasks of the Eleuther AI Harness that we are covering in the Open LLM leaderboard (full update and communications going to be public in a few hours).

We've added DROP, GSM8K and WinoGrande. Evaluating them on the 2,000+ models already on the leaderboard took the equivalent of one year of GPU time, making it quite a large effort (maybe one of the largest single runs of the harness in the wild?).

Anyway, we wanted to thank Eleuther and the community around the harness very much for making this work and library available to everyone; it is such a great and useful resource for all :hugs:

By diving into all these results, we observed that a few of these new tasks were implemented in a way that was not exactly what we were expecting, so we are turning back to the wisdom of the Harness community to discuss what would be best to do.

DROP

DROP is a generative benchmark with exact match and F1 scores computed on the bags of words of the normalized generation and the normalized gold reference. However, this normalization will in some cases ignore correct numerical answers when they are directly followed by a whitespace character other than a simple space. Let's look at an example.

Let's take the generation `10\n\nPassage: The 2011 census recorded a population of 1,001,360, an increase of 10,`, where the gold is `10`. First, the string is tokenized on separators (spaces and hyphens). There is no such separator in `10\n\nPassage:`, which is therefore considered a single token; on the other hand, the final `10` becomes its own token. Then, punctuation is removed, which leaves `10\n\nPassage`, and numbers are homogenized (every string that can be cast to a float is considered a number, cast to float, then re-converted to a string). At this step, `10\n\nPassage` stays as such, whereas the final `10` becomes `10.0`. A number of other normalization steps ensue (removing articles, fixing the remaining whitespace, ...) and our original example becomes `10 passage 2011.0 census recorded population of 1001360.0 increase of 10.0`. However, the score is not computed on the string, but on the bag of words extracted from the string, here `{'recorded', 'population', 'increase', 'passage', 'census', 'of', '2011.0', '1001360.0', '10', '10.0'}`, which is compared with the bag of words of the gold, also normalized in the above manner, hence going from `10` to `{'10.0'}`. The initial `10`, stuck to the `\n\nPassage:` that follows it, never goes through the number normalization and stays `10`, so it will not match `10.0`.
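
For reference, here is a rough, self-contained sketch of this normalization pipeline as we understand it (modelled on the AllenAI `drop_eval` logic; the exact harness code may differ in its details):

```python
import re
import string

def _is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False

def _remove_punc(text: str) -> str:
    # Punctuation is stripped, except from tokens that already parse as numbers.
    if _is_number(text):
        return text
    return "".join(ch for ch in text if ch not in set(string.punctuation))

def _normalize_number(text: str) -> str:
    # Anything castable to float is homogenized: "10" -> "10.0", "1001360" -> "1001360.0".
    return str(float(text)) if _is_number(text) else text

def _remove_articles(text: str) -> str:
    return re.sub(r"\b(a|an|the)\b", " ", text)

def _white_space_fix(text: str) -> str:
    return " ".join(text.split())

def normalize(answer: str) -> str:
    tokens = re.split(" |-", answer)  # tokenization only splits on spaces and hyphens
    tokens = [
        _white_space_fix(_remove_articles(_normalize_number(_remove_punc(tok.lower()))))
        for tok in tokens
    ]
    return " ".join(tok for tok in tokens if tok.strip())

generation = "10\n\nPassage: The 2011 census recorded a population of 1,001,360, an increase of 10,"
print(normalize(generation))
# -> 10 passage 2011.0 census recorded population of 1001360.0 increase of 10.0
print(set(normalize(generation).split()))  # bag of words of the prediction
print(set(normalize("10").split()))        # bag of words of the gold: {'10.0'}
```

The leading `10` ends up inside the token `10\n\nPassage:`, so it is never cast to float and survives only as the bare string `10`, which cannot match the gold's `10.0`.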

TLDR: If a number is followed by any kind of whitespace other than a simple space, it will not go through the number normalization (and not be cast to float), and hence will never match the gold if the gold is itself a number (which did get cast to float through the normalization). This makes the evaluation stricter than we expected regarding what models can predict.

Is this something you think needs changing, or is this the intended behavior? :eyes:

WinoGrande

WinoGrande is a loglikelihood benchmark. It uses a context and different choices associated with this context. For each of these choices (2 in the case of WinoGrande), we compute the loglikelihood independently and check which one is most likely to be generated by the model. Among these choices there is a correct answer: if the correct choice has the highest likelihood of being generated, the model passes the test.

Here is an example of context / choices from the WinoGrande dataset.

| Example | Choice 1 | Choice 2 |
|---|---|---|
| I helped my sister find her gold necklace. She couldn't wear her woven necklace to the ball because the _ was so casual. | woven necklace | ball |

The goal here is to determine which of the two choices has the most chance of being generated by the model where the _ is located.

One way to do this with loglikelihood evals is to split the example into a context and a choice, compute the loglikelihood of each choice, and pick the highest one. We would have expected the split to be:

| Context | Choice 1 | Choice 2 |
|---|---|---|
| I helped my sister find her gold necklace. She couldn't wear her woven necklace to the ball because the | woven necklace was so casual. | ball was so casual. |

However, the LM Eval Harness applies a different split.

| Context 1 | Context 2 | Choice (loglikelihood computed on this) |
|---|---|---|
| I helped my sister find her gold necklace. She couldn't wear her woven necklace to the ball because the woven necklace | I helped my sister find her gold necklace. She couldn't wear her woven necklace to the ball because the ball | was so casual. |
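
To make the two strategies concrete, here is a rough, self-contained sketch of both splits with a Hugging Face causal LM (this is our own illustration, not the harness code; `gpt2` is just a placeholder checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint; any causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..N-1
    cont_len = cont_ids.shape[1]
    targets = input_ids[0, -cont_len:]
    return log_probs[-cont_len:].gather(1, targets.unsqueeze(1)).sum().item()

sentence = ("I helped my sister find her gold necklace. She couldn't wear her "
            "woven necklace to the ball because the _ was so casual.")
options = ["woven necklace", "ball"]
blank = sentence.index("_")

# Harness-style split: each option is folded into the context,
# and the shared suffix ("was so casual.") is what gets scored.
harness_scores = [
    continuation_logprob(sentence[:blank] + opt, " " + sentence[blank + 2:])
    for opt in options
]

# "Expected" split: the shared prefix is the context,
# and "<option> was so casual." is what gets scored.
expected_scores = [
    continuation_logprob(sentence[:blank].rstrip(), " " + opt + " " + sentence[blank + 2:])
    for opt in options
]

print("harness-style pick:", options[max(range(len(options)), key=lambda i: harness_scores[i])])
print("expected-split pick:", options[max(range(len(options)), key=lambda i: expected_scores[i])])
```

The real harness is more careful about tokenization and length handling; the sketch only aims to show the difference between the two splits.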

We were wondering why this particular split was chosen. :eyes: When comparing the two methods, we observed that scores are a bit higher with the Harness split, but the rankings seem preserved.

| Model | Harness split acc | Expected split acc |
|---|---|---|
| GPT2 | 0.5012 | 0.5036 |
| Llama-2-7b | 0.7403 | 0.6938 |
| Falcon 7b | 0.7238 | 0.6732 |
| Mistral 7b | 0.7474 | 0.7443 |

Conclusion

That's it for our questions and observations!

Thank you very much for reading this issue, and again thank you for all the work which goes into this cool library :hugs:
If you feel like any of this is not expected behavior, we'll be delighted to give you a hand with a fix if you need one - and if you feel the behavior is expected, that's also very good to know.

Phil209 commented 10 months ago

Some LLMs like Dolphin 2.1 are only scoring around 7.5 on DROP despite being much better at answering the questions than other LLMs with much higher DROP scores.

It's possible that verbose responses that take the additional step of explaining how the answer was derived are being incorrectly judged as wrong. I say this because I'm noticing a pattern: "less intelligent" LLMs that give short, bare answers are getting higher DROP scores than "more intelligent" LLMs that are actually answering more questions correctly but give more verbose, explanatory answers.

cabreraalex commented 10 months ago

For what it's worth, on DROP I noticed some similar trends:

  1. Yi-34B gets every float-based answer wrong, with answers cutting off after the period. Not sure if this is a model issue or tokenization issue (click the float label + answer button on the left here: https://hub.zenoml.com/project/cabreraalex/DROP%20OpenLLM%20Leaderboard)
  2. The number + newline issue is present too, but not common (the answer number + space button)
  3. @Phil209 I think this might be somewhat correct. The longer the answer is, with words that are not in the gold answer, the lower the F1 score will be (https://kierszbaumsamuel.medium.com/f1-score-in-nlp-span-based-qa-task-5b115a5e7d41); see the sketch after this list. You can see this with the long output button on the Zeno project. Not sure if this is a feature or a bug... ideally short and sweet would be better :)
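
To illustrate point 3, here is a tiny sketch of a bag-of-words F1 (not the exact DROP metric, which also normalizes and aligns spans first) showing how extra words hurt precision:

```python
# Tiny illustration of bag-of-words F1 (not the exact DROP metric):
# extra words in a verbose answer lower precision, and therefore F1.
def bow_f1(prediction: str, gold: str) -> float:
    pred, ref = prediction.lower().split(), gold.lower().split()
    common = sum(min(pred.count(w), ref.count(w)) for w in set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(bow_f1("4", "4"))                                       # 1.0
print(bow_f1("The answer is 4 because 7 minus 3 is 4", "4"))  # ~0.18: verbosity is penalized
```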

EDIT

Oh this is fun: I added Falcon-180B, which had a terrible DROP score, and found similar trends.

StellaAthena commented 10 months ago

Thank you for opening this issue!

In general, we follow this priority list when addressing concerns about prompting and other eval details:

  1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
  2. If there is a clear and unambiguous official implementation, use that procedure.
  3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
  4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.

These are guidelines and not rules, and can be overruled in special circumstances.

We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.

Unfortunately, we haven't always done a great job of documenting our implementation decisions. So the first steps to take are:

  1. Assess if there's a widespread practice among LLM trainers.
  2. Compare our implementation to the official WinoGrande implementation (which can be found here) and the official DROP implementation (which can be found here).
  3. See if there are records of why our implementation is the way it is, including by searching the GitHub history and our Discord server.

If anyone would like to help out with this investigation, please feel free to tackle any of the next steps and post the results here.

StellaAthena commented 10 months ago

I couldn't sleep, so I decided to look into this. Note that according to https://github.com/EleutherAI/lm-evaluation-harness/pull/213, the current DROP implementation is allegedly the official one (perhaps with an issue regarding multiple spans).

Our WinoGrande implementation appears to be unchanged from its original implementation per its file. Looking at Brown et al. (2020), they have the following to say about their evaluation protocol: [screenshots of the relevant passages from the paper]. This description appears to be consistent with our implementation, so I think it's safe to say it's the source. The question remains whether there is widespread agreement about using another method. Unfortunately, the official method is likely inapplicable, as it was not developed for causal language models.

albertqjiang commented 10 months ago

> Note that according to #213, the current DROP implementation is allegedly the official one (perhaps with an issue regarding multiple spans).
>
> Our WinoGrande implementation appears to be unchanged from its original implementation per its file. Looking at Brown et al. (2020), they have the following to say about their evaluation protocol: [screenshots of the relevant passages from the paper]. This description appears to be consistent with our implementation, so I think it's safe to say it's the source. The question remains whether there is widespread agreement about using another method. Unfortunately, the official method is likely inapplicable, as it was not developed for causal language models.

With regards to the DROP benchmark, I fully agree with Stella's point above. Further, since the benchmark was not developed for causal language models, I don't think it's appropriate to include it in the Open LLM Leaderboard, in light of two things: (1) the benchmark in its official implementation is flawed for causal language models, as evidenced by the top-ranking pretrained LLMs having such inconsistent performance on it (see the attached screenshot); (2) being one of the 7 benchmarks considered for the leaderboard, it carries a lot of weight and can skew the average by quite a lot. It's not great when the average ranking is very similar to the ranking on an unsuitable benchmark.

I want to advocate for the temporary removal of this benchmark until the community agrees on a more suitable approach.

[screenshot of leaderboard results attached]

Phil209 commented 10 months ago

I agree that DROP needs to be weighted differently, perhaps using something like a z-score transformation.
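
For concreteness, here is a rough sketch of what averaging z-scored results could look like; the numbers below are invented, purely for illustration:

```python
# Hypothetical sketch: average z-scored benchmark results instead of raw scores,
# so that a benchmark with a very different score range (like DROP) does not
# dominate or scramble the overall ranking. All numbers are made up.
import statistics

raw_scores = {
    #            ARC    HellaSwag  DROP
    "model_a": [61.0,   83.0,      44.0],
    "model_b": [58.0,   81.0,       7.5],
    "model_c": [35.0,   63.0,       5.0],
}

def z_transform(column):
    mu, sigma = statistics.mean(column), statistics.stdev(column)
    return [(v - mu) / sigma for v in column]

per_benchmark = list(zip(*raw_scores.values()))   # one column per benchmark
z_columns = [z_transform(list(col)) for col in per_benchmark]
per_model = list(zip(*z_columns))                 # back to one row per model

for name, zs in zip(raw_scores, per_model):
    print(name, round(sum(zs) / len(zs), 3))      # mean z-score per model
```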

Bear in mind that I'm just a user who doesn't program beyond simple scripts or have a deep understanding of how LLMs work. But multiple-choice tests bottom out at ~25 with random guesses, and any model of 1b parameters or larger that uses a simple web scrape and standard training methods, such as Falcon 1b, inevitably scores an average of 35+ on multiple-choice tests (e.g. Falcon 1b scored 35.07, 63.56, 25.28 and 35.96 on the previous four). So the real-world range of multiple-choice LLM tests is 35-100, and a 5 point gain on any of them results in a clearly evident boost in LLM performance.

So dropping a 0-100 test like DROP into the mix and applying a simple average with the other tests almost randomly shuffles LLM rankings in a way that doesn't represent their overall knowledge, creativity or problem-solving abilities. To make matters worse, when I shorten the passages and drop them in 0-shot, the "smarter" LLMs that scored <10 on DROP outperform many that scored >40. So it really isn't measuring skills such as comprehension, extraction and processing (with things like math and reasoning) until an LLM can precisely parse a very large expanse of text and output an answer in the exact format expected by the evaluator.

StellaAthena commented 10 months ago

In general I agree with @albertqjiang and @Phil209, though I think it would be better to have a discussion about what task(s) should be included in the benchmark on the HF page. This issue regards whether the tasks are implemented correctly. It appears that the answer is yes for Winogrande and probably for DROP, though I haven't had time to read through the code and test it yet.

Another funny thought: although we treat causal decoders as first-class citizens in this library, we do now support encoder-decoder models and decoder-only models with a finetuned head on them. Perhaps we should re-introduce the original Winogrande formulation for people who wish to study those types of models.

clefourrier commented 10 months ago

@StellaAthena Thank you for your thoughtful explanation of the Harness process for eval additions 🤗

I took some time to compare the DROP implementations (Harness vs AllenAI) side by side and can confirm that they are the same. Sadly the original repo has been archived so I can't ask them there why they made these normalization choices.

StellaAthena commented 10 months ago

@clefourrier is there anything else to discuss here then, or can I close this issue?