EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

A new DROP benchmark is needed #1050

Open StellaAthena opened 11 months ago

StellaAthena commented 11 months ago

@clefourrier has been investigating the DROP benchmark and has discovered that it has some serious problems, which are detailed in our recent joint blog post and in her previous issue https://github.com/EleutherAI/lm-evaluation-harness/issues/978. We've established that:

  1. The current implementation matches the official one.
  2. The official implementation makes systematic errors in how it handles preprocessing and grading that make it highly undesirable to use.

The purpose of this issue is to solicit input from the community on whether you use DROP to evaluate language models, how you use it, and if you have any input on how we should evaluate it going forward. We will definitely continue to support the official implementation, but this is a case where it seems like we should change the default implementation to something more reasonable.

In general, we follow this priority list for addressing concerns about prompting and other eval details:

  1. If there is widespread agreement among people who train LLMs, use the agreed upon procedure.
  2. If there is a clear and unambiguous official implementation, use that procedure.
  3. If there is widespread agreement among people who evaluate LLMs, use the agreed upon procedure.
  4. If there are multiple common implementations but not universal or widespread agreement, use our preferred option among the common implementations. As before, prioritize choosing from among the implementations found in LLM training papers.

These are guidelines and not rules, and can be overruled in special circumstances.

We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few-Shot Learners", as our original goal was specifically to compare results with that paper.

killawhale2 commented 11 months ago

@clefourrier has been investigating the DROP benchmark and has discovered that it has some serious problems, which are detailed in our recent joint blog post and in her previous issue https://github.com/EleutherAI/lm-evaluation-harness/issues/978. We've established that:

Would you mind sharing a link to the blog post?

clefourrier commented 11 months ago

We're still doing the last proofreading steps (we are not in the same time zones); it's going to be published today CET :sweat_smile:

StellaAthena commented 11 months ago

@clefourrier has been investigating the DROP benchmark and has discovered that it has some serious problems, which are detailed in our recent joint blog post and in her previous issue #978. We've established that:

Would you mind sharing a link to the blog post?

I jumped the gun a little bit... the blog post needs to reference this issue and this issue needs to reference the blog post and I figured nobody would notice if I pushed the issue a little early 😅

As soon as the blog post goes live I'll edit the link in!

clefourrier commented 11 months ago

Blog is up here

matt-gardner commented 11 months ago

The official implementation makes systematic errors in how it handles preprocessing and grading that make it highly undesirable to use.

Can you say more about this? What official implementation are you talking about, and what specifically is wrong with it? From the blog post, most of the errors mentioned looked like they were not related to the official scoring; they were related to the generation procedure.

And for the record, if there are problems with the "official" scoring, they should absolutely be fixed.

matt-gardner commented 11 months ago

I may have traced through the various github comments enough to answer my question (this comment gave me confidence I was understanding the issue correctly).

I can tell you that the DROP metric was intended to be as close as possible to the SQuAD metric, but with better handling of numbers, and with an attempt to deal with sets of answers. The particular tokenization issue that's at the heart of the `10\n\nPassage` problem is just a carryover from using the original SQuAD implementation. At the time these datasets were made, we didn't really conceive of models that would return things involving newlines, and so the metrics didn't account for them. The tokenization should be fixed.
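
To make the failure mode concrete, here is a minimal sketch (not the official script, just the same split pattern quoted further down in this thread) showing how a newline keeps the number attached to the text that follows it:

```python
import re

def _tokenize(text):
    # Same split pattern as the DROP eval script: only spaces and hyphens.
    return re.split(" |-", text)

print(_tokenize("10 passengers"))       # ['10', 'passengers'] -> '10' can match the gold answer
print(_tokenize("10\n\nPassage: The"))  # ['10\n\nPassage:', 'The'] -> '10' never appears as its own token
```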

matt-gardner commented 11 months ago

Well, actually, I take back the assertion that the tokenization came from SQuAD. I looked more closely at the SQuAD script and didn't find that tokenization in there. I also don't see that tokenization in the original evaluation code for DROP. I'm not sure at what point it got added to the computation in allennlp-models, nor what code is actually backing the official AllenAI leaderboard (I don't have access to that code at this point, if it even matters anymore). But regardless, that tokenize method is clearly just wrong.

StellaAthena commented 11 months ago

Hey Matt, thanks for coming in. This really isn't intended as a criticism of your paper so much as an observation that a system designed for a previous paradigm of language modeling research doesn't fit too well today. DROP is far from the only benchmark with this problem, but oftentimes there's a consensus way to update the procedure (see, e.g., Winogrande). That consensus seems to be missing for DROP and is what we are looking to build here.

Can you say more about this? What official implementation are you talking about, and what specifically is wrong with it? From the blog post, most of the errors mentioned looked like they were not related to the official scoring; they were related to the generation procedure.

And for the record, if there are problems with the "official" scoring, they should absolutely be fixed.

Perhaps I'm using the word "scoring" here a little loosely, but I'm talking about the generation termination criteria and the answer extraction procedure.

I also don't see that tokenization in the original evaluation code for DROP. I'm not sure at what point it got added to the computation in allennlp-models, nor what code is actually backing the official AllenAI leaderboard (I don't have access to that code at this point, if it even matters anymore). But regardless, that tokenize method is clearly just wrong.

Your link to the "original DROP code" 404s for me. Is it a private repo?

Unfortunately the paper contains almost no algorithmic detail about the implementation so we were forced to rely heavily on the codebase. I had been assuming that the backend for the leaderboard would be using the same code that the leaderboard links to... do you think we should take seriously the idea that they might differ?

matt-gardner commented 11 months ago

Yeah, no worries, I didn't take it as criticism of the paper, just making sure I understand the issue.

I looked even more deeply through the history, and I found that this is the first commit where I added the official DROP eval script to allennlp. The tokenization issue is in that commit also. So somewhere between Dheeru's original implementation (in an apparently private repo) (which I thought was what backed the metrics in the original paper, but now I'm questioning that given what I see in the allennlp repo...) and the commit that I added, someone introduced that tokenization bug. There are several candidates for who did it, including me, and I won't try to throw anyone under the bus :). But I think it's pretty clearly a bug that should be fixed, owing to us not testing cases that we didn't expect to see at the time. And given what I see now in the history, I'm confident that it's that script that is behind the leaderboard. I looked too quickly earlier today.

As far as what you should do - my vote would be to fix the bug in the eval script, doing whatever is needed to have a sane evaluation. "Fairness" with prior work doesn't really matter if the evaluation is nonsensical in the first place. If I were still in control of the DROP leaderboard I would fix it there, too (well, really, I might declare that the leaderboard has served its purpose and make the test set public, but that's a separate issue...).

yizhongw commented 11 months ago

Hi, I want to chime in and see if I can help from the AI2 side internally.

From a quick skim, it seems there are two issues:

  1. Our original eval tokenization didn't account for newlines, which were not a big deal in the reading comprehension setting but are very common when evaluating LLMs.
  2. The use of `.` as a stop token in the OpenLLM leaderboard generation breaks the generation of float numbers (see the small sketch after this comment). Based on the nice analysis in the blog post, it seems using `\n` is clearly a better choice.

I could help fix 1 if @matt-gardner and @dDua think it's a good idea. But I think 2 needs to be fixed anyway from the OpenLLM leaderboard side, and if you use \n as the stop token, fixing 1 won't make any difference.
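
As a small illustration of point 2, assuming stop sequences are applied by cutting the generation at their first occurrence (a simplification of the harness's handling, with made-up strings):

```python
def truncate_at(generation: str, stop: str) -> str:
    # Cut the generation at the first occurrence of the stop sequence.
    return generation.split(stop)[0]

generation = "12.25\n\nPassage: The next passage..."

print(truncate_at(generation, "."))   # '12'    -> the float is cut in half
print(truncate_at(generation, "\n"))  # '12.25' -> the full number survives
```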

clefourrier commented 11 months ago

Btw, the reference implementation we assumed was the original is this one, which tokenizes by splitting on `" |-"` (spaces and hyphens):

import re
from typing import List

# Note: the helper functions (_lower, _remove_punc, _normalize_number,
# _remove_articles, _white_space_fix) are defined elsewhere in the same script.

def _tokenize(text: str) -> List[str]:
    # Splits only on spaces and hyphens; newlines are never split on,
    # which is the bug discussed above.
    return re.split(" |-", text)

def _normalize_answer(text: str) -> str:
    """Lower text and remove punctuation, articles and extra whitespace."""
    parts = [
        _white_space_fix(_remove_articles(_normalize_number(_remove_punc(_lower(token)))))
        for token in _tokenize(text)
    ]
    parts = [part for part in parts if part.strip()]
    normalized = " ".join(parts).strip()
    return normalized
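
For reference, a possible fix along the lines yizhongw describes above would be to split on all whitespace as well as hyphens; a sketch, not an official patch:

```python
import re
from typing import List

def _tokenize_fixed(text: str) -> List[str]:
    # Split on any whitespace (spaces, tabs, newlines) as well as hyphens,
    # so "10\n\nPassage:" yields "10" as its own token.
    return [tok for tok in re.split(r"\s+|-", text) if tok]

print(_tokenize_fixed("10\n\nPassage: The"))  # ['10', 'Passage:', 'The']
```
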
tangbinh commented 11 months ago

I think we should carefully set stop tokens to avoid prematurely cutting off answers. In addition to DROP, many other benchmarks implemented in EleutherAI/lm-evaluation-harness are concerning to me. For instance:

Additionally, the choice of stop tokens should depend on the prompt templates. As mentioned in the blog, LLMs often generate their own questions after answering the initial query. Therefore, it is natural to truncate the answers using `Passage:` as in the case of DROP. This is important when considering correct answers that might span multiple lines or paragraphs as in the case of SCROLLS.
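
A sketch of what that template-aware truncation could look like, with a hypothetical helper and assumed delimiters (`Passage:` and `Question:`), not actual harness code:

```python
def extract_answer(generation: str, stop_sequences=("\n\nPassage:", "\n\nQuestion:")) -> str:
    # Truncate at whichever template delimiter appears first, rather than at a
    # bare newline, so multi-line answers are not cut short.
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut].strip()

gen = "The answer spans\ntwo lines.\n\nPassage: The model keeps going..."
print(extract_answer(gen))  # returns 'The answer spans\ntwo lines.'
```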

StellaAthena commented 11 months ago

The official code for GSM8k samples 100 tokens

It looks like TriviaQA doesn't define an official sampling procedure, as their code assumes you have the generations already and just does the evaluation.

I'm finding the official SCROLLS code a bit confusing but there isn't anything that is obviously a condition for stopping generation.

borgr commented 11 months ago

Sorry, a bit late to the party (EMNLP etc.). As I understand it, there are tokenization/generation problems in some examples, but we can find which ones quite easily (or at least a superset). Given that we:

  1. only care about evaluation,
  2. have the sunk cost of (2k models?) that have already been run on this, and
  3. think this evaluation is already costly,

why bother fixing anything in LM eval? The solution of just ignoring those examples solves it all (and also follows our recent efficient benchmarking work). It is quite unlikely that we need 9K examples for any evaluation, and even less so when this is only one number in a list; we can happily live with, say, 1K. This saves a lot of compute going forward, and it can reuse the results we already ran: just recompute the average over the kept examples (as sketched below), with little fear that the new implementation requires rerunning everything.
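
A minimal sketch of that reuse-and-filter idea, with hypothetical names: `per_example_f1` holds the per-example scores we already have from previous runs, and `is_affected` flags the examples hit by the known tokenization/stop-token issues:

```python
from typing import Callable, Dict

def rescore(per_example_f1: Dict[str, float], is_affected: Callable[[str], bool]) -> float:
    # Drop the affected examples and recompute the average over the kept subset;
    # no model needs to be rerun.
    kept = [score for ex_id, score in per_example_f1.items() if not is_affected(ex_id)]
    return sum(kept) / len(kept)
```
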
borgr commented 11 months ago

Also, I don't understand this: https://x.com/clefourrier/status/1730950900733112785?s=20 - if the evaluation is faulty, how (or why) can we keep the "legacy" one? Backward comparison was just invalidated, and that is a good thing. This is a bit related to Thom and Adina's (and someone else's, I think) recent work on updating errors in datasets. We suggested sharing something to make consistent preprocessing and metrics easy; we still plan to do something like that, so we can discuss it (we have been using it internally for some time, it could mature more, but this could probably be done in the open).

clefourrier commented 11 months ago

Why bother fixing anything in LM eval?

There is a big difference between fixing the DROP implementation in the harness and keeping DROP on the leaderboard.

Whether or not we chose to keep DROP on the leaderboard, it would still need to be fixed in the harness - and since we want the leaderboard to follow the harness implementation, we removed it from the leaderboard.

clefourrier commented 11 months ago

This also answers the tweet you pointed to, where my phrasing was not very good (I should have said "reference" instead of "original implementation"): we want people using the leaderboard to be able to compare their results to the reference implementation in the harness - if people publish papers with the harness evaluations, their results should be comparable to the leaderboard's.

borgr commented 11 months ago

I see the difference. If it is only for evaluation purposes, 9K examples is excessive for the harness as well; the fact that Matt happened to collect this number of examples doesn't mean we need to use all of them - it is an arbitrary sample from the unlimited number of examples he could have created. The evaluation would still be correct that way (for any example kept). The other option is to work much harder and get the evaluation metric to work on every example, and I don't see why that effort is worth it unless the metric is also used on other datasets, or the fix is very easy (and if we don't count the leaderboard in this decision, that option loses), or we assume the deletion systematically removes something of special importance (e.g. floats are especially critical).