StellaAthena opened this issue 11 months ago
@clefourrier has been investigating the DROP benchmark and has discovered that it has some serious problems, as detailed in our recent joint blog post and her earlier issue https://github.com/EleutherAI/lm-evaluation-harness/issues/978. We've established that:
Would you mind sharing a link to said blog post?
We're still doing the last proofreading steps (we are not in the same timezone); it should be published today, CET :sweat_smile:
> @clefourrier has been investigating the DROP benchmark and has discovered that it has some serious problems, as detailed in our recent joint blog post and her earlier issue #978. We've established that:

> Would you mind sharing a link to said blog post?
I jumped the gun a little bit... the blog post needs to reference this issue and this issue needs to reference the blog post and I figured nobody would notice if I pushed the issue a little early 😅
As soon as the blog post goes live I'll edit the link in!
Blog is up here
> The official implementation makes systematic errors in how it handles preprocessing and grading that make it highly undesirable to use.
Can you say more about this? What official implementation are you talking about, and what specifically is wrong with it? From the blog post, most of the errors mentioned looked like they were not related to the official scoring; they were related to the generation procedure.
And for the record, if there are problems with the "official" scoring, they should absolutely be fixed.
I may have traced through the various github comments enough to answer my question (this comment gave me confidence I was understanding the issue correctly).
I can tell you that the DROP metric was intended to be as close as possible to the SQuAD metric, but with better handling of numbers, and with an attempt to deal with sets of answers. The particular tokenization issue that's at the heart of the `10\n\nPassage` problem is just a carryover from using the original SQuAD implementation. At the time these datasets were made, we didn't really conceive of models that would return things involving newlines, and so the metrics didn't account for them. The tokenization should be fixed.
Well, actually, I take back the assertion that the tokenization came from SQuAD. I looked more closely at the SQuAD script and didn't find that tokenization in there. I also don't see that tokenization in the original evaluation code for DROP. I'm not sure at what point it got added to the computation in allennlp-models, nor what code is actually backing the official AllenAI leaderboard (I don't have access to that code at this point, if it even matters anymore). But regardless, that tokenize method is clearly just wrong.
Hey Matt, thanks for coming in. This really isn't intended as a criticism of your paper so much as an observation that a system designed for a previous paradigm of language modeling research doesn't fit too well today. DROP is far from the only benchmark with this problem, but oftentimes there's a consensus way to update the procedure (see, e.g., Winogrande). That consensus seems to be missing for DROP, and it is what we are looking to build here.
> Can you say more about this? What official implementation are you talking about, and what specifically is wrong with it? From the blog post, most of the errors mentioned looked like they were not related to the official scoring; they were related to the generation procedure.

> And for the record, if there are problems with the "official" scoring, they should absolutely be fixed.
Perhaps I'm using the word "scoring" here a little loosely, but I'm talking about the generation termination criteria and the answer extraction procedure.
> I also don't see that tokenization in the original evaluation code for DROP. I'm not sure at what point it got added to the computation in allennlp-models, nor what code is actually backing the official AllenAI leaderboard (I don't have access to that code at this point, if it even matters anymore). But regardless, that tokenize method is clearly just wrong.
Your link to the "original DROP code" 404s for me. Is it a private repo?
Unfortunately the paper contains almost no algorithmic detail about the implementation so we were forced to rely heavily on the codebase. I had been assuming that the backend for the leaderboard would be using the same code that the leaderboard links to... do you think we should take seriously the idea that they might differ?
Yeah, no worries, I didn't take it as criticism of the paper, just making sure I understand the issue.
I looked even more deeply through the history, and I found that this is the first commit where I added the official DROP eval script to allennlp. The tokenization issue is in that commit also. So somewhere between Dheeru's original implementation (in an apparently private repo, which I thought was what backed the metrics in the original paper, though now I'm questioning that given what I see in the allennlp repo...) and the commit that I added, someone introduced that tokenization bug. There are several candidates for who did it, including me, and I won't try to throw anyone under the bus :). But I think it's pretty clearly a bug that should be fixed, owing to us not testing cases that we didn't expect to see at the time. And given what I see now in the history, I'm confident that it's that script that is behind the leaderboard. I looked too quickly earlier today.
As far as what you should do - my vote would be to fix the bug in the eval script, doing whatever is needed to have a sane evaluation. "Fairness" with prior work doesn't really matter if the evaluation is nonsensical in the first place. If I were still in control of the DROP leaderboard I would fix it there, too (well, really, I might declare that the leaderboard has served its purpose and make the test set public, but that's a separate issue...).
Hi, I want to loop in to see if I can help from AI2 internally.
From a quick skim, it seems there are two issues:

1. the tokenization bug in the official DROP eval script discussed above, and
2. the use of `.` as a stopword token in the OpenLLM leaderboard generation, which breaks the generation of float numbers.

Based on the nice analysis in the blog post, it seems using `\n` is clearly a better choice. I could help fix 1 if @matt-gardner and @dDua think it's a good idea. But I think 2 needs to be fixed anyway from the OpenLLM leaderboard side, and if you use `\n` as the stop token, fixing 1 won't make any difference.
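To make issue 2 concrete, here is a minimal sketch (illustrative only, not the harness's or the leaderboard's actual truncation code, and the example generation is made up) of how the choice of stop sequence affects a numeric answer before scoring:

```python
from typing import List

def truncate_at_stop(generation: str, stop_sequences: List[str]) -> str:
    """Cut the generation at the earliest occurrence of any stop sequence."""
    cut = len(generation)
    for stop in stop_sequences:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]

generation = "12.25\n\nPassage: The Panthers scored..."

print(truncate_at_stop(generation, ["."]))   # '12'    -- the '.' stop token clips the float
print(truncate_at_stop(generation, ["\n"]))  # '12.25' -- '\n' keeps the full number
```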
Btw, the reference implementation we assumed was the original is this one, which contains the following normalisation/tokenization, splitting only on spaces and hyphens (`" |-"`):
```python
import re
from typing import List

def _tokenize(text: str) -> List[str]:
    # Note: splits only on spaces and hyphens, never on other whitespace such as "\n".
    return re.split(" |-", text)

def _normalize_answer(text: str) -> str:
    """Lower text and remove punctuation, articles and extra whitespace."""
    # _lower, _remove_punc, _normalize_number, _remove_articles and
    # _white_space_fix are helpers defined alongside these functions.
    parts = [
        _white_space_fix(_remove_articles(_normalize_number(_remove_punc(_lower(token)))))
        for token in _tokenize(text)
    ]
    parts = [part for part in parts if part.strip()]
    normalized = " ".join(parts).strip()
    return normalized
```
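To illustrate the tokenization problem Matt describes above, here is a small sketch (the generation string is made up for illustration) showing that `_tokenize` never splits on newlines, so a correct answer followed by a run-on continuation stays fused to the text after it:

```python
import re

def _tokenize(text: str):
    return re.split(" |-", text)

# A model that answers correctly and then keeps generating the next
# few-shot example produces something like this:
generation = "10\n\nPassage: The Panthers entered the game..."
gold = "10"

print(_tokenize(gold))        # ['10']
print(_tokenize(generation))  # ['10\n\nPassage:', 'The', 'Panthers', 'entered', ...]
# The downstream per-token normalization receives '10\n\nPassage:' as a
# single token rather than the bare answer '10'.
```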
I think we should carefully set stop tokens to avoid prematurely cutting off answers. In addition to DROP, many other benchmarks implemented in EleutherAI/lm-evaluation-harness are concerning to me. For instance:
- GSM8K uses `:` as a stop token (see link), yet it appears in many golden answers (see this, this, or this). Note that this benchmark is also featured in Hugging Face's Open LLM Leaderboard.
- TriviaQA uses `.` and `,` as stop tokens (see link), but these punctuation marks are also included in many answers such as `J.G. Ballard`, `Vancouver, Canada`, `202.456.1111`, etc.
- SCROLLS uses `\n` as a stop token, yet many answers in the test split of GovReport span multiple lines.

Additionally, the choice of stop tokens should depend on the prompt templates. As mentioned in the blog, LLMs often generate their own questions after answering the initial query. Therefore, it is natural to truncate the answers using `Passage:` as in the case of DROP. This is important when considering correct answers that might span multiple lines or paragraphs, as in the case of SCROLLS.
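One cheap sanity check along these lines would be to scan each task's references for its configured stop sequences; here is a minimal sketch (the helper and the example answers are illustrative, not harness code):

```python
from typing import Iterable, List

def answers_hitting_stops(gold_answers: Iterable[str], stop_sequences: List[str]) -> List[str]:
    """Return gold answers that contain a configured stop sequence, i.e.
    answers a model could never emit in full before being truncated."""
    return [ans for ans in gold_answers if any(stop in ans for stop in stop_sequences)]

print(answers_hitting_stops(
    ["J.G. Ballard", "Vancouver, Canada", "202.456.1111"], [".", ","]
))  # all three contain a stop sequence
print(answers_hitting_stops(["14", "Paris"], [".", ","]))  # []
```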
The official code for GSM8k samples 100 tokens.
It looks like TriviaQA doesn't define an official sampling procedure, as their code assumes you have the generations already and just does the evaluation.
I'm finding the official SCROLLS code a bit confusing but there isn't anything that is obviously a condition for stopping generation.
Sorry, a bit late to the party (EMNLP etc.). As I understand it, there are tokenization/generation problems in some examples, but we can find which ones quite easily (or at least a superset). As we
Also, I don't understand this: https://x.com/clefourrier/status/1730950900733112785?s=20 If the evaluation is faulty, how can we keep the "legacy" one, and why would we even want to? Backward comparison was just invalidated, and that is a good thing. This is a bit related to Thom and Adina's (and someone else's, I think) recent work on updating errors in datasets. We suggested sharing something to make consistent preprocessing and metrics easy; we still plan to do something like that, so we can discuss (we have used it internally for some time, it could mature more, but this could probably be done in the open).
Why bother fixing anything in LM eval at all?
There is a big difference between:
Whether or not we chose to keep DROP in the leaderboard, it would still need to be fixed in the harness - and as we want the leaderboard to follow the harness implementation, we removed it from the leaderboard.
This also answers the tweet you pointed to, where my phrasing was not very good (I should have said "reference" instead of "original implementation"): we want people using the leaderboard to be able to compare their results to the reference implementation in the harness - if people publish papers with the harness evaluations, their results should be comparable to the leaderboard's.
I see the difference. If it is only for evaluation purposes, 9K examples is excessive for the harness as well; the fact that Matt happened to end up with this number of examples doesn't mean we need to use all of them, since they are an arbitrary sample from the unlimited number of examples he could have created. The evaluation would be correct that way (for any example kept). The other option is to work much harder and get the evaluation metric to work on every example. I don't see why that effort is worth it unless the metric is also used on other datasets, the fix is very easy (and the leaderboard, which loses DROP either way, shouldn't count in this decision), or we assume the deletion systematically removes something of special importance (e.g. floats are especially critical).
@clefourrier has been investigating the DROP benchmark and has discovered that it has some serious problems, as detailed in our recent joint blog post and her earlier issue https://github.com/EleutherAI/lm-evaluation-harness/issues/978. We've established that:
The purpose of this issue is to solicit input from the community on whether you use DROP to evaluate language models, how you use it, and if you have any input on how we should evaluate it going forward. We will definitely continue to support the official implementation, but this is a case where it seems like we should change the default implementation to something more reasonable.
In general, we follow the following priority list for addressing concerns about prompting and other eval details:
These are guidelines and not rules, and can be overruled in special circumstances.
We try to prioritize agreement with the procedures used by other groups to decrease the harm when people inevitably compare runs across different papers despite our discouragement of the practice. Historically, we also prioritized the implementation from "Language Models are Few Shot Learners" as our original goal was specifically to compare results with that paper.