EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Bug in Leaderboard IFEval Code #2260

Open noowad93 opened 2 months ago

noowad93 commented 2 months ago

https://github.com/EleutherAI/lm-evaluation-harness/blob/8138fd52437dcd8c76ac87bdc9d684840e794c42/lm_eval/tasks/leaderboard/ifeval/instructions.py#L1384

The updated IFEval dataset (https://www.oxen.ai/wis-k/instruction-following-eval/file/main/instruction-following-eval_train.parquet) now includes non-letter characters such as "!" and "#" in the letter field. These characters make the condition at the referenced line evaluate true, so they get replaced with random characters, and the check no longer matches what the dataset expects.
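A minimal sketch of the failure mode being described, assuming the checker falls back to a random letter whenever the supplied keyword is not a single ASCII letter (the function name `sanitize_letter` is hypothetical, not the actual name in `instructions.py`):

```python
import random
import string

def sanitize_letter(letter):
    """Sketch of the validity check described above: if the supplied
    keyword is not a single ASCII letter, fall back to a random one.
    Characters such as "!" or "#" fail the check and are silently
    replaced, so the instruction no longer matches the dataset's
    expected answer."""
    if letter is None or len(letter) != 1 or letter.lower() not in string.ascii_lowercase:
        return random.choice(string.ascii_lowercase)
    return letter

print(sanitize_letter("a"))  # passes through unchanged
print(sanitize_letter("!"))  # silently replaced by a random letter
```

Under this reading, the fix would be either to accept the characters the new dataset actually uses or to pin the task to a dataset that only contains ASCII letters.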

haileyschoelkopf commented 2 months ago

ccing @NathanHB @clefourrier for discretion over changing the official leaderboard IFEval task definition!

#2218 from @lewtun updated the dataset to google/IFEval for our non-leaderboard ifeval task--perhaps that change should be carried over to the leaderboard variant regardless of whether it fixes this issue as well?