One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
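The key idea is that each instruction is verifiable with simple, deterministic code rather than a human or LLM judge. A minimal sketch of such checks, assuming the two example instructions quoted above (function names are hypothetical and this is not the official implementation):

```python
import re

def check_min_words(response: str, min_words: int) -> bool:
    # "write in more than 400 words": count whitespace-separated tokens.
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    # "mention the keyword of AI at least 3 times": case-insensitive
    # whole-word matches.
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE)
    return len(matches) >= min_count

def evaluate_prompt(response: str, checks) -> bool:
    # A prompt may carry several verifiable instructions; it passes
    # only if every attached check passes.
    return all(check(response) for check in checks)
```

Because each check is a pure function of the response text, the evaluation is cheap, fully reproducible, and free of evaluator-model bias.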