One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
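The key idea is that each instruction is verifiable with simple, deterministic code rather than a human or LLM judge. A minimal sketch of such checks, assuming the two example instructions quoted above (function names are hypothetical and this is not the official implementation):

```python
import re

def check_min_words(response: str, min_words: int) -> bool:
    # "write in more than 400 words": count whitespace-separated tokens.
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    # "mention the keyword of AI at least 3 times": case-insensitive
    # whole-word matches.
    matches = re.findall(rf"\b{re.escape(keyword)}\b", response, re.IGNORECASE)
    return len(matches) >= min_count

def evaluate_prompt(response: str, checks) -> bool:
    # A prompt may carry several verifiable instructions; it passes
    # only if every attached check passes.
    return all(check(response) for check in checks)
```

Because each check is a pure function of the response text, the evaluation is cheap, fully reproducible, and free of evaluator-model bias.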