URL

https://arxiv.org/pdf/2408.02666
Affiliations
- Tianlu Wang, N/A
- Ilia Kulikov, N/A
- Olga Golovneva, N/A
- Ping Yu, N/A
- Weizhe Yuan, N/A
- Jane Dwivedi-Yu, N/A
- Richard Yuanzhe Pang, N/A
- Maryam Fazel-Zarandi, N/A
- Jason Weston, N/A
- Xian Li, N/A
Abstract
- Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to improve evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
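
The data-collection step of one iteration, and the majority vote mentioned alongside the 88.7 figure, might be sketched roughly as below. This is a minimal illustration and not the authors' implementation: `make_pair` and `judge` are hypothetical callables standing in for LLM calls, and the filtering rule (keep a sampled judgment only when its verdict agrees with the synthetic label) is an assumption that the abstract does not spell out. Response order is fixed here for simplicity; a real setup would likely randomize it.

```python
from collections import Counter
from typing import Callable, List, Tuple

# A judgment is (reasoning trace, verdict), where the verdict is "A" or "B".
Judgment = Tuple[str, str]


def majority_vote(verdicts: List[str]) -> str:
    """Return the most frequent verdict among several sampled judgments."""
    return Counter(verdicts).most_common(1)[0][0]


def collect_training_examples(
    instructions: List[str],
    make_pair: Callable[[str], Tuple[str, str]],    # hypothetical: returns (better, worse)
    judge: Callable[[str, str, str], Judgment],     # hypothetical: current LLM-as-a-Judge
    n_samples: int,
) -> List[Tuple[str, str, str, str, str]]:
    """Build synthetic training data for the next judge iteration.

    Assumption: the 'better' response is always shown as response A, so a
    sampled judgment is kept only if its verdict is "A".
    """
    kept = []
    for instruction in instructions:
        better, worse = make_pair(instruction)
        for _ in range(n_samples):
            reasoning, verdict = judge(instruction, better, worse)
            if verdict == "A":
                kept.append((instruction, better, worse, reasoning, verdict))
                break  # one accepted reasoning trace per instruction
    return kept
```

Under these assumptions, the examples returned by `collect_training_examples` would be used to fine-tune the judge, and the fine-tuned model would then serve as `judge` in the next iteration.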
Translation (by gpt-4o-mini)
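Model-based evaluation is central to successful model development, serving both as a reward model for training and as a substitute for human evaluation. The standard approach to training such evaluators is to collect a large number of human preference judgments over model responses, but this is costly, and the data becomes stale as models improve. This work proposes an approach that aims to improve evaluators without human annotations, using only synthetic training data. Starting from unlabeled instructions, the iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments. This training is repeated at each new iteration using the improved predictions. Without any labeled preference data, the Self-Taught Evaluator improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.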
Summary (by gpt-4o-mini)
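This work proposes an approach for improving evaluators without human annotations. Using only synthetic training data, an iterative self-improvement scheme trains an LLM as an evaluator (LLM-as-a-Judge). This raises the RewardBench score of Llama3-70B-Instruct from 75.4 to 88.3, surpassing GPT-4.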

Self-Taught Evaluators, Tianlu Wang+, N/A, arXiv'24 #1464
