Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgment. We propose BLEURT, a learned evaluation metric for English based on BERT. BLEURT can model human judgment with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG data set. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

Summary (by gpt-3.5-turbo)
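BLEURT is a learned evaluation metric based on BERT, designed to track human judgment more closely than metrics such as BLEU and ROUGE. It models human judgment from a few thousand training examples, which may themselves be biased, and relies on pre-training with millions of synthetic examples to help the model generalize. BLEURT achieves state-of-the-art results on the WMT Metrics shared task and the WebNLG data set, and it continues to perform well when training data is scarce or out-of-distribution.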
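
The abstract does not say how the synthetic pre-training examples are built, so the following is only a minimal sketch of the general idea: perturb a reference sentence and label the resulting pair with a cheap automatic score. The perturbation (random word dropping), the label (unigram-overlap F1), and the names `drop_words`, `unigram_f1`, and `make_synthetic_pairs` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def drop_words(sentence: str, drop_prob: float, rng: random.Random) -> str:
    """Randomly delete words to create a degraded copy (assumed perturbation)."""
    words = sentence.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    # Keep at least one word so a non-empty input never becomes empty.
    return " ".join(kept if kept else words[:1])


def unigram_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1, used here as a cheap automatic label (assumed signal)."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def make_synthetic_pairs(sentences, drop_prob=0.3, seed=0):
    """Return (reference, perturbed, label) triples for pre-training."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    pairs = []
    for ref in sentences:
        cand = drop_words(ref, drop_prob, rng)
        pairs.append((ref, cand, unigram_f1(ref, cand)))
    return pairs


if __name__ == "__main__":
    # Toy sentences invented for this sketch.
    corpus = [
        "the cat sat on the mat",
        "text generation has made significant advances",
    ]
    for ref, cand, label in make_synthetic_pairs(corpus):
        print(f"{label:.2f}  {ref!r} -> {cand!r}")
```

In a setup like this, a BERT-style regressor would first be trained to predict such automatic labels on a large number of pairs, and only afterwards fine-tuned on the few thousand human-rated examples mentioned in the abstract.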

BLEURT: Learning Robust Metrics for Text Generation, Sellam+, ACL'20 #944
