TRUE: Re-evaluating Factual Consistency Evaluation, Or Honovich+, N/A, the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering'22 #962
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.
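As a rough illustration of the example-level meta-evaluation protocol the abstract describes, the sketch below scores each (grounding text, generated text) pair with an off-the-shelf NLI model and compares the scores against binary human consistency labels via ROC AUC. The model choice (`roberta-large-mnli`) and the toy examples are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: NLI-based factual consistency scoring plus
# example-level meta-evaluation against binary human labels.
# Assumes `transformers`, `torch`, and `scikit-learn` are installed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import roc_auc_score

MODEL_NAME = "roberta-large-mnli"  # illustrative choice, not the paper's exact model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def nli_consistency_score(grounding: str, generated: str) -> float:
    """Probability that the grounding text entails the generated text."""
    inputs = tokenizer(grounding, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

# Toy annotated examples: (grounding, generated, human consistency label).
examples = [
    ("The cat sat on the mat.", "A cat was on the mat.", 1),
    ("The cat sat on the mat.", "The dog chased a ball.", 0),
]
scores = [nli_consistency_score(g, h) for g, h, _ in examples]
labels = [y for _, _, y in examples]
print("example-level ROC AUC:", roc_auc_score(labels, scores))
```

Treating meta-evaluation as per-example binary classification (here summarized by ROC AUC) is what makes results comparable across the 11 datasets, in contrast to the system-level correlations used in earlier protocols.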