How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores which are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results from the analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.
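The two proposed changes can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which correlation coefficient is used or how a "small" score difference is chosen, so the Kendall-style pairwise statistic, the `max_gap` threshold, the helper names `system_scores` and `pairwise_tau`, and all numbers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def system_scores(per_summary_scores):
    """Average per-summary scores into a single score per system."""
    return {name: mean(scores) for name, scores in per_summary_scores.items()}


def pairwise_tau(metric, human, max_gap=None):
    """Kendall-style rank agreement over system pairs.

    If max_gap is given, only pairs whose metric scores differ by at most
    max_gap are counted. Returns None when no pair qualifies.
    """
    concordant = discordant = total = 0
    for a, b in combinations(sorted(metric), 2):
        metric_diff = metric[a] - metric[b]
        if max_gap is not None and abs(metric_diff) > max_gap:
            continue
        total += 1
        agreement = metric_diff * (human[a] - human[b])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return None if total == 0 else (concordant - discordant) / total


# Hypothetical toy data, not results from the paper.
# Change 1: metric scores cover the full test set (4 summaries per system here),
# while human scores exist only for a judged subset (2 summaries per system).
metric_full_test_set = {
    "A": [0.30, 0.32, 0.31, 0.29],
    "B": [0.31, 0.33, 0.30, 0.32],
    "C": [0.40, 0.42, 0.41, 0.39],
    "D": [0.20, 0.22, 0.21, 0.19],
}
human_judged_subset = {
    "A": [3.5, 3.7],
    "B": [3.2, 3.4],
    "C": [4.0, 4.2],
    "D": [2.0, 2.2],
}

metric = system_scores(metric_full_test_set)
human = system_scores(human_judged_subset)

# Change 2: compare the correlation over all pairs with the correlation
# restricted to pairs separated by a small metric-score difference.
tau_all = pairwise_tau(metric, human)
tau_close = pairwise_tau(metric, human, max_gap=0.05)
print(f"all pairs: {tau_all:.2f}, close pairs only: {tau_close:.2f}")
```

In this toy data the only two systems with nearly equal metric scores are ranked in the opposite order by humans, so the restricted statistic is far lower than the all-pairs one. That is the kind of gap the second proposal is meant to expose.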

Translation (by gpt-3.5-turbo)

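How accurately an automatic summarization evaluation metric reproduces human judgments of summary quality is quantified by system-level correlations. This work identifies two ways in which the definition of the system-level correlation is inconsistent with how metrics are actually used to evaluate systems, and proposes changes to correct the mismatch. First, the system score for an automatic metric is computed on the full test set, rather than on the human-judged subset of summaries that is currently standard. This small change is shown to yield more precise estimates of system-level correlations. Second, correlations are computed only over pairs of systems separated by small differences in automatic scores, which are common in practice. Under this setting, the best estimate of the correlation between ROUGE and human judgments is close to 0 in realistic scenarios. The analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.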
Summary (by gpt-3.5-turbo)
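This work proposes two changes that correct inconsistencies in how system-level correlations of automatic summarization evaluation metrics are defined. Specifically, it computes each system's automatic metric score on the full test set, and it computes correlations only over pairs of systems separated by the small automatic-score differences commonly seen in practice. The first change yields more precise correlation estimates; the second shows that ROUGE's correlation with human judgments is near 0 in realistic scenarios. The analyses point to the need for more high-quality human judgments.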

AkihikoWatanabe / paper_notes

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics, Deutsch+, NAACL'22 #952
