AkihikoWatanabe commented 1 year ago

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.

Translation (by gpt-3.5-turbo)

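Abstract: The majority of NLG (natural language generation) evaluation relies on automatic metrics such as BLEU. This paper argues for the need for new automatic evaluation methods that are independent of system and data. We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and show that they only weakly reflect human judgements of system outputs generated by data-driven, end-to-end NLG. We also show that metric performance depends on the data and the system. Nevertheless, the results suggest that automatic metrics are reliable at the system level and can support system development, in particular by finding cases where a system performs poorly.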
Summary (by gpt-3.5-turbo)
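Automatic metrics are widely used to evaluate NLG, but this work argues for new evaluation methods that do not depend on the system or the data. It surveys a wide range of metrics and shows that they only weakly reflect human judgements of outputs generated by data-driven NLG systems. It also shows that metric performance depends on the data and the system, while suggesting that automatic metrics are reliable at the system level and can support system development, in particular by finding cases where a system performs poorly.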

AkihikoWatanabe commented 1 year ago

A study pointing out that existing NLG metrics do not correlate very well with human judgements.
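
To make "correlation with human judgements" concrete, here is a minimal sketch of how such a check could look, assuming Spearman rank correlation as the measure (the notes here do not say which coefficient is used). The scores and ratings are made-up illustrative values, not results from the paper, and the helper names `ranks`, `pearson`, and `spearman` are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one automatic metric score and one human rating
# per generated sentence (illustrative values only).
metric_scores = [0.42, 0.35, 0.51, 0.28, 0.47, 0.39]
human_ratings = [4, 5, 3, 4, 5, 2]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho between metric and human ratings: {rho:.3f}")
```

A rho near 0 would mean the metric's ranking of outputs says little about how humans rank them, which is the kind of weak sentence-level agreement the abstract describes. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of hand-written helpers.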

Why We Need New Evaluation Metrics for NLG, EMNLP'17 #989
