AkihikoWatanabe commented 1 year ago

https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation

AkihikoWatanabe commented 1 year ago

The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations; 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics; 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format; 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics; and 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/Daily Mail dataset annotated by both expert judges and crowd-source workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.

Summary (by gpt-3.5-turbo)
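The lack of comprehensive studies on evaluation methods for text summarization, and of agreed evaluation protocols, is hindering progress. This work addresses the problem along five dimensions: re-evaluating automatic evaluation metrics, benchmarking summarization models, providing model summaries in a unified format, implementing an evaluation toolkit, and sharing an annotated dataset. It aims to contribute to better evaluation protocols for text summarization and to the development of evaluation metrics that correlate more closely with human judgments.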

AkihikoWatanabe commented 1 year ago

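The paper shows that automatic evaluation metrics do not reach the level of human evaluation, and in the end almost no automatic metric outperformed ROUGE. Looking at Kendall's tau against human judgments, the only exceptions were chrF on Coherence and Relevance and METEOR on Fluency. Also, LEAD-3 is, as expected, quite a strong baseline; the only models that outperformed LEAD-3 were BART and PEGASUS.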
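
For reference, here is a minimal sketch of how a Kendall's tau between an automatic metric and human ratings can be computed. It is plain Python and implements the tau-b variant (which corrects for ties); whether the paper used exactly this variant or this aggregation level is an assumption on my part. The function name `kendall_tau_b` and the scores below are made up for illustration and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists.

    Counts concordant/discordant pairs in O(n^2); fine for small n.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")

    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1

    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denom == 0:
        return float("nan")  # undefined when one list is constant
    return (concordant - discordant) / denom


# Hypothetical scores for five systems (illustration only).
human_scores = [1, 2, 3, 4, 5]              # e.g. mean expert rating
metric_scores = [0.1, 0.3, 0.2, 0.5, 0.4]   # e.g. some automatic metric

print(kendall_tau_b(human_scores, metric_scores))
```

With these made-up numbers, 8 of the 10 system pairs are ranked in the same order by both lists and 2 are swapped, so the correlation is (8 - 2) / 10. In practice `scipy.stats.kendalltau` also computes tau-b by default and is the more convenient choice for larger inputs.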

SummEval: Re-evaluating Summarization Evaluation, Fabbri+, TACL'21 #984
