AkihikoWatanabe commented 1 year ago

https://aclanthology.org/2020.acl-main.454/

AkihikoWatanabe commented 1 year ago

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.

Translation (by gpt-3.5-turbo)

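Neural abstractive summarization models tend to generate content that contradicts the source document. Existing automatic evaluation metrics fail to capture such errors effectively. We address the problem of evaluating the faithfulness of generated summaries. First, we collected human annotations of faithfulness for the outputs of numerous models on two datasets. We found that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document tend to be less faithful. Next, we propose FEQA, an automatic question answering (QA) based metric for faithfulness that leverages recent advances in reading comprehension. Taking question-answer pairs generated from the summary as input, a QA model extracts answers from the document; answers that do not match indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on abstractive summaries.
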
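Summary (by gpt-3.5-turbo)

To evaluate the faithfulness of neural abstractive summarization models, the authors collected human annotations and proposed FEQA, an automatic faithfulness metric. FEQA uses question answering to assess the faithfulness of a summary and shows high correlation with human judgments, especially on abstractive summaries.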

AkihikoWatanabe commented 1 year ago

FEQA

AkihikoWatanabe commented 1 year ago

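A method that generates questions from the generated summary. Precision-oriented: it checks whether what the summary states is supported by the source document, not how much of the source the summary covers.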
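
The scoring step described in the abstract (answer the summary-derived questions against the document, then compare answers) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: question generation is assumed to have already happened, the match measure is assumed to be token-level F1, and `toy_qa` is a hypothetical lookup table standing in for a real reading-comprehension model.

```python
from collections import Counter
from typing import Callable, List, Tuple


def token_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between two answer strings (lowercased, whitespace split)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def faithfulness_score(
    qa_pairs: List[Tuple[str, str]],
    document: str,
    answer_from_document: Callable[[str, str], str],
) -> float:
    """Average match between summary answers and answers found in the document."""
    if not qa_pairs:
        return 0.0
    scores = [
        token_f1(answer_from_document(question, document), summary_answer)
        for question, summary_answer in qa_pairs
    ]
    return sum(scores) / len(scores)


# Hypothetical stand-in for a QA model: a fixed lookup table (assumption).
def toy_qa(question: str, document: str) -> str:
    lookup = {
        "Who won the match?": "the home team",
        "Where was the match played?": "in Osaka",
    }
    return lookup.get(question, "")


document = "The home team won the match, which was played in Osaka."
# (question, answer) pairs as they would be generated from a summary.
qa_pairs = [
    ("Who won the match?", "the home team"),      # supported by the document
    ("Where was the match played?", "in Tokyo"),  # not supported by the document
]
score = faithfulness_score(qa_pairs, document, toy_qa)
```

Because the questions come only from the summary, a low score flags content the document does not support, which is why the metric is precision-oriented.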

AkihikoWatanabe / paper_notes

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization, Durmus+, ACL'20 #991
