AkihikoWatanabe commented 1 year ago

https://aclanthology.org/2021.emnlp-main.619/

AkihikoWatanabe commented 1 year ago

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted Q2, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of Q2 against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.

Translation (by gpt-3.5-turbo)

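Neural knowledge-grounded generative models for dialogue often generate content that is factually inconsistent with the knowledge they rely on, which makes them unreliable and limits their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue that uses automatic question generation and question answering. Our metric, Q2, compares answer spans using natural language inference (NLI) rather than the token-based matching used in previous work. To support proper evaluation, we build a new dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. Using this dataset and two others, we run a meta-evaluation of Q2 against other metrics and show that its correlation with human judgements is consistently higher.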
Summary (by gpt-3.5-turbo)
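This study proposes an automatic evaluation metric for factual consistency, based on automatic question generation and question answering, to address the limited reliability and applicability of neural knowledge-grounded dialogue generation models. The metric compares answer spans using natural language inference, which gives a better evaluation than the earlier token-based matching. The authors also built a new dataset with manual annotations for factual consistency and ran a meta-evaluation against other metrics. The proposed method showed high correlation with human judgements.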

AkihikoWatanabe commented 10 months ago

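A QGQA method that evaluates factual consistency for knowledge-grounded dialogue in a reference-free way. Factual consistency evaluation has been studied mainly in machine translation and abstractive summarization, but dialogue contains many elements, such as

dialogue history, personal opinions, questions to the user, and chit-chat

for which consistency with external knowledge is not a meaningful criterion, which makes the task more challenging. The authors also argue that, because dialogue is inherently an open-ended task, reference-based methods are not realistic and a reference-free method is needed.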

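The method works roughly as follows. Question Generation (QG) is run on the dialogue system's response (the utterance being evaluated) to produce question-answer candidate pairs. Each generated question is then answered from the grounding knowledge (QA), and factual consistency is measured by comparing that answer with the answer candidate (using NLI rather than token matching, per the abstract).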
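The QG, QA, and compare flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: every function name is hypothetical, the `toy_*` stand-ins only handle four-digit years, and exact string matching stands in for the NLI comparison that Q2 uses. Scoring an unanswerable question as 0.0 and averaging the per-question scores are assumptions of this sketch.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; a real system would plug in trained QG, QA and NLI models.
QuestionGenerator = Callable[[str], List[Tuple[str, str]]]  # response -> [(question, answer candidate)]
QuestionAnswerer = Callable[[str, str], Optional[str]]      # (question, knowledge) -> answer or None
AnswerComparer = Callable[[str, str, str], float]           # (question, candidate, knowledge answer) -> [0, 1]


def qgqa_consistency(
    response: str,
    knowledge: str,
    generate: QuestionGenerator,
    answer: QuestionAnswerer,
    compare: AnswerComparer,
) -> Optional[float]:
    """Average agreement between answer candidates taken from the response
    and answers to the same questions obtained from the grounding knowledge."""
    pairs = generate(response)
    if not pairs:
        # Nothing checkable against knowledge (e.g. chit-chat or an opinion).
        return None
    scores = []
    for question, candidate in pairs:
        knowledge_answer = answer(question, knowledge)
        if knowledge_answer is None:
            scores.append(0.0)  # assumption: unanswerable counts as inconsistent
        else:
            scores.append(compare(question, candidate, knowledge_answer))
    return sum(scores) / len(scores)


YEAR = re.compile(r"\b\d{4}\b")


def toy_generate(response: str) -> List[Tuple[str, str]]:
    """Toy QG: each four-digit year becomes an answer candidate,
    and the response with that year blanked out becomes the question."""
    return [
        (response[: m.start()] + "____" + response[m.end():], m.group())
        for m in YEAR.finditer(response)
    ]


def toy_answer(question: str, knowledge: str) -> Optional[str]:
    """Toy QA: return the first four-digit year in the knowledge, if any."""
    m = YEAR.search(knowledge)
    return m.group() if m else None


def toy_compare(question: str, candidate: str, knowledge_answer: str) -> float:
    """Toy comparison: exact match after normalisation (Q2 uses NLI here)."""
    return 1.0 if candidate.strip().lower() == knowledge_answer.strip().lower() else 0.0


if __name__ == "__main__":
    knowledge = "The Eiffel Tower was completed in 1889."
    for response in ("It was finished in 1889.", "It was finished in 1925.", "I love Paris!"):
        print(response, qgqa_consistency(response, knowledge, toy_generate, toy_answer, toy_compare))
```

Returning `None` when no question can be generated keeps responses with nothing to verify (the dialogue history, opinions, questions, and chit-chat mentioned above) separate from responses that are actually inconsistent with the knowledge.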

Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering, Honovich+, EMNLP'21 #966
