AkihikoWatanabe commented 1 year ago

https://virtual2023.aclweb.org/paper_P3880.html#paper

AkihikoWatanabe commented 1 year ago

Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer modules, using pre-trained models from existing literature, thus it can be used without any further training. We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question. Additionally, RQUGE is shown to be more robust to several adversarial corruptions. Furthermore, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on synthetic data generated by a question generation model and reranked by RQUGE.

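Summary (by gpt-3.5-turbo)

Existing metrics for evaluating generated questions have several shortcomings, so this work proposes a new metric, RQUGE. RQUGE scores a candidate question by its answerability given the context, and is shown to correlate more highly with human judgment without relying on reference questions. RQUGE is also robust to adversarial corruptions. In addition, fine-tuning QA models on synthetic data that a question generation model produces and RQUGE reranks improves their performance on out-of-domain datasets.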

AkihikoWatanabe commented 1 year ago

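Overview

Automatic metrics for question generation (e.g. ROUGE, BERTScore) give a high score when the candidate matches the reference on the surface or in meaning, but they have the following shortcomings:

They require a large number of human-written reference questions.
They penalize questions that are correct but not lexically or semantically close to the reference.

=> The paper therefore proposes RQUGE, a metric that evaluates a question by its answerability with respect to the context.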

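Similarity-based metrics give a low score to a valid question such as Q1 when it has no lexical overlap with the reference. Even a paraphrase of the reference, such as Q2, ends up with a low score. Conversely, reference-based methods cannot detect that a question such as Q3 has become unacceptable, because its change from the reference is only minor.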
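The overlap problem can be illustrated with a minimal sketch. The context, the three questions, and the `unigram_f1` helper below are hypothetical and are not taken from the paper; the helper is a simple unigram-overlap F1 standing in for metrics such as ROUGE, not an implementation of any of them.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only word tokens."""
    return re.findall(r"\w+", text.lower())


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference question."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data (assumed for illustration only).
context = "The bridge was completed in 1932."
reference = "When was the bridge completed?"
valid_question = "In which year did construction finish?"  # answerable, no overlap
corrupted_question = "When was the tunnel completed?"      # not answerable from context

print(round(unigram_f1(valid_question, reference), 2))      # 0.0
print(round(unigram_f1(corrupted_question, reference), 2))  # 0.8
```

The valid question shares no tokens with the reference and scores 0.0, while the corrupted question differs by one word and scores about 0.8, which is the opposite of the desired ranking.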

Method overview

Given a context and an answer span, the proposed method computes an acceptability score with a QA module and a Span Scorer module, which makes the metric reference-free.

The QA model predicts an answer span from the context and the generated question. The paper uses UnifiedQAv2, a T5-based model.

The Span Scorer module predicts a score in the range [1, 5] from the predicted answer span, the candidate question, the context, and the gold span. It is an encoder-only, BERT-based model (RoBERTa in the paper).
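The two-stage flow described above can be sketched as follows. This is only an outline of the data flow under stated assumptions: `rquge_style_score`, `toy_qa_model`, and `toy_span_scorer` are hypothetical names, and the toy functions are trivial rule-based stand-ins for UnifiedQAv2 and the RoBERTa-based span scorer, not the actual models or their APIs.

```python
import re
from typing import Callable

# (context, question) -> predicted answer span
QAModel = Callable[[str, str], str]
# (predicted span, candidate question, context, gold span) -> score in [1, 5]
SpanScorer = Callable[[str, str, str, str], float]


def rquge_style_score(
    context: str,
    question: str,
    gold_span: str,
    qa_model: QAModel,
    span_scorer: SpanScorer,
) -> float:
    """Score a candidate question without any reference question."""
    # Step 1: the QA module answers the candidate question from the context.
    predicted_span = qa_model(context, question)
    # Step 2: the span scorer compares the predicted span with the gold span.
    return span_scorer(predicted_span, question, context, gold_span)


def toy_qa_model(context: str, question: str) -> str:
    """Toy stand-in (assumption): answers date questions with the first 4-digit number."""
    asks_for_date = re.search(r"\b(when|year)\b", question.lower())
    match = re.search(r"\b\d{4}\b", context)
    return match.group(0) if asks_for_date and match else ""


def toy_span_scorer(predicted: str, question: str, context: str, gold: str) -> float:
    """Toy stand-in (assumption): 5.0 on exact match with the gold span, else 1.0."""
    return 5.0 if predicted.strip().lower() == gold.strip().lower() else 1.0


# Hypothetical example data.
context = "The bridge was completed in 1932."
gold_span = "1932"

good = rquge_style_score(
    context, "In which year did construction finish?", gold_span,
    toy_qa_model, toy_span_scorer,
)
bad = rquge_style_score(
    context, "Who designed the bridge?", gold_span,
    toy_qa_model, toy_span_scorer,
)
print(good, bad)  # 5.0 1.0
```

Passing the two modules in as callables keeps the sketch independent of any particular model library; real pre-trained models could be wrapped to match the same signatures.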

RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question, ACL'23 #890
