TRUE: Re-evaluating Factual Consistency Evaluation, Or Honovich+, N/A, the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering'22 #962
Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.
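As a rough illustration of the example-level meta-evaluation protocol the abstract describes, the sketch below scores each (grounding text, generated text) pair with an off-the-shelf NLI model and compares the scores against binary human consistency labels via ROC AUC. The model choice (`roberta-large-mnli`) and the toy examples are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: NLI-based factual consistency scoring plus
# example-level meta-evaluation against binary human labels.
# Assumes `transformers`, `torch`, and `scikit-learn` are installed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sklearn.metrics import roc_auc_score

MODEL_NAME = "roberta-large-mnli"  # illustrative choice, not the paper's exact model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def nli_consistency_score(grounding: str, generated: str) -> float:
    """Probability that the grounding text entails the generated text."""
    inputs = tokenizer(grounding, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

# Toy annotated examples: (grounding, generated, human consistency label).
examples = [
    ("The cat sat on the mat.", "A cat was on the mat.", 1),
    ("The cat sat on the mat.", "The dog chased a ball.", 0),
]
scores = [nli_consistency_score(g, h) for g, h, _ in examples]
labels = [y for _, _, y in examples]
print("example-level ROC AUC:", roc_auc_score(labels, scores))
```

Treating meta-evaluation as per-example binary classification (here summarized by ROC AUC) is what makes results comparable across the 11 datasets, in contrast to the system-level correlations used in earlier protocols.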