Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. Using the mFACE dataset, we also show that our method generalizes to multilingual scenarios. Finally, we release a large-scale synthetic dataset with 1.4M examples generated using TrueTeacher.
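The abstract describes the pipeline only at a high level: generate diverse model-produced summaries, label each one for factual consistency with an LLM teacher, and train a compact student on the resulting data. The following Python sketch is purely illustrative of that flow under stated assumptions; the model names, the prompt wording, and the `query_llm` helper are placeholders, not the paper's actual setup.

```python
from transformers import pipeline

# Step 1 (illustrative): produce summaries with several summarization
# models of varying capacity, so the synthetic data reflects the kinds
# of errors real model-generated summaries contain.
summarizers = [
    pipeline("summarization", model="sshleifer/distilbart-cnn-12-6"),
    pipeline("summarization", model="google/pegasus-xsum"),
]

# Step 2 (illustrative prompt): ask an LLM teacher for a binary
# factual-consistency judgment on each (document, summary) pair.
LABEL_PROMPT = (
    "Document:\n{doc}\n\n"
    "Summary:\n{summary}\n\n"
    "Is the summary factually consistent with the document? "
    "Answer 'yes' or 'no'."
)

def query_llm(prompt: str) -> str:
    """Placeholder for the LLM teacher; wire this to whatever LLM API
    you have access to. This helper is an assumption, not an API from
    the paper."""
    raise NotImplementedError

def llm_label(doc: str, summary: str) -> int:
    """Map the teacher's yes/no answer to a binary consistency label."""
    answer = query_llm(LABEL_PROMPT.format(doc=doc, summary=summary))
    return 1 if answer.strip().lower().startswith("yes") else 0

def make_synthetic_examples(documents):
    """Step 3: yield (document, summary, label) triples, which serve as
    synthetic NLI-style training data for the student model."""
    for doc in documents:
        for summarizer in summarizers:
            summary = summarizer(doc, truncation=True)[0]["summary_text"]
            yield doc, summary, llm_label(doc, summary)
```

Because the labels come from real model-generated summaries rather than perturbed human-written ones, the resulting training distribution is closer to the summaries an evaluator sees at test time, which is the core design choice the abstract highlights.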