Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics cannot explain their verdict or associate the scores with defects in generated text. To address this limitation, we present InstructScore, an explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for generated text and a human-readable diagnostic report. We evaluate InstructScore on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, our InstructScore, even without direct supervision from human-rated data, achieves performance levels on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
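As a rough illustration of how an evaluator of this kind can be used, the sketch below loads a fine-tuned LLaMA-style checkpoint with Hugging Face transformers, prompts it with a reference and a candidate, and derives a numeric score from the error annotations in the generated diagnostic report. The checkpoint path, prompt wording, and severity-based penalty rule are assumptions for illustration only, not the paper's exact recipe.

```python
# Hypothetical sketch: scoring one candidate with an InstructScore-style evaluator.
# Checkpoint path, prompt template, and penalty scheme are illustrative assumptions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT = "path/to/instructscore-llama-7b"  # placeholder; substitute the released weights

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT, torch_dtype=torch.float16, device_map="auto"
)

def evaluate(reference: str, candidate: str) -> tuple[float, str]:
    """Return (score, diagnostic_report) for one candidate against its reference."""
    prompt = (
        "You are evaluating a machine translation.\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "List each error with its severity (major/minor), then stop."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens as the diagnostic report.
    report = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

    # Illustrative scoring rule: subtract a penalty per annotated error by severity.
    majors = len(re.findall(r"\bmajor\b", report, flags=re.IGNORECASE))
    minors = len(re.findall(r"\bminor\b", report, flags=re.IGNORECASE))
    score = -(5 * majors + 1 * minors)
    return score, report

score, report = evaluate(
    reference="The cat sat on the mat.",
    candidate="The cat sit on mat.",
)
print(score)
print(report)
```

The severity penalties here loosely follow MQM-style weighting; the scoring procedure actually used by InstructScore is defined in the paper.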