Evaluation of natural language generation (NLG) is complex and multi-dimensional. Generated text can be evaluated for fluency, coherence, factuality, or any other dimensions of interest. Most frameworks that perform such multi-dimensional evaluation require training on large manually or synthetically generated datasets. In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning, obviating the need for large training datasets. Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization, establishing state-of-the-art on dimensions such as relevance and factual consistency. We then analyze the effects of factors such as the selection and number of in-context examples on performance. Finally, we study the efficacy of in-context learning-based evaluators in evaluating zero-shot summaries written by large language models such as GPT-3.
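To make the setup concrete, the sketch below shows one way an in-context learning-based evaluator could be prompted to score a summary on a single dimension (relevance). The in-context examples, the 1-5 scoring scale, and the `query_llm` callable are illustrative assumptions, not the paper's exact prompt or interface.

```python
# Illustrative sketch of an in-context learning-based evaluator for one
# dimension (relevance). The examples, scale, and `query_llm` callable are
# assumptions for illustration, not the paper's exact prompt or API.

# Hypothetical hand-picked in-context examples: (source, summary, score).
IN_CONTEXT_EXAMPLES = [
    ("The city council approved the new budget on Monday after a long debate.",
     "City council approves new budget.", 5),
    ("The city council approved the new budget on Monday after a long debate.",
     "The mayor resigned amid protests.", 1),
]

def build_prompt(source: str, summary: str) -> str:
    """Assemble a few-shot prompt asking the LLM to rate relevance on a 1-5 scale."""
    parts = ["Rate the relevance of each summary to its source on a scale of 1 to 5.\n"]
    for src, summ, score in IN_CONTEXT_EXAMPLES:
        parts.append(f"Source: {src}\nSummary: {summ}\nRelevance: {score}\n")
    parts.append(f"Source: {source}\nSummary: {summary}\nRelevance:")
    return "\n".join(parts)

def evaluate_relevance(source: str, summary: str, query_llm) -> int:
    """`query_llm` is any callable mapping a prompt string to the model's text
    completion (e.g., a thin wrapper around a hosted LLM API)."""
    completion = query_llm(build_prompt(source, summary))
    return int(completion.strip().split()[0])  # parse the leading numeric score
```

The same pattern extends to other dimensions (coherence, fluency, factual consistency) by swapping the instruction and in-context examples, which is what makes the approach multi-dimensional without dimension-specific training data.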