
lfqa_eval

Introduction

This is the repository for the annotated data and models accompanying the paper:

Fangyuan Xu*, Yixiao Song*, Mohit Iyyer and Eunsol Choi. A Critical Evaluation of Evaluations for Long-form Question Answering. In Proceedings of ACL, 2023. * = Equal contribution.

We collected expert preferences over pairs of machine-generated and human-written long-form answers, together with free-form justifications of why the experts prefer one answer over the other. We also curated a collection of human preferences of long-form answers from previous work, and evaluated a suite of automatic metrics against this collection of human evaluation data. We release the human evaluation data as well as code for the automatic evaluations.

Data

We release the processed pairwise human preference data under preference_data/. This collection includes the expert annotations we collected and previously released human evaluations from the following prior work:

Each example is a JSON object with the following fields:

Note:

The unprocessed data from the prior work can be found at:

Raw expert annotations

Raw expert annotations, as well as the annotation interface, are under raw_annotation_data/.

Automatic evaluation

For Self-BLEU, refer to this script.
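As a rough illustration of the idea (a minimal sketch, not the exact script): Self-BLEU scores each sampled answer with BLEU against the remaining answers as references, then averages. The sketch below uses NLTK; the input strings are placeholders.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(answers):
    """Average BLEU of each answer against the other sampled answers as references."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, answer in enumerate(answers):
        # All other answers serve as references for this one.
        references = [a.split() for j, a in enumerate(answers) if j != i]
        scores.append(sentence_bleu(references, answer.split(),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```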

Reference-based

Please refer to the respective repos for BERTScore and BLEURT.

For BERTScore, we use the default roberta-large model for English (https://github.com/Tiiiger/bert_score) and report the maximum F1 BERTScore over the set of reference answers.
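A minimal sketch of this max-over-references scoring with the bert_score package (the answer strings are placeholders):

```python
from bert_score import score

references = ["first human-written reference answer ...",
              "second human-written reference answer ..."]
candidate = "machine-generated answer ..."

# Score the candidate against every reference (lang="en" selects the
# default roberta-large model) and keep the maximal F1.
_, _, f1 = score([candidate] * len(references), references, lang="en")
best_f1 = f1.max().item()
```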

For BLEURT, we use the BLEURT-20 checkpoint.
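A minimal sketch with the bleurt package, assuming the BLEURT-20 checkpoint has already been downloaded from the BLEURT repo into ./BLEURT-20:

```python
from bleurt import score as bleurt_score

# Path to the downloaded BLEURT-20 checkpoint directory.
scorer = bleurt_score.BleurtScorer("BLEURT-20")
scores = scorer.score(references=["human-written reference answer ..."],
                      candidates=["machine-generated answer ..."])
```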

(Question, answer) metrics

Please refer to the BARTScore repo for running BARTScore. We use facebook/bart-large-cnn, which is fine-tuned on the CNN/DM dataset.
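A minimal sketch using the BARTScorer class from that repo (bart_score.py must be on the path; the input strings are placeholders):

```python
from bart_score import BARTScorer  # bart_score.py ships with the BARTScore repo

scorer = BARTScorer(device="cuda:0", checkpoint="facebook/bart-large-cnn")
# Question as source, answer as target: the score reflects how likely
# the answer is given the question under the fine-tuned BART model.
scores = scorer.score(["the question text ..."],
                      ["the long-form answer text ..."], batch_size=4)
```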

For RankGen, refer to the RankGen repo. We use the question as the prefix and the entire answer paragraph as the suffix to rank, with the RankGen-XL-all checkpoint.
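A hedged sketch with the RankGenEncoder from the RankGen repo; kalpeshk2011/rankgen-t5-xl-all is the Hugging Face id for RankGen-XL-all, and the "embeddings" key of encode's return value is an assumption here, so check that repo for the exact output format:

```python
import torch
from rankgen import RankGenEncoder

encoder = RankGenEncoder("kalpeshk2011/rankgen-t5-xl-all")  # RankGen-XL-all
# Question as prefix, full answer paragraph as suffix, as described above.
prefix = encoder.encode(["the question text ..."], vectors_type="prefix")["embeddings"]
suffix = encoder.encode(["the answer paragraph ..."], vectors_type="suffix")["embeddings"]
# A larger dot product means RankGen ranks this answer higher for the question.
score = torch.sum(prefix[0] * suffix[0]).item()
```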

(Answer, reference) metric

Refer to the QAFactEval repo for downloading the models and setting up the environment. We provide a script, run_qafacteval.py, which runs QAFactEval to check an answer against the reference documents.
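If you would rather call the metric directly instead of through our script, here is a sketch following the QAFactEval repo's README; the model paths assume that repo's download script, and the exact keyword arguments should be treated as assumptions to verify against that repo:

```python
from qafacteval import QAFactEval

models = "models"  # directory populated by the QAFactEval download script
metric = QAFactEval(
    lerc_quip_path=f"{models}/quip-512-mocha",
    generation_model_path=f"{models}/generation/model.tar.gz",
    answering_model_dir=f"{models}/answering",
    lerc_model_path=f"{models}/lerc/model.tar.gz",
    lerc_pretrained_model_path=f"{models}/lerc/pretraining.tar.gz",
    use_lerc_quip=True, verbose=False, cuda_device=0,
    generation_batch_size=32, answering_batch_size=32, lerc_batch_size=8,
)

# Reference documents as sources; each answer to check is wrapped in a list.
results = metric.score_batch_qafacteval(
    ["the reference document text ..."],
    [["the long-form answer to check ..."]],
    return_qa_pairs=True,
)
score = results[0][0]["qa-eval"]["lerc_quip"]
```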

Learned metrics

We are cleaning up the code to release the learned Longformer-based reward model. Please contact the author (fangyuan[at]utexas.edu) if you would like to test it on your own data.

Citation and contact

If you find our work helpful, please cite us as:

@inproceedings{lfqa23,
  author = {Fangyuan Xu and Yixiao Song and Mohit Iyyer and Eunsol Choi},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2023},
  title = {A Critical Evaluation of Evaluations for Long-form Question Answering},
}

📧 Please contact Fangyuan Xu at fangyuan[at]utexas.edu if you have any questions.