This repository contains the annotated data and model for the following paper:
Fangyuan Xu, Yixiao Song, Mohit Iyyer and Eunsol Choi. A Critical Evaluation of Evaluations for Long-form Question Answering. In: Proceedings of ACL. 2023. *= Equal Contribution.
We collected expert preferences over pairs of machine-generated and human-written long-form answers, together with free-form justifications of why the expert prefers one answer over the other. We also curated a collection of human preferences over long-form answers from previous work, and evaluated a suite of automatic metrics on this combined human evaluation data. We release the human evaluation data as well as code for the automatic evaluations.
We release processed pairwise human preference data under preference_data/. This collection includes the expert annotations we collected and previously released human evaluations from the following prior work:
Each example is a JSON object with the following fields:
question: the question.
answer_a and answer_b: the two answer paragraphs being compared.
doc_a and doc_b (optional): the corresponding evidence documents, only available for WebGPT comparisons.
answer_a_type and answer_b_type: the name of the model that generated the answer, or human for human-written answers.
overall_preference: overall preference; the value is -1 (answer_a wins), 0 (tie) or 1 (answer_b wins).
coherence_preference (optional): coherence preference; the value is -1 (answer_a wins), 0 (tie) or 1 (answer_b wins).
factuality_preference (optional): factuality preference; the value is -1 (answer_a wins), 0 (tie) or 1 (answer_b wins).
justification (optional): free-form justification of why the annotator prefers one answer over the other.
Note:
The unprocessed data from the prior work can be found at:
Raw expert annotations, as well as the annotation interface, are under raw_annotation_data.
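A minimal sketch of reading processed preference examples and tallying their labels. The field names and label values come from the field list above; the one-JSON-object-per-line layout is an assumption, and the sample records are invented for illustration only (they are not taken from the released data).

```python
import json
from collections import Counter

# Readable names for the documented label values.
LABELS = {-1: "answer_a wins", 0: "tie", 1: "answer_b wins"}


def parse_examples(lines):
    """Parse an iterable of JSON strings, one example per line (assumed layout)."""
    return [json.loads(line) for line in lines if line.strip()]


def tally(examples, field="overall_preference"):
    """Count labels for `field`, skipping examples where an optional field is absent."""
    return Counter(LABELS[ex[field]] for ex in examples if ex.get(field) is not None)


# Invented sample records with placeholder text and a placeholder model name.
sample_lines = [
    '{"question": "q1", "answer_a": "a1", "answer_b": "b1", '
    '"answer_a_type": "human", "answer_b_type": "model_x", '
    '"overall_preference": -1, "coherence_preference": 0}',
    '{"question": "q2", "answer_a": "a2", "answer_b": "b2", '
    '"answer_a_type": "model_x", "answer_b_type": "human", '
    '"overall_preference": 1}',
    '{"question": "q3", "answer_a": "a3", "answer_b": "b3", '
    '"answer_a_type": "human", "answer_b_type": "model_x", '
    '"overall_preference": -1, "coherence_preference": 1}',
]

examples = parse_examples(sample_lines)
overall_counts = tally(examples)
coherence_counts = tally(examples, "coherence_preference")
print(overall_counts)
print(coherence_counts)
```

To read from disk instead, pass an open file handle to parse_examples; if the files under preference_data/ turn out to hold a single JSON array, json.load on the file replaces the per-line parsing.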
For Self-BLEU, refer to this script.
Please refer to the respective repos for BERTScore and BLEURT.
For BERTScore, we use the default roberta-large model for English (https://github.com/Tiiiger/bert_score) and report the maximal F1 BERTScore against the set of reference answers.
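The "maximal F1 against the set of reference answers" aggregation can be sketched as follows. This is only an illustration of the max-over-references step: token_f1 is a toy whitespace-overlap scorer standing in for BERTScore so that the example runs without downloading a model, and both function names are hypothetical.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Toy whitespace-token overlap F1; a stand-in for a real scorer such as BERTScore."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def max_over_references(candidate, references, scorer=token_f1):
    """Score the candidate against every reference and keep the best score."""
    if not references:
        raise ValueError("at least one reference answer is required")
    return max(scorer(candidate, ref) for ref in references)


best = max_over_references(
    "the sky is blue",
    ["grass is green", "the sky is blue today"],
)
print(round(best, 4))
```

Any function with the same (candidate, reference) signature can be passed as scorer, so a wrapper around a real BERTScore call could be dropped in without changing the aggregation.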
For BLEURT, we use the BLEURT-20 checkpoint.
Please refer to the BARTScore repo for running BARTScore. We use facebook/bart-large-cnn, which is fine-tuned on the CNN/DM dataset.
For RANKGEN, refer to the RANKGEN repo. We use the question as the prefix and the entire answer paragraph as the suffix to rank. We use the RankGen-XL-all checkpoint.
Refer to the QAFactEval repo for downloading the models and setting up the environment. We provide a script, run_qafacteval.py, which can be used to run QAFactEval to check the answer against the reference documents.
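Once a metric produces a score for each answer, one simple way to compare it with the human labels is pairwise agreement: predict the higher-scoring answer as the winner and check it against overall_preference. The sketch below is an assumption about how such a comparison could be set up, not necessarily the protocol used in the paper; length_metric is a hypothetical placeholder, and the toy examples are invented.

```python
def predict_preference(score_a, score_b, tie_margin=0.0):
    """Return -1 if answer_a scores higher, 1 if answer_b does, 0 within the margin."""
    diff = score_a - score_b
    if abs(diff) <= tie_margin:
        return 0
    return -1 if diff > 0 else 1


def agreement(examples, metric, field="overall_preference", skip_ties=True):
    """Fraction of examples where the metric's preference matches the human label."""
    hits = total = 0
    for ex in examples:
        label = ex.get(field)
        # Skip examples without the field and, optionally, human ties.
        if label is None or (skip_ties and label == 0):
            continue
        pred = predict_preference(
            metric(ex["question"], ex["answer_a"]),
            metric(ex["question"], ex["answer_b"]),
        )
        hits += int(pred == label)
        total += 1
    return hits / total if total else float("nan")


def length_metric(question, answer):
    """Hypothetical placeholder metric: longer answers score higher."""
    return len(answer.split())


# Invented toy examples using the documented field names.
toy = [
    {"question": "q", "answer_a": "one two three", "answer_b": "one", "overall_preference": -1},
    {"question": "q", "answer_a": "one", "answer_b": "one two", "overall_preference": -1},
    {"question": "q", "answer_a": "one", "answer_b": "one two", "overall_preference": 0},
    {"question": "q", "answer_a": "a", "answer_b": "b c", "overall_preference": 1},
]

length_agreement = agreement(toy, length_metric)
print(length_agreement)
```

Passing field="coherence_preference" or field="factuality_preference" evaluates the same metric against the fine-grained labels, where those optional fields are present.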
We are cleaning up the code in order to release the learned Longformer-based reward model. Please contact the author (fangyuan[at]utexas.edu) if you would like to test it on your own data.
If you find our work helpful, please cite us as:
@inproceedings{lfqa23,
author={Fangyuan Xu and Yixiao Song and Mohit Iyyer and Eunsol Choi},
Booktitle = {Association for Computational Linguistics},
Year = "2023",
Title={A Critical Evaluation of Evaluations for Long-form Question Answering},
}
📧 Please contact Fangyuan Xu at fangyuan[at]utexas.edu if you have any questions.