conda create -n ragqa-arena python=3.10.0
conda activate ragqa-arena
pip install -r requirements.txt
Data can be downloaded here: https://drive.google.com/drive/folders/1qhKEbGgBfrPzqGvhnzVoTvi3a6tiMyDP?usp=sharing. Save the files into the data/ folder.
Queries with citations can be found in data/annotations_{domain}_with_citation.jsonl. The in-text citations are included in the field "faithful_answer_w_citation", in the form [1, 2, 3], where each number is the index of an annotated gold document. These gold documents carry a doc_id that can be mapped to the doc_id of the corresponding RobustQA document source. Citations are also available as a list of lists in "citation_numbers", listed in the same order as the in-text citations.
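For illustration, the in-text citation markers can be parsed out of "faithful_answer_w_citation" and checked against "citation_numbers". This is a minimal sketch; the record below is fabricated for the example, not taken from the dataset:

```python
import json
import re

def extract_citations(answer_text):
    """Parse in-text citation markers like [1] or [2, 3] from an answer.

    Returns a list of lists of ints, one entry per marker, in the order
    the markers appear in the text.
    """
    return [
        [int(n) for n in group.split(",")]
        for group in re.findall(r"\[([\d,\s]+)\]", answer_text)
    ]

# Fabricated record mimicking one line of
# data/annotations_{domain}_with_citation.jsonl (values are made up).
record = json.loads(
    '{"faithful_answer_w_citation": "Use a heat sink [1]. '
    'Thermal paste also helps [2, 3].", '
    '"citation_numbers": [[1], [2, 3]]}'
)

parsed = extract_citations(record["faithful_answer_w_citation"])
# The parsed markers line up with "citation_numbers" in the same order.
assert parsed == record["citation_numbers"]
```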
data/annotations_{domain}_with_citation.jsonl differs slightly from the previous version that does not have citation annotations (statistics shown below). Please proceed with the version that satisfies your use case.
| Domain | Original All Samples | With Citation |
|---|---|---|
| Technology | 2115 | 2064 |
| Lifestyle | 2212 | 2198 |
| Recreation | 2096 | 2090 |
| Writing | 2696 | 2694 |
| Science | 1426 | 1404 |
| FIQA | 3669 | 3570 |
We only provide new annotations for this project. For the underlying corpus, you need to
Scripts for answer generation are under data/ans_gen/. Modify the script by adding your openai_key, then run:
bash generate_response_gpt.sh
Note that you will need up to 8 A100 GPUs; change the models and set clm_new_token_trigger accordingly in this script in order to parse the output properly.
bash generate_response_hf.sh
Output files are saved under data/pairwise_eval/, which will be used as inputs to the next section, Pairwise Evaluation.
We select gpt-4-0125-preview as our final evaluator.
Modify the script by adding your openai_key, then run:
bash run_pairwise_eval_lfrqa.sh
Modify the script by adding your openai_key, then run:
bash run_pairwise_eval_llms.sh
After running Section 2, you can run the following script to see RAG-QA Arena results.
python code/report_results.py
This script reports win and win+tie rates against LFRQA only.
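For intuition, the win and win+tie rates can be computed from per-query judgments roughly as below. This is a sketch assuming each query yields a single win/tie/loss judgment from the model's perspective; the actual files under data/pairwise_eval/ may be structured differently:

```python
from collections import Counter

def win_rates(judgments):
    """Compute win and win+tie rates from pairwise judgments.

    `judgments` is a list of strings, each "win", "tie", or "loss",
    one per query, from the evaluated model's perspective against LFRQA.
    """
    counts = Counter(judgments)
    total = len(judgments)
    win = counts["win"] / total
    win_plus_tie = (counts["win"] + counts["tie"]) / total
    return win, win_plus_tie

# Example: 2 wins, 1 tie, 1 loss over 4 queries.
win, win_tie = win_rates(["win", "tie", "loss", "win"])
# win == 0.5, win_tie == 0.75
```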
python code/report_results.py --use_complete_pairs
This script reports win and win+tie rates for all comparisons, and outputs an all_battles.json file that can be used in this Google Colab: https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH to compute Elo ratings.
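For reference, a standard sequential Elo update over such battles looks roughly like this. The K factor and base rating below are illustrative choices, and the Colab's exact parameters and input format may differ:

```python
def compute_elo(battles, k=4, base=1000):
    """Sequential Elo over a list of (model_a, model_b, winner) battles.

    `winner` is "model_a", "model_b", or "tie"; a tie counts as half a
    win for each side.
    """
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, base)
        rb = ratings.setdefault(model_b, base)
        # Expected score of model_a under the logistic Elo model.
        ea = 1 / (1 + 10 ** ((rb - ra) / 400))
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] = ra + k * (sa - ea)
        ratings[model_b] = rb + k * ((1 - sa) - (1 - ea))
    return ratings

ratings = compute_elo([
    ("gpt-4", "llama-3", "model_a"),
    ("gpt-4", "mixtral-large", "tie"),
])
# gpt-4's rating ends above llama-3's after winning their battle.
```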
We've evaluated 11 leading models and their complete pairs for 1400 queries. Results can be downloaded from https://drive.google.com/drive/folders/1fnJ_PUd33go0BXq0ShY9ofkVYERpyzFu?usp=sharing. Run the following scripts to see the results. Make sure the results are in the ./eval_results/ sub-folder.
Top 5 retrieved passages for 11 LLMs compared to all LFRQA answers.
python code/report_results.py \
--root eval_results/lfrqa/5_psgs/
Top 5 retrieved passages for 11 LLMs compared to all LFRQA answers + 1400 queries with complete LLM pairwise comparisons.
python code/report_results.py \
  --root eval_results/lfrqa/5_psgs/ \
  --use_complete_pairs \
  --sub_dirs complete_pair_batch_1 complete_pair_batch_2
Top 10 retrieved passages for 11 LLMs compared to all LFRQA answers.
python code/report_results.py \
  --root eval_results/lfrqa/10_psgs/
In data/human_eval, we share two types of evaluation results:
- human_evaluations.json includes the human evaluation results. There are three judgments in "raw", and we take the majority vote as the final vote.
- {LFRQA|RQA}_{RQA|gpt-4|llama-3|mixtral-large}-eval_by_gpt4-0125-preview.json contains the LLM-based evaluators' results.

In these files, you can find RobustQA's and LFRQA's annotations together with the LLM-generated answers. We sampled 100 queries per domain, so 700 queries in total. For brevity, we do not repeat the specific scripts used to generate the model-based evaluation results for these queries; all details can be found in Section 2 above. To check the results, simply run python code/compute_correlation.py.
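The majority vote over the three judgments in "raw" can be sketched as follows. The judgment labels used here are illustrative; the actual label strings in human_evaluations.json may differ:

```python
from collections import Counter

def majority_vote(raw_judgments):
    """Return the most common judgment among the three in "raw"."""
    return Counter(raw_judgments).most_common(1)[0][0]

# e.g. two annotators prefer the LFRQA answer, one prefers the model's.
final = majority_vote(["lfrqa", "model", "lfrqa"])
# final == "lfrqa"
```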
@article{han2024ragqaarena,
title={RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering},
author={Rujun Han and Yuhao Zhang and Peng Qi and Yumo Xu and Jenyuan Wang and Lan Liu and William Yang Wang and Bonan Min and Vittorio Castelli},
year={2024},
journal={arXiv preprint arXiv:2407.13998},
url={https://arxiv.org/abs/2407.13998}
}
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.