RAGEval is a novel framework designed for automatically generating evaluation datasets to assess the knowledge usage ability of different Large Language Models (LLMs) in various Retrieval-Augmented Generation (RAG) scenarios. Unlike existing RAG benchmarks that focus on general knowledge, RAGEval enables the creation of domain-specific factual queries, allowing for a more nuanced evaluation of RAG systems across different vertical domains.
See the rageval/evaluation folder and the dragonball_dataset folder. The RAGEval pipeline is coming soon!
Flexible Schema Generation: Summarizes a schema from seed documents to capture domain-specific knowledge structures.
Diverse Document Generation: Uses the schema to generate varied configurations and subsequently diverse documents across multiple domains.
Comprehensive QA Pair Creation: Constructs question-answering pairs based on generated documents and configurations.
Novel Evaluation Metrics: Introduces three new metrics - Completeness, Hallucination, and Irrelevance - for a more thorough assessment of RAG model responses.
Multi-Domain Support: Covers various domains including finance, legal, and medical sectors in both Chinese and English languages.
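One way to read the three evaluation metrics is as fractions over the key points of a reference answer: each key point is either covered by the model's response, contradicted by it, or neither. The sketch below works under that assumption. The text above does not spell out the formulas, and the label names and the `score_answer` helper are hypothetical illustrations, not part of the RAGEval codebase. The judging step that assigns a label to each key point (for example, an LLM judge) is not shown.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical label names for illustration; not taken from the RAGEval code.
COVERED = "covered"            # the response states the key point
CONTRADICTED = "contradicted"  # the response conflicts with the key point
MISSING = "missing"            # the response neither states nor contradicts it


def score_answer(labels: List[str]) -> Dict[str, float]:
    """Turn per-key-point labels into the three metric values.

    Assumes each key point of the reference answer has exactly one label.
    """
    if not labels:
        raise ValueError("at least one key point is required")
    unknown = set(labels) - {COVERED, CONTRADICTED, MISSING}
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    return {
        "completeness": counts[COVERED] / total,
        "hallucination": counts[CONTRADICTED] / total,
        "irrelevance": counts[MISSING] / total,
    }


# Four key points: two covered, one contradicted, one not addressed.
scores = score_answer([COVERED, COVERED, CONTRADICTED, MISSING])
print(scores)
```

Under this reading the three values always sum to 1, so a response can only lower its Hallucination or Irrelevance score by covering more key points correctly.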
RAGEval has been used to benchmark various LLMs and RAG configurations.
RAGEval provides a comprehensive framework for evaluating RAG systems in domain-specific scenarios, offering more nuanced insights than existing benchmarks. It highlights the potential for significant improvements in open-source models for RAG tasks.
Please cite the following paper if you find RAGEval helpful!
@misc{zhu2024ragevalscenariospecificrag,
title={RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework},
author={Kunlun Zhu and Yifan Luo and Dingling Xu and Ruobing Wang and Shi Yu and Shuo Wang and Yukun Yan and Zhenghao Liu and Xu Han and Zhiyuan Liu and Maosong Sun},
year={2024},
eprint={2408.01262},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2408.01262},
}