Tongji-KGLLM / RAG-Survey


Consolidating evaluation scores across different approaches #18

Open Joshua-Yu opened 3 months ago

Joshua-Yu commented 3 months ago

Hi team,

First of all, thanks for putting all of this together! The coverage and depth on the subject are amazing, the best I have seen so far. Super well done!

I am wondering whether you plan to consolidate and publish the evaluation scores from the original papers, for reference and tracking?

Best regards

Joshua

yunfan42 commented 3 months ago

Hi Joshua,

Thank you for your kind words and feedback!

We have been considering this aspect and are also actively working on it.

The evaluation and comparison of experimental results across RAG methods has long been a missing piece.

We first hope to summarize the results of different RAG methods on some commonly used datasets. For example, in the OpenRAG Base "Dataset" table, we can see that Natural Questions, HotpotQA, and TriviaQA are currently the most commonly used evaluation datasets for RAG, appearing in 24, 17, and 17 papers respectively.

However, one difficulty is that even on the same dataset, RAG papers differ considerably in their experimental setups, for example in the choice of retrieval source (even the version of the Wikipedia dump), the chunking strategy, and the amount of data used. Simply aggregating the reported scores could therefore be misleading.
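To make that concern concrete, here is a minimal sketch (our own illustration, not part of the survey or OpenRAG Base; all field and function names are hypothetical) of how consolidated scores could carry their experimental setup, so that comparisons are only drawn between results obtained under matching assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSetup:
    """Experimental assumptions that must match before scores are comparable."""
    dataset: str            # e.g. "NaturalQuestions", "HotpotQA", "TriviaQA"
    retrieval_source: str   # e.g. "Wikipedia dump 2018-12-20"
    chunking: str           # e.g. "100-word passages"
    eval_subset: str        # e.g. "full test set" vs. a sampled subset

@dataclass
class Result:
    method: str             # RAG method as named in the original paper
    paper: str              # citation key, for traceability back to the source
    setup: EvalSetup
    metric: str             # e.g. "EM", "F1"
    score: float

def comparable_groups(results: list[Result]) -> dict[tuple[EvalSetup, str], list[Result]]:
    """Group results by (setup, metric) so only like-for-like scores are ranked together."""
    groups: dict[tuple[EvalSetup, str], list[Result]] = {}
    for r in results:
        groups.setdefault((r.setup, r.metric), []).append(r)
    return groups
```

Keying the grouping on the full setup rather than just the dataset name is what would prevent, say, scores computed against different Wikipedia dumps or chunking strategies from being ranked against each other.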

Rest assured, we are paying close attention to RAG evaluation. Please stay tuned for the follow-up work on OpenRAG.

Yunfan Gao