Tongji-KGLLM / RAG-Survey


Consolidating evaluation scores across different approaches #18

Open Joshua-Yu opened 3 months ago

Joshua-Yu commented 3 months ago

Hi team,

First of all, thanks for putting all of this together! The coverage and depth on the subject are amazing, the best I have seen so far. Super well done!

I am wondering whether you plan to consolidate and publish the evaluation scores from the original papers, for reference and tracking?

Best regards

Joshua

yunfan42 commented 3 months ago

Hi Joshua,

Thank you for your kind words and feedback!

We have been considering this aspect and are also actively working on it.

The evaluation and comparison of experimental results across RAG methods has long been a missing piece.

We first hope to summarize the results of different RAG methods on some commonly used datasets. For example, in the OpenRAG Base "Dataset" table, we can see that Natural Questions, HotpotQA, and TriviaQA are currently the most commonly used evaluation datasets for RAG, appearing in 24, 17, and 17 papers respectively.

However, one difficulty is that even on the same dataset, RAG papers differ considerably in their experimental setups, for example in the choice of retrieval source (even the version of the Wikipedia dump), the chunking strategy, and the amount of data used. Simply aggregating the reported scores could therefore be misleading.
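To make that concern concrete, here is a minimal sketch (our own illustration, not part of the survey or OpenRAG Base; all field and function names are hypothetical) of how consolidated scores could carry their experimental setup, so that comparisons are only drawn between results obtained under matching assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSetup:
    """Experimental assumptions that must match before scores are comparable."""
    dataset: str            # e.g. "NaturalQuestions", "HotpotQA", "TriviaQA"
    retrieval_source: str   # e.g. "Wikipedia dump 2018-12-20"
    chunking: str           # e.g. "100-word passages"
    eval_subset: str        # e.g. "full test set" vs. a sampled subset

@dataclass
class Result:
    method: str             # RAG method as named in the original paper
    paper: str              # citation key, for traceability back to the source
    setup: EvalSetup
    metric: str             # e.g. "EM", "F1"
    score: float

def comparable_groups(results: list[Result]) -> dict[tuple[EvalSetup, str], list[Result]]:
    """Group results by (setup, metric) so only like-for-like scores are ranked together."""
    groups: dict[tuple[EvalSetup, str], list[Result]] = {}
    for r in results:
        groups.setdefault((r.setup, r.metric), []).append(r)
    return groups
```

Keying the grouping on the full setup rather than just the dataset name is what would prevent, say, scores computed against different Wikipedia dumps or chunking strategies from being ranked against each other.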

Rest assured, we are paying close attention to RAG evaluation. Please stay tuned for the follow-up work on OpenRAG.

Yunfan Gao