HuskyInSalt / CRAG

Corrective Retrieval Augmented Generation

CRAG_Metric #11

Open wtc9806 opened 3 months ago

wtc9806 commented 3 months ago

Hi authors,

Thanks for the great work. I am a little confused about eval.py. In the paper, accuracy is reported as the evaluation metric for arc_challenge, but in the code, match is used as the metric instead. Are these two the same? Also, when computing accuracy, why is there an output key in the data?

Thanks.

HuskyInSalt commented 3 months ago

Hi @wtc9806 , arc_challenge is a dataset consisting of multiple-choice questions. The current evaluation method matches the predicted option against the golden label, which is equivalent to measuring the accuracy of the predictions. In fact, both the term accuracy used in the paper and the metric functions in the evaluation code are consistent with Self-RAG.
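
For reference, here is a minimal sketch (not the repository's actual eval.py) of how a match-style metric over multiple-choice predictions reduces to plain accuracy. The `match` and `accuracy` function names and the substring check are illustrative assumptions, not the exact logic used in CRAG or Self-RAG.

```python
# Minimal sketch: a "match" metric over multiple-choice predictions
# that, when averaged, is the same thing as accuracy.
# Assumes each example has a predicted string and a gold option label
# (e.g. "A"/"B"/"C"/"D" for arc_challenge). Illustrative only.

def match(prediction: str, gold: str) -> int:
    """Return 1 if the gold option appears in the prediction, else 0."""
    return int(gold.strip().lower() in prediction.strip().lower())

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Average of per-example match scores, i.e. plain accuracy."""
    scores = [match(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    preds = ["The answer is B", "A", "D"]
    golds = ["B", "A", "C"]
    print(f"accuracy = {accuracy(preds, golds):.3f}")  # 0.667
```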

wtc9806 commented 2 months ago

Got it! Thank you!