Open wtc9806 opened 3 months ago
Hi authors,
Thanks for the great work. I am a little confused about eval.py. In the paper, accuracy is used as the evaluation metric for arc_challenge, but the code actually uses match as the metric. Are these two the same? Also, when testing accuracy, why is there an output key in the data?
Thanks.
Hi @wtc9806, arc_challenge is a dataset of multiple-choice questions. The current evaluation matches the predicted option against the gold label, so the match score is exactly the accuracy of the predictions. The term accuracy used in the paper and the metric functions in the evaluation code are both consistent with Self-RAG.
Got it! Thank you!
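For readers landing on this thread: the option-matching evaluation discussed here can be sketched roughly as below. This is a hypothetical illustration, not the repository's actual eval.py; the function name match_accuracy and the plain string comparison are assumptions for the sake of the example.

```python
def match_accuracy(predictions, golds):
    """Fraction of predicted options that match the gold labels.

    For a multiple-choice dataset like arc_challenge, "match" and
    "accuracy" coincide: a prediction is correct iff the predicted
    option equals the gold option.
    """
    assert len(predictions) == len(golds)
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return correct / len(predictions)


# Example: 3 of 4 predicted options match the gold labels -> 0.75
preds = ["A", "C", "B", "D"]
golds = ["A", "B", "B", "D"]
print(match_accuracy(preds, golds))
```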