dvlab-research / MR-GSM8K

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs

How to calculate the scores? #4

Closed zhangxjohn closed 6 months ago

zhangxjohn commented 6 months ago

I ran the eval_open_source_models.py script and got the eval results as a JSON file (e.g. xxxx_xxx_eval_results.json). However, I found that calculate_mr_score.py does not translate this JSON into the following result:

"MetaMath":{
    't1-tp': 1305,
    't1-tn': 166,
    't1-fp': (1573-166),
    't1-fn': (1427-1305),
    't1-recall': 1305/1427,
    't1-precision': 1305/(1305+1573-166),
    't2-accuracy': 22/1573,
    't3-accuracy': 6/1573,
    't3-accuracy-auto': 7/1573,         
}

The scripts folder does not include a program to calculate these values. Could you provide it, or tell me how to calculate them?
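
For reference, this is how I currently read those fields; a minimal sketch of my own assumed definitions (not taken from the repo), using the MetaMath counts above:

```python
# My assumed reading of the fields above (not taken from the repo):
# tp/tn/fp/fn are confusion-matrix counts for Task 1 (judging solution
# correctness), and the t2/t3 values are accuracies over the 1573 cases
# that actually contain an error.

def task1_precision_recall(tp, fp, fn):
    """Standard precision/recall from the Task 1 confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return precision, recall

# MetaMath counts quoted above: tp=1305, tn=166, fp=1573-166, fn=1427-1305
precision, recall = task1_precision_recall(1305, 1573 - 166, 1427 - 1305)
t2_accuracy = 22 / 1573   # correctly identified first error steps
t3_accuracy = 6 / 1573    # correctly identified error reasons
print(precision, recall, t2_accuracy, t3_accuracy)
```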

Randolph-zeng commented 6 months ago

Hi John, thanks for your interest in our work, and apologies for the gap in our eval scripts that confused you. We have updated the calculate_mr_score.py script to close this post-processing gap. Here is how the overall workflow goes:

Once you have run the eval scripts and collected the responses from the models, our annotation should suffice to determine whether the evaluated model correctly judged the solution correctness and, where applicable, the first error step. For the error reason, however, you either need to annotate it manually yourself or use our GPT-4 helper to determine whether the error reason given by the evaluated model aligns with the reasons given by our annotators.
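
To make the first two checks concrete, here is a rough sketch of the comparison; the key names are illustrative placeholders rather than the exact schema of the eval result files:

```python
# Rough sketch of the Task 1 / Task 2 checks against the ground-truth annotation.
# The key names below are illustrative placeholders, not the exact schema of the
# eval result files; Task 3 (error reason) is not handled here because it needs
# manual review or the GPT-4 helper.

def check_task1_task2(record):
    """Compare one model judgement against the ground-truth annotation."""
    pred_correct = record["model_judged_correct"]      # model's verdict (bool)
    gold_correct = record["annotated_correct"]         # ground-truth label (bool)
    task1_hit = (pred_correct == gold_correct)

    task2_hit = None                                   # Task 2 only applies to incorrect solutions
    if not gold_correct:
        task2_hit = (record["model_first_error_step"]
                     == record["annotated_first_error_step"])
    return task1_hit, task2_hit
```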

The above process is how we collected the results in our eval_results folder. Given these eval results, you can now use the calculate_mr_score.py script to get the MR-Score. It works as follows: it analyzes the eval result file and uses the ground-truth annotation to gather the statistics for Task 1 (determine solution correctness), Task 2 (find the first error step), and Task 3 (determine the error reason). It then combines the statistics from all three tasks and unifies them into the MR-Score.
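
Schematically, the combination looks roughly like the sketch below; using MCC for the Task 1 contribution and the particular weights shown are assumptions for illustration, and the authoritative metric choices and weights are the ones implemented in calculate_mr_score.py:

```python
# Schematic of how the three task-level statistics are folded into MR-Score.
# The use of MCC for Task 1 and the weights below are illustrative; see
# calculate_mr_score.py for the authoritative combination.
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for the Task 1 correctness judgement."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def mr_score(tp, tn, fp, fn, t2_acc, t3_acc, w1=0.2, w2=0.3, w3=0.5):
    """Weighted combination of Task 1 (via MCC), Task 2 and Task 3 accuracies."""
    return w1 * max(0.0, mcc(tp, tn, fp, fn)) + w2 * t2_acc + w3 * t3_acc

# Example with the MetaMath statistics quoted in the question above:
print(mr_score(1305, 166, 1573 - 166, 1427 - 1305, 22 / 1573, 6 / 1573))
```

Clamping the Task 1 term at zero simply keeps a random or adversarial correctness judgement from dragging the combined score negative.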

To try out the latest MR-Score calculation script, simply update the repo path in the main function and run it. You should get the same statistics as shown in the README table.

I hope this update makes it a bit easier for you to evaluate the results from your own model. If the above explanation or the code is not clear enough, please let us know and we will get back to you ASAP.

zhangxjohn commented 6 months ago

Thank you. It's really great work!