dvlab-research / MR-GSM8K

Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
MIT License

BUG: error_reason_correctness #3

Open zhangxjohn opened 7 months ago

zhangxjohn commented 7 months ago

On line 63 of auto_grade_error_reasons.py, the code `to_be_graded_data = [data for data in eval_data if data['error_reason_correctness'] != 'N/A']` references the field error_reason_correctness, but when I run eval_open_source_models.py, the output JSON does not contain this field. Why? Is this a code bug or not?

Randolph-zeng commented 7 months ago

Oh, thanks for pointing this out. The "error_reason_correctness" field is manually labelled by our annotators, who judge the correctness of the error reason returned by the evaluated models. `auto_grade_error_reasons.py` is meant to use GPT-4 to replace that human effort. However, in the paper we used the human labels to verify the correctness of the GPT-4 labels, which is why the script filters on this field. In your case, you can safely ignore this field. Thanks :)
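If you still want to run the auto-grading script on outputs that lack the manual annotation, one minimal workaround (a hypothetical helper sketched here, not part of the repo) is to replace the strict `data['error_reason_correctness']` lookup with `dict.get`, so records missing the field are kept rather than raising a `KeyError`:

```python
def select_gradable(eval_data):
    """Filter records for auto-grading.

    Records explicitly marked 'N/A' by human annotators are dropped.
    Records that lack the 'error_reason_correctness' field entirely
    (e.g. raw output of eval_open_source_models.py) are kept, which
    effectively ignores the field as the maintainers suggest.
    """
    return [
        data for data in eval_data
        if data.get('error_reason_correctness', '') != 'N/A'
    ]


records = [
    {'uuid': 'a', 'error_reason_correctness': 'correct'},
    {'uuid': 'b', 'error_reason_correctness': 'N/A'},
    {'uuid': 'c'},  # no manual annotation at all
]
to_be_graded_data = select_gradable(records)
# keeps records 'a' and 'c', drops the 'N/A' record 'b'
```

The field names above mirror the snippet quoted in this issue; any other keys (like `uuid`) are placeholders for whatever your eval JSON actually contains.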