deepseek-ai / DeepSeek-Math

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

MATH Test Score reproduce acc=43.6 #3

Closed. GanjinZero closed this issue 7 months ago.

GanjinZero commented 7 months ago

MATH on 5000 problems; deepseek-math-7b-rl, CoT prompting, greedy decoding, max 512 tokens.

Accuracy: 43.6, Non decode: 0.0

| Level | Correct/Total | Accuracy (%) |
| --- | --- | --- |
| Level 1 | 344/437 | 78.72 |
| Level 2 | 546/894 | 61.07 |
| Level 3 | 585/1131 | 51.72 |
| Level 4 | 467/1214 | 38.47 |
| Level 5 | 240/1324 | 18.13 |

| Subject | Correct/Total | Accuracy (%) |
| --- | --- | --- |
| Algebra | 764/1187 | 64.36 |
| Counting & Probability | 181/474 | 38.19 |
| Geometry | 173/479 | 36.12 |
| Intermediate Algebra | 170/903 | 18.83 |
| Number Theory | 217/540 | 40.19 |
| Prealgebra | 561/871 | 64.41 |
| Precalculus | 116/546 | 21.25 |
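
For reference, a breakdown like the one above can be produced by grouping graded results on the MATH test set's `level` and `type` metadata. The sketch below assumes a simple per-problem record format (one dict with `level`, `type`, and a boolean `correct`); it is for illustration only and is not the output format of this repo's scripts.

```python
# Minimal sketch of a per-level / per-subject breakdown over graded results.
# The `results` record format is assumed for illustration, not taken from the repo.
from collections import defaultdict

def breakdown(results):
    by_level = defaultdict(lambda: [0, 0])   # level   -> [correct, total]
    by_type = defaultdict(lambda: [0, 0])    # subject -> [correct, total]
    for r in results:
        for table, key in ((by_level, r["level"]), (by_type, r["type"])):
            table[key][0] += int(r["correct"])
            table[key][1] += 1
    for title, table in (("Level", by_level), ("Subject", by_type)):
        for key in sorted(table):
            correct, total = table[key]
            print(f"{title} {key}: {correct}/{total} ({100 * correct / total:.2f}%)")
```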

DeepSeekPH commented 7 months ago

Did you add this suffix at the end of each test case?

English questions: {question}\nPlease reason step by step, and put your final answer within \boxed{}.
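
For concreteness, here is a minimal sketch of appending that suffix to a MATH question and decoding greedily with Hugging Face `transformers`, following a pattern similar to the repo README's usage example. The generation settings (greedy, `max_new_tokens=512`) come from this thread; this is a sketch, not the official evaluation script.

```python
# Minimal sketch, not the official evaluation script: append the suffix above
# to each MATH question and decode greedily with at most 512 new tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/deepseek-math-7b-rl"
SUFFIX = "\nPlease reason step by step, and put your final answer within \\boxed{}."

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def solve(question: str, max_new_tokens: int = 512) -> str:
    messages = [{"role": "user", "content": question + SUFFIX}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Greedy decoding (do_sample=False), as in the setup reported above.
    output_ids = model.generate(
        input_ids, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
```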

GanjinZero commented 7 months ago

Accuracy: 47.6

| Level | Correct/Total | Accuracy (%) |
| --- | --- | --- |
| Level 1 | 369/437 | 84.44 |
| Level 2 | 606/894 | 67.79 |
| Level 3 | 637/1131 | 56.32 |
| Level 4 | 509/1214 | 41.93 |
| Level 5 | 260/1324 | 19.64 |

| Subject | Correct/Total | Accuracy (%) |
| --- | --- | --- |
| Algebra | 820/1187 | 69.08 |
| Counting & Probability | 186/474 | 39.24 |
| Geometry | 194/479 | 40.50 |
| Intermediate Algebra | 197/903 | 21.82 |
| Number Theory | 222/540 | 41.11 |
| Prealgebra | 618/871 | 70.95 |
| Precalculus | 144/546 | 26.37 |

GanjinZero commented 7 months ago

Strong. Respect.

Wangpeiyi9979 commented 7 months ago

Hello, for testing we set max_tokens to 1024. In addition, extracting and assessing math answers can be complex, which may cause inconsistencies in evaluation. Please use our evaluation script to reproduce the results in our paper.
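
To illustrate why extraction alone is non-trivial, here is a minimal sketch of recovering the final answer from a `\boxed{...}` span with manual brace matching. It is an illustrative helper, not the grading logic of the official evaluation script.

```python
# Illustrative helper, not the official grader: pull the contents of the last
# \boxed{...} out of a model completion. A simple regex fails on nested braces
# such as \boxed{\frac{1}{2}}, so braces are matched by hand.
def extract_boxed(text):
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    begin = start + len("\\boxed{")
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # unbalanced braces: yet another corner case
```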

haoxiongliu commented 6 months ago

It seems that there are some new corner cases in the outputs of the deepseek-math series models.

After modifying my evaluation script, I am able to get 50.50% for deepseek-math-7b-rl using the Minerva 4-shot prompts, which is close to the reported number.

But the remaining 1.2% gap is still a mystery. Their evaluation script seems valid at first glance.
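
For what it's worth, one common source of such corner cases is string normalization before comparing predicted and reference answers. The rules below are hypothetical examples of the kinds of mismatches an evaluation script has to handle (spacing commands, `\left`/`\right`, percent signs, `\text{...}`); they are not the normalization used by DeepSeek's script.

```python
# Hypothetical examples of answer normalization before string comparison.
# The specific rules are illustrative guesses at common mismatches, not the
# rules used by the official evaluation script.
def normalize_answer(ans):
    ans = ans.strip().strip("$").strip()
    ans = ans.replace("\\left", "").replace("\\right", "")             # \left( ... \right)
    ans = ans.replace("\\!", "").replace("\\,", "").replace(" ", "")   # LaTeX/ASCII spacing
    ans = ans.replace("\\%", "").replace("%", "")                      # "50\%" vs "50"
    if ans.startswith("\\text{") and ans.endswith("}"):
        ans = ans[len("\\text{"):-1]                                   # "\text{east}" vs "east"
    return ans

# e.g. normalize_answer(r"\left( 3, \frac{\pi}{2} \right)") == r"(3,\frac{\pi}{2})"
```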