allenai / open-instruct


Add MATH evaluation #135

Closed. danieljkim0118 closed this 2 months ago

danieljkim0118 commented 6 months ago

Added evaluation scripts for the MATH (Hendrycks et al., 2021) dataset.

yizhongw commented 6 months ago

Looks good, thanks @danieljkim0118! Have you tested the performance of some vanilla pretrained models and Tulu models? I am planning to run some tests, and it would be great if you have some numbers I can compare against.

hamishivi commented 4 months ago

It would be good to merge this soon!

hamishivi commented 3 months ago

I fixed this up a bit: the download link didn't work, the prompt was a bit off, etc. I followed the prompt from the Minerva paper (https://arxiv.org/abs/2206.14858), which is what the Llama 3 team claims they used (https://github.com/meta-llama/llama3/blob/main/eval_details.md). They report 30.0% accuracy with Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Despite this, I get the following results when evaluating Llama-3-8B-Instruct:

Calculating accuracy...
Accuracy: 0.1562
Per-type accuracy:
Number Theory: 0.15
Prealgebra: 0.2824
Geometry: 0.1273
Precalculus: 0.0568
Intermediate Algebra: 0.0421
Algebra: 0.203
Counting & Probability: 0.1751
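
(For anyone following along, the overall and per-type numbers above come from something along these lines; this is a simplified sketch, not the exact eval script, and the field names are illustrative.)

```python
from collections import defaultdict

# Simplified sketch; field names ("type", "is_correct") are illustrative and
# may not match the actual eval script in this PR.
def summarize(predictions):
    """predictions: list of dicts like {"type": "Algebra", "is_correct": True}."""
    overall = sum(p["is_correct"] for p in predictions) / len(predictions)
    print(f"Accuracy: {overall:.4f}")
    per_type = defaultdict(list)
    for p in predictions:
        per_type[p["type"]].append(p["is_correct"])
    print("Per-type accuracy:")
    for math_type, results in per_type.items():
        print(f"{math_type}: {round(sum(results) / len(results), 4)}")
```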

Other people seem to be having issues replicating the results too (https://github.com/meta-llama/llama3/issues/250), so perhaps we should wait a little to see if that issue gets resolved; otherwise we can just merge and note that we are using our own setting. I'm also checking what the oe-eval tool gets. I'll revisit in a bit.
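
For reference, my understanding of the Minerva-style prompt is a few-shot "Problem:/Solution:" format where each solution ends with a fixed "Final Answer: ..." sentence that the grader can parse. A rough sketch (the exemplar below is a placeholder, not one of the actual worked examples from the paper):

```python
# Rough sketch of a Minerva-style few-shot MATH prompt. The exemplar is a
# placeholder; the real prompt uses four worked examples from the Minerva paper.
FEW_SHOT_EXAMPLES = [
    {
        "problem": "What is $1+1$?",  # placeholder, not from the paper
        "solution": (
            "We add the two numbers: $1+1=2$.\n"
            "Final Answer: The final answer is $2$. I hope it is correct."
        ),
    },
]

def build_prompt(test_problem: str) -> str:
    parts = [
        f"Problem:\n{ex['problem']}\n\nSolution:\n{ex['solution']}"
        for ex in FEW_SHOT_EXAMPLES
    ]
    # The model is expected to continue with a chain-of-thought solution ending
    # in the same "Final Answer: ..." pattern, which the grader then extracts.
    parts.append(f"Problem:\n{test_problem}\n\nSolution:")
    return "\n\n".join(parts)
```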

hamishivi commented 2 months ago

Added a fix for #152 and fixed up the prompt. I now get ~22%, which is closer than before but still lagging a bit. @yulinggu-cs got ~28% with the oe-eval tool, so we should probably match that prompt/setup more closely. My initial attempts to do so today didn't match its performance.

hamishivi commented 2 months ago

Okay, it turns out I also had to fix up the eval/answer-normalization side of things to better match the Minerva setting, and switch to a multi-turn CoT prompt to further improve scores a bit. We now get 27.5% with Llama-3-8B-Instruct (see this beaker job). I consider this more or less close enough, since we don't have access to the exact prompt setting used for the reported Llama 3 numbers. This also roughly matches our oe-eval tool numbers (although note that they use a slightly different prompt).
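
To make the answer-matching side concrete: the idea, heavily simplified here (these are not the exact rules in the PR), is to pull out whatever follows the "final answer is" marker and normalize both prediction and gold before comparing strings. The multi-turn CoT prompt just presents the few-shot exemplars as alternating user/assistant chat turns rather than one flat string.

```python
import re

# Heavily simplified sketch of Minerva-style answer extraction/normalization;
# the actual implementation applies more rules (fractions, sqrt, units, etc.).
def extract_final_answer(generation: str) -> str:
    # Take whatever follows "final answer is", dropping the trailing
    # "I hope it is correct." if present.
    match = re.search(r"final answer is(.*)", generation,
                      flags=re.IGNORECASE | re.DOTALL)
    answer = match.group(1) if match else generation
    answer = answer.split("I hope it is correct")[0]
    return answer.strip().strip(".").strip()

def normalize(answer: str) -> str:
    answer = answer.strip().strip("$")
    answer = answer.replace("\\left", "").replace("\\right", "")
    answer = answer.replace("\\!", "").replace("\\,", "").replace(" ", "")
    answer = answer.replace("\\$", "").replace("\\%", "").replace("%", "")
    answer = re.sub(r"\\text\{.*?\}", "", answer)          # drop textual units
    answer = re.sub(r"^\\boxed\{(.*)\}$", r"\1", answer)   # unwrap \boxed{...}
    return answer

def is_correct(prediction: str, gold: str) -> bool:
    return normalize(extract_final_answer(prediction)) == normalize(gold)
```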

I'm going to merge this now, since I think this is close enough. Note that I'm only going to include a CoT setting, since the direct (no-CoT) MATH setting scores low, and I don't think it is typically reported anyway.