The leaderboard probably uses `--apply_chat_template` for instruct models (maybe also `--fewshot_as_multiturn`).
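Something along these lines, for example (a sketch only; I'm guessing at the exact leaderboard invocation, and the batch size is just an example):

```bash
# Sketch: evaluate the instruct model with the chat template applied and
# few-shot examples rendered as a multiturn conversation.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tasks leaderboard_math_hard \
  --num_fewshot 4 \
  --batch_size auto \
  --apply_chat_template \
  --fewshot_as_multiturn
```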
I have the same problem even when using `--apply_chat_template`, but when I added `--fewshot_as_multiturn` on top of the chat template argument I got this:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---:|---|---|---:|---|---:|
| leaderboard_math_hard | N/A | | | | | | | |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.1042 | ± | 0.0175 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | ↑ | 0.0163 | ± | 0.0115 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | ↑ | 0.0227 | ± | 0.0130 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | ↑ | 0.0107 | ± | 0.0062 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | ↑ | 0.0390 | ± | 0.0156 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | ↑ | 0.1140 | ± | 0.0229 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | ↑ | 0.0148 | ± | 0.0104 |
The results are quite low.
On the leaderboard it says this model achieved 16%. On the model's card it says 51.9 with zero-shot!
It would be great to know which parameters were used on the leaderboard so that it becomes easier to reproduce the results.
cc @clefourrier any idea what might be going on here?
Hi! We actually have a section on reproducibility in our docs here; you can simply follow the steps there. You do indeed need the chat template and few-shot multiturn params for chat/instruct models, and I'll add these two to the docs.
The other thing we noticed when discussing these results with the Meta team is that the instruction tuning of the model makes it ignore in-context learning: it is no longer able to follow the Minerva answer format, which is why most answers count as false.
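To make that concrete, here is a rough illustration of what the Minerva-style format implies, assuming the convention where solutions end with a line like `Final Answer: The final answer is $X$. I hope it is correct.` (this is a sketch, not the harness's actual filter code):

```bash
# Sketch only: Minerva-style few-shot prompts expect the solution to end with
#   Final Answer: The final answer is $<answer>$. I hope it is correct.
# and exact_match compares the value extracted from that line. If the instruct
# model never emits such a line, extraction fails and the answer counts as wrong.
echo 'Final Answer: The final answer is $7$. I hope it is correct.' \
  | grep -oP 'final answer is \$?\K[^$.]+'   # needs GNU grep with PCRE; prints: 7
```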
Hi @clefourrier thank you for looking into this.
It seems that @sorobedio used both of those settings and still wasn't able to reproduce the results displayed on the leaderboard, so maybe something else is missing.
The instruction-tuned model achieves a better score (0.16 on math raw) than the base model (0.05).
Aside: if anyone knows how Meta tested their models on MATH, I would love to take a look!
Did he use our fork, per the instructions? Some of our latest changes are not in the harness yet.
I'm getting results similar to those on the leaderboard, but not quite identical. It could be that vLLM handles stop sequences differently.
On `main`:

`lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct --tasks leaderboard_math_hard --batch_size auto --apply_chat_template --fewshot_as_multiturn --num_fewshot 4`
| Task | vLLM | LLM Leaderboard |
|---|---|---|
| - math_algebra_hard | 0.3225 ± 0.0267 | 0.3257 ± 0.0268 |
| - math_counting_and_prob_hard | 0.0894 ± 0.0258 | 0.1301 ± 0.0305 |
| - math_geometry_hard | 0.0758 ± 0.0231 | 0.0455 ± 0.0182 |
| - math_intermediate_algebra_hard | 0.0250 ± 0.0093 | 0.0250 ± 0.0093 |
| - math_num_theory_hard | 0.1429 ± 0.0283 | 0.1364 ± 0.0277 |
| - math_prealgebra_hard | 0.3523 ± 0.0345 | 0.2850 ± 0.0326 |
| - math_precalculus_hard | 0.0519 ± 0.0192 | 0.0222 ± 0.0127 |
I was able to reproduce:
hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto (8)
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard | 1|none | 4|exact_match|↑ |0.3290|± |0.0269|
| - leaderboard_math_counting_and_prob_hard | 1|none | 4|exact_match|↑ |0.1138|± |0.0288|
| - leaderboard_math_geometry_hard | 1|none | 4|exact_match|↑ |0.0909|± |0.0251|
|leaderboard_math_hard |N/A |none | 4|exact_match|↑ |0.1775|± |0.0099|
| - leaderboard_math_intermediate_algebra_hard| 1|none | 4|exact_match|↑ |0.0250|± |0.0093|
| - leaderboard_math_num_theory_hard | 1|none | 4|exact_match|↑ |0.1883|± |0.0316|
| - leaderboard_math_prealgebra_hard | 1|none | 4|exact_match|↑ |0.3472|± |0.0344|
| - leaderboard_math_precalculus_hard | 1|none | 4|exact_match|↑ |0.0370|± |0.0163|
| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr|
|---------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard|N/A |none | 4|exact_match|↑ |0.1775|± |0.0099|
Thank you for the help!
@baberabb I also found differences between HF and vllm. I don't know why. Maybe one loads the model in float32 and the other in float16?
I think both vLLM and HF default to the dtype in the model config
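If dtype is the suspect, one way to rule it out is to pin it explicitly on both backends. A sketch (assuming both the `hf` and `vllm` backends accept a `dtype` model arg; `bfloat16` is just an example value):

```bash
# Sketch: force the same dtype on both backends so precision isn't a variable.
lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks leaderboard_math_hard --num_fewshot 4 --batch_size auto \
  --apply_chat_template --fewshot_as_multiturn

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
  --tasks leaderboard_math_hard --num_fewshot 4 --batch_size auto \
  --apply_chat_template --fewshot_as_multiturn
```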
> On the leaderboard it says this model achieved 16%. On the model's card it says 51.9 with zero-shot!
> It would be great to know which parameters were used on the leaderboard so that it becomes easier to reproduce the results.
Hi, a simple question: does anyone know what leads to such a gap? It seems quite large. Are the datasets / num_shots the same?
I want to evaluate llama-3.1-instruct on the MATH benchmark. The code runs without any error, but every score seems to be 0, which doesn't match what is on HF. Any help?
And after some processing I get: