EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

[Question] I'm failing to evaluate a model on a task. #2146

Closed jpiabrantes closed 3 months ago

jpiabrantes commented 3 months ago

I want to evaluate llama-3.1-instruct on the MATH benchmark. The code runs without any errors, but the score for every task is 0, which doesn't match what is on HF. Any help?

!lm_eval \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --include_path /home/joaoabrantis/lm-evaluation-harness/lm_eval/tasks/leaderboard/math \
    --tasks leaderboard_math_hard

And after some processing I get:

hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 1
|                    Tasks                    |Version|Filter|n-shot|  Metric   |Value|   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|----:|---|-----:|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_algebra_hard             |Yaml   |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_counting_and_prob_hard   |Yaml   |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_geometry_hard            |Yaml   |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_intermediate_algebra_hard|Yaml   |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_num_theory_hard          |Yaml   |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_prealgebra_hard          |Yaml   |none  |     4|exact_match|    0|±  |     0|
| - leaderboard_math_precalculus_hard         |Yaml   |none  |     4|exact_match|    0|±  |     0|

|       Groups        |Version|Filter|n-shot|  Metric   |Value|   |Stderr|
|---------------------|-------|------|-----:|-----------|----:|---|-----:|
|leaderboard_math_hard|N/A    |none  |     4|exact_match|    0|±  |     0|

baberabb commented 3 months ago

The leaderboard probably uses --apply_chat_template for instruct models (maybe also --fewshot_as_multiturn)
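
Something along these lines should be closer to the leaderboard setup (just a sketch of your command above with the two flags appended; keep your --include_path if you still need it):

lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tasks leaderboard_math_hard \
    --apply_chat_template \
    --fewshot_as_multiturn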

sorobedio commented 3 months ago

I have the same problem even when using --apply_chat_template, but when I added --fewshot_as_multiturn on top of the chat template argument I got this:

|                    Tasks                    |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|-----:|---|-----:|
|leaderboard_math_hard                        |N/A    |      |      |           |      |   |      |
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|0.1042|±  |0.0175|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|0.0163|±  |0.0115|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|0.0227|±  |0.0130|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|0.0107|±  |0.0062|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|0.0390|±  |0.0156|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|0.1140|±  |0.0229|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|0.0148|±  |0.0104|

The results are quite low.

jpiabrantes commented 3 months ago

On the leaderboard it says this model achieved 16%. On the model's card it says 51.9 zero-shot!

It would be great to know which parameters were used on the leaderboard so that it is easier to reproduce the results.

haileyschoelkopf commented 3 months ago

cc @clefourrier any idea what might be going on here?

clefourrier commented 3 months ago

Hi! We actually have a section on reproducibility in our docs here; you can simply follow the steps there. You do indeed need the chat template and fewshot-as-multiturn parameters for chat/instruct models, and I'll add these two to the docs.

The other thing we noticed when discussing these results with the Meta team is that the instruction tuning makes the model ignore in-context learning: it is no longer able to follow the Minerva answer format, which is why most answers are counted as incorrect.

jpiabrantes commented 3 months ago

Hi @clefourrier thank you for looking into this.

It seems that @sorobedio used both of those settings and still wasn't able to achieve the results displayed on the Leaderboard. So maybe something else is missing.

The instruction-tuned model achieves a better score (0.16 on raw MATH) than the base model (0.05).

Aside: if anyone knows how Meta tested their models on MATH, I would love to take a look!

clefourrier commented 3 months ago

Did he use our fork, per the instructions? Some of our latest changes are not in the harness yet.

baberabb commented 3 months ago

I'm getting similar results to those on the leaderboard, but not quite the same. It could be that vLLM handles stop sequences differently.

on main:

lm_eval --model vllm --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct --tasks leaderboard_math_hard --batch_size auto --apply_chat_template --fewshot_as_multiturn --num_fewshot 4

|              Task              |     vLLM      |LLM Leaderboard|
|--------------------------------|---------------|---------------|
| - math_algebra_hard            |0.3225 ± 0.0267|0.3257 ± 0.0268|
| - math_counting_and_prob_hard  |0.0894 ± 0.0258|0.1301 ± 0.0305|
| - math_geometry_hard           |0.0758 ± 0.0231|0.0455 ± 0.0182|
| - math_intermediate_algebra_hard|0.0250 ± 0.0093|0.0250 ± 0.0093|
| - math_num_theory_hard         |0.1429 ± 0.0283|0.1364 ± 0.0277|
| - math_prealgebra_hard         |0.3523 ± 0.0345|0.2850 ± 0.0326|
| - math_precalculus_hard        |0.0519 ± 0.0192|0.0222 ± 0.0127|

jpiabrantes commented 3 months ago

I was able to reproduce:

hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto (8)
|                    Tasks                    |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------------------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
| - leaderboard_math_algebra_hard             |      1|none  |     4|exact_match|↑  |0.3290|±  |0.0269|
| - leaderboard_math_counting_and_prob_hard   |      1|none  |     4|exact_match|↑  |0.1138|±  |0.0288|
| - leaderboard_math_geometry_hard            |      1|none  |     4|exact_match|↑  |0.0909|±  |0.0251|
|leaderboard_math_hard                        |N/A    |none  |     4|exact_match|↑  |0.1775|±  |0.0099|
| - leaderboard_math_intermediate_algebra_hard|      1|none  |     4|exact_match|↑  |0.0250|±  |0.0093|
| - leaderboard_math_num_theory_hard          |      1|none  |     4|exact_match|↑  |0.1883|±  |0.0316|
| - leaderboard_math_prealgebra_hard          |      1|none  |     4|exact_match|↑  |0.3472|±  |0.0344|
| - leaderboard_math_precalculus_hard         |      1|none  |     4|exact_match|↑  |0.0370|±  |0.0163|

|       Groups        |Version|Filter|n-shot|  Metric   |   |Value |   |Stderr|
|---------------------|-------|------|-----:|-----------|---|-----:|---|-----:|
|leaderboard_math_hard|N/A    |none  |     4|exact_match|↑  |0.1775|±  |0.0099|

Thank you for the help!

jpiabrantes commented 3 months ago

@baberabb I also found differences between HF and vLLM. I don't know why. Maybe one loads the model in float32 and the other in float16?

baberabb commented 3 months ago

> @baberabb I also found differences between HF and vLLM. I don't know why. Maybe one loads the model in float32 and the other in float16?

I think both vLLM and HF default to the dtype in the model config.
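
If you want to rule out a dtype mismatch, you can pin it explicitly through --model_args for both backends (a sketch; in recent harness versions dtype is forwarded to the underlying model loader for both the hf and vllm backends):

lm_eval \
    --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
    --tasks leaderboard_math_hard \
    --apply_chat_template \
    --fewshot_as_multiturn

lm_eval \
    --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 \
    --tasks leaderboard_math_hard \
    --apply_chat_template \
    --fewshot_as_multiturn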

shizhediao commented 1 month ago

> On the leaderboard it says this model achieved 16%. On the model's card it says 51.9 zero-shot!
>
> It would be great to know which parameters were used on the leaderboard so that it is easier to reproduce the results.

Hi, a simple question: does anyone know what leads to such a gap? It seems quite large. Are the datasets / num_shots the same?