Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Difference between latest lm-eval-harness and lit-gpt eval #848

Open ajtejankar opened 5 months ago

ajtejankar commented 5 months ago

Hi,

I was trying to evaluate the Pythia-160M model against some tasks in lm-eval-harness and noticed that the results produced by the code in lit-gpt/eval and the latest version of lm-eval-harness are different. Here are the outputs of the two commands.

Command

python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --save_filepath pythia_160m_results.json

Result

{
    "results": {
        "winogrande": {
            "acc": 0.5185477505919495,
            "acc_stderr": 0.014042813708888378
        },
        "boolq": {
            "acc": 0.43700305810397555,
            "acc_stderr": 0.008675365793227082
        },
        "openbookqa": {
            "acc": 0.152,
            "acc_stderr": 0.01607198236791175,
            "acc_norm": 0.248,
            "acc_norm_stderr": 0.019332342821239103
        },
        "hellaswag": {
            "acc": 0.28141804421429994,
            "acc_stderr": 0.0044877188433302805,
            "acc_norm": 0.3053176658036248,
            "acc_norm_stderr": 0.004596006250433537
        },
        "piqa": {
            "acc": 0.5979325353645266,
            "acc_stderr": 0.011439867127267531,
            "acc_norm": 0.5908596300326442,
            "acc_norm_stderr": 0.011471593460443312
        }
    },
    "versions": {
        "winogrande": 0,
        "boolq": 1,
        "openbookqa": 0,
        "hellaswag": 0,
        "piqa": 0
    },
    "config": {
        "model": "pythia-160m",
        "batch_size": 16,
        "device": "cuda:0",
        "num_fewshot": 0,
        "limit": null,
        "bootstrap_iters": 100000,
        "no_cache": true
    }
}

Command

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag,openbookqa,winogrande,boolq,piqa \
    --device cuda:0 \
    --batch_size 16

Result

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.5688|±  |0.0087|
|hellaswag |Yaml   |none  |     0|acc     |0.2838|±  |0.0045|
|          |       |none  |     0|acc_norm|0.3027|±  |0.0046|
|openbookqa|Yaml   |none  |     0|acc     |0.1500|±  |0.0160|
|          |       |none  |     0|acc_norm|0.2680|±  |0.0198|
|piqa      |Yaml   |none  |     0|acc     |0.6230|±  |0.0113|
|          |       |none  |     0|acc_norm|0.6192|±  |0.0113|
|winogrande|Yaml   |none  |     0|acc     |0.5130|±  |0.0140|

As you can see, for some tasks like BoolQ and PIQA the results are quite different. I wonder what could cause such a big difference.

Best, Ajinkya

ajtejankar commented 5 months ago

I tried the Phi-2 and TinyLlama models and they had similar accuracies between the two methods, so it seems there is something off with the Pythia evaluation.

Andrei-Aksionov commented 5 months ago

Hello @ajtejankar. Around a month ago I pinned the version of lm-eval-harness (we had a problem with an update that introduced some breaking changes): https://github.com/Lightning-AI/lit-gpt/blob/5a8ec86a3977eabb416ee5d2a0eb600762212422/requirements-all.txt#L13

Try to run your tests again with the latest version:

git+https://github.com/EleutherAI/lm-evaluation-harness.git@master
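
For example, installing that revision directly (a minimal example, assuming a plain pip environment; this is just the requirements line above turned into an install command):

pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@master"
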
ajtejankar commented 5 months ago

Hi @Andrei-Aksionov,

Thanks for the quick reply. The evaluation.md tutorial requires installing lm-eval-harness from the master branch, and since I followed it, I think all of my tests were done with the master branch. In any case, I ran the test again as per your suggestion, and the results didn't change overall.

Command

python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --batch_size 8 \
    --save_filepath pythia_160m_master_branch_results.json

Results

{
    "results": {
        "piqa": {
            "acc": 0.5875952121871599,
            "acc_stderr": 0.011485407152743142,
            "acc_norm": 0.6033732317736671,
            "acc_norm_stderr": 0.011413778810510459
        },
        "winogrande": {
            "acc": 0.5272296764009471,
            "acc_stderr": 0.014031631629827696
        },
        "boolq": {
            "acc": 0.43853211009174314,
            "acc_stderr": 0.008678720482001875
        },
        "openbookqa": {
            "acc": 0.176,
            "acc_stderr": 0.017047852020622267,
            "acc_norm": 0.256,
            "acc_norm_stderr": 0.01953692357474761
        },
        "hellaswag": {
            "acc": 0.2810197171878112,
            "acc_stderr": 0.00448578446857668,
            "acc_norm": 0.3042222664807807,
            "acc_norm_stderr": 0.0045913698532765316
        }
    },
    "versions": {
        "piqa": 0,
        "winogrande": 0,
        "boolq": 1,
        "openbookqa": 0,
        "hellaswag": 0
    },
    "config": {
        "model": "pythia-160m",
        "batch_size": 8,
        "device": "cuda:0",
        "num_fewshot": 0,
        "limit": null,
        "bootstrap_iters": 100000,
        "no_cache": true
    }
}
Andrei-Aksionov commented 5 months ago

The evaluation.md tutorial requires installing lm-eval-harness from the master branch, and since I followed it, I think all of my tests were done with the master branch.

It was my mistake: when I pinned the version of lm-eval-harness, I forgot to update the tutorial.

Anyway, the difference is indeed noticeable. It's a bit strange that it only shows up for one model.

I tried the Phi-2 and TinyLlama models and they had similar accuracies between the two methods.

Have you tried only Pythia-160m, or the whole family of Pythia models? If not, could you also, for good measure, try to evaluate something similar in size to Phi-2 and TinyLlama? Maybe Pythia-1.4b?

Andrei-Aksionov commented 5 months ago

I tried running Pythia-160m and Pythia-1.4b myself and also noticed a difference in the output for the 160m version, though it's not the same as what you got (different package versions, maybe?). Everything was run with the latest code for both lm-eval and lit-gpt and the latest packages.

I used the same commands:

# Lit-GPT
python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/[model] \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --save_filepath [model]_results.json

# lm-eval
lm_eval --model hf --model_args pretrained=EleutherAI/[model] \
    --tasks hellaswag,openbookqa,winogrande,boolq,piqa \
    --device cuda:0 \
    --batch_size 16

1. Pythia-160m.

Lit-GPT:

{
    "results": {
        "piqa": {
            "acc": 0.5941240478781284,
            "acc_stderr": 0.011457256809261778,
            "acc_norm": 0.5930359085963003,
            "acc_norm_stderr": 0.011462093919190168
        },
        "openbookqa": {
            "acc": 0.162,
            "acc_stderr": 0.016494123566423526,
            "acc_norm": 0.266,
            "acc_norm_stderr": 0.019780559675655493
        },
        "hellaswag": {
            "acc": 0.28291177056363276,
            "acc_stderr": 0.004494934025462341,
            "acc_norm": 0.30262895837482573,
            "acc_norm_stderr": 0.004584571102598111
        },
        "winogrande": {
            "acc": 0.5406471981057617,
            "acc_stderr": 0.014005973823825141
        },
        "boolq": {
            "acc": 0.43730886850152906,
            "acc_stderr": 0.008676043429497423
        }
    },
}

lm-eval:

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.3835|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.2504|±  |0.0043|
|          |       |none  |     0|acc_norm|0.2507|±  |0.0043|
|openbookqa|Yaml   |none  |     0|acc     |0.2080|±  |0.0182|
|          |       |none  |     0|acc_norm|0.2420|±  |0.0192|
|piqa      |Yaml   |none  |     0|acc     |0.5359|±  |0.0116|
|          |       |none  |     0|acc_norm|0.5299|±  |0.0116|
|winogrande|Yaml   |none  |     0|acc     |0.4862|±  |0.0140|

2. Pythia-1.4b.

Lit-GPT:

{
    "results": {
        "openbookqa": {
            "acc": 0.214,
            "acc_stderr": 0.018359797502387025,
            "acc_norm": 0.33,
            "acc_norm_stderr": 0.021049612166134792
        },
        "boolq": {
            "acc": 0.6376146788990825,
            "acc_stderr": 0.00840730865586405
        },
        "hellaswag": {
            "acc": 0.40400318661621193,
            "acc_stderr": 0.004896952378506925,
            "acc_norm": 0.5202150965943039,
            "acc_norm_stderr": 0.004985701593897998
        },
        "piqa": {
            "acc": 0.7078346028291621,
            "acc_stderr": 0.010610252174513658,
            "acc_norm": 0.70620239390642,
            "acc_norm_stderr": 0.010627574080514818
        },
        "winogrande": {
            "acc": 0.5659037095501184,
            "acc_stderr": 0.013929882555694063
        }
    },
}

lm-eval:

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.6287|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.4036|±  |0.0049|
|          |       |none  |     0|acc_norm|0.5199|±  |0.0050|
|openbookqa|Yaml   |none  |     0|acc     |0.2200|±  |0.0185|
|          |       |none  |     0|acc_norm|0.3280|±  |0.0210|
|piqa      |Yaml   |none  |     0|acc     |0.7073|±  |0.0106|
|          |       |none  |     0|acc_norm|0.7116|±  |0.0106|
|winogrande|Yaml   |none  |     0|acc     |0.5730|±  |0.0139|
ajtejankar commented 5 months ago

Hi @Andrei-Aksionov,

I ran the 14M, 70M, 410M, and 1.4B models in addition to the 160M model, and it seems there is something wrong with the smaller models (160M and below): the results of the larger models are consistent between Lit-GPT and lm-eval. Detailed results are below. I used exactly the same commands as above, just with different model names, and added some code for better formatting of the results.
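
The formatting code is roughly along these lines (a minimal sketch, not the exact script I used; the file name matches the --save_filepath argument from the commands above, and only the acc/acc_norm values are kept, rounded to two decimals):

# Condense the saved lm_eval_harness JSON into a compact Task / Metric / Value table.
import json

with open("pythia_160m_results.json") as f:  # path passed via --save_filepath
    results = json.load(f)["results"]

print(f"{'Task':<12}{'Metric':<10}{'Value':>6}")
for task in sorted(results):
    for metric, value in results[task].items():
        if metric.endswith("_stderr"):
            continue  # keep only acc / acc_norm, drop the stderr entries
        print(f"{task:<12}{metric:<10}{value:>6.2f}")
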

Pythia-14M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.38|
|hellaswag |acc     | 0.26|
|hellaswag |acc_norm| 0.26|
|openbookqa|acc     | 0.19|
|openbookqa|acc_norm| 0.28|
|piqa      |acc     | 0.54|
|piqa      |acc_norm| 0.54|
|winogrande|acc     | 0.48|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.3798|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.2610|±  |0.0044|
|          |       |none  |     0|acc_norm|0.2590|±  |0.0044|
|openbookqa|Yaml   |none  |     0|acc     |0.1320|±  |0.0152|
|          |       |none  |     0|acc_norm|0.2760|±  |0.0200|
|piqa      |Yaml   |none  |     0|acc     |0.5571|±  |0.0116|
|          |       |none  |     0|acc_norm|0.5571|±  |0.0116|
|winogrande|Yaml   |none  |     0|acc     |0.5020|±  |0.0141|

Pythia-70M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.41|
|hellaswag |acc     | 0.26|
|hellaswag |acc_norm| 0.27|
|openbookqa|acc     | 0.17|
|openbookqa|acc_norm| 0.26|
|piqa      |acc     | 0.56|
|piqa      |acc_norm| 0.56|
|winogrande|acc     | 0.49|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.5232|±  |0.0087|
|hellaswag |Yaml   |none  |     0|acc     |0.2661|±  |0.0044|
|          |       |none  |     0|acc_norm|0.2749|±  |0.0045|
|openbookqa|Yaml   |none  |     0|acc     |0.1280|±  |0.0150|
|          |       |none  |     0|acc_norm|0.2480|±  |0.0193|
|piqa      |Yaml   |none  |     0|acc     |0.5947|±  |0.0115|
|          |       |none  |     0|acc_norm|0.5909|±  |0.0115|
|winogrande|Yaml   |none  |     0|acc     |0.5272|±  |0.0140|

Pythia-160M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.44|
|hellaswag |acc     | 0.28|
|hellaswag |acc_norm| 0.30|
|openbookqa|acc     | 0.18|
|openbookqa|acc_norm| 0.26|
|piqa      |acc     | 0.59|
|piqa      |acc_norm| 0.60|
|winogrande|acc     | 0.53|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.5688|±  |0.0087|
|hellaswag |Yaml   |none  |     0|acc     |0.2838|±  |0.0045|
|          |       |none  |     0|acc_norm|0.3027|±  |0.0046|
|openbookqa|Yaml   |none  |     0|acc     |0.1500|±  |0.0160|
|          |       |none  |     0|acc_norm|0.2680|±  |0.0198|
|piqa      |Yaml   |none  |     0|acc     |0.6230|±  |0.0113|
|          |       |none  |     0|acc_norm|0.6192|±  |0.0113|
|winogrande|Yaml   |none  |     0|acc     |0.5130|±  |0.0140|

Pythia-410M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.59|
|hellaswag |acc     | 0.34|
|hellaswag |acc_norm| 0.40|
|openbookqa|acc     | 0.18|
|openbookqa|acc_norm| 0.29|
|piqa      |acc     | 0.67|
|piqa      |acc_norm| 0.67|
|winogrande|acc     | 0.53|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.6089|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.3373|±  |0.0047|
|          |       |none  |     0|acc_norm|0.4057|±  |0.0049|
|openbookqa|Yaml   |none  |     0|acc     |0.1800|±  |0.0172|
|          |       |none  |     0|acc_norm|0.2940|±  |0.0204|
|piqa      |Yaml   |none  |     0|acc     |0.6692|±  |0.0110|
|          |       |none  |     0|acc_norm|0.6692|±  |0.0110|
|winogrande|Yaml   |none  |     0|acc     |0.5375|±  |0.0140|

Pythia-1.4B

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.63|
|hellaswag |acc     | 0.40|
|hellaswag |acc_norm| 0.52|
|openbookqa|acc     | 0.22|
|openbookqa|acc_norm| 0.34|
|piqa      |acc     | 0.71|
|piqa      |acc_norm| 0.71|
|winogrande|acc     | 0.57|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.6315|±  |0.0084|
|hellaswag |Yaml   |none  |     0|acc     |0.4045|±  |0.0049|
|          |       |none  |     0|acc_norm|0.5204|±  |0.0050|
|openbookqa|Yaml   |none  |     0|acc     |0.2220|±  |0.0186|
|          |       |none  |     0|acc_norm|0.3320|±  |0.0211|
|piqa      |Yaml   |none  |     0|acc     |0.7084|±  |0.0106|
|          |       |none  |     0|acc_norm|0.7095|±  |0.0106|
|winogrande|Yaml   |none  |     0|acc     |0.5738|±  |0.0139|
Andrei-Aksionov commented 5 months ago

Hey @ajtejankar, thanks for such a thorough report! It looks like there is a problem with the smaller versions of the Pythia model, though I don't know which side gets it wrong: HF or Lit-GPT 😄.

I'll take a closer look at the code in Lit-GPT vs Hugging Face Transformers. But since the larger models are the priority for this repo, I can't say when that will happen.

Or, if you want to dig in and contribute, that would be awesome.

ajtejankar commented 5 months ago

Hi @Andrei-Aksionov,

Sure, no worries. I am definitely planning to take a look. I don't think it should be too hard.

Thanks for the help!

carmocca commented 5 months ago

In https://github.com/Lightning-AI/lit-gpt/blob/main/tests/test_model.py#L18-L85 you'll find a test for the pythia model config comparing lit-gpt and huggingface.

Note that numerical differences are expected in 16-bit precision: https://github.com/Lightning-AI/lit-gpt/blob/main/tests/test_model.py#L31-L32. It would be interesting to rerun those tables using 32-bit precision.
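
Something along these lines should force full precision on both sides (a hedged sketch: I'm assuming eval/lm_eval_harness.py accepts a --precision argument like the other Lit-GPT scripts, and that the harness's hf backend accepts a dtype entry in --model_args; the exact flag names and value strings may differ):

# Lit-GPT (assumed --precision flag)
python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
    --precision 32-true \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --save_filepath pythia_160m_fp32_results.json

# lm-eval (dtype passed through --model_args)
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \
    --tasks hellaswag,openbookqa,winogrande,boolq,piqa \
    --device cuda:0 \
    --batch_size 16
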

Andrei-Aksionov commented 5 months ago

It would be interesting to rerun those tables using 32-bit precision

If I set the precision to float32 for the tests on GPU, they pass successfully. I also tried the same with the "full-size" config for the Pythia models from 14m to 1b. All tests pass with float32.

With float16, the bigger the model, the larger the percentage of non-matching tensors and the larger the max abs difference. This doesn't line up with the results obtained above, where the larger the model, the more similar the results are between lm-eval and lit-gpt.
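
As a rough illustration of the precision effect alone (not the actual test in tests/test_model.py, which compares Lit-GPT against the HF implementation with copied weights), here is a minimal sketch that measures how far float16 logits drift from float32 for the same HF checkpoint:

# Compare float16 vs float32 logits for the same Hugging Face Pythia checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt").to("cuda")

model_fp32 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32).to("cuda").eval()
model_fp16 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda").eval()

with torch.inference_mode():
    logits_fp32 = model_fp32(**inputs).logits
    logits_fp16 = model_fp16(**inputs).logits.float()

diff = (logits_fp32 - logits_fp16).abs()
print(f"max abs diff: {diff.max().item():.4f}, mean abs diff: {diff.mean().item():.6f}")

Running this for several Pythia sizes would show how much of the gap is attributable to 16-bit numerics alone.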