Lightning-AI / litgpt

Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
https://lightning.ai
Apache License 2.0

Difference between latest lm-eval-harness and lit-gpt eval #848

Open ajtejankar opened 5 months ago

ajtejankar commented 5 months ago

Hi,

I was trying to evaluate the Pythia-160M model against some tasks in lm-eval-harness and noticed that the results produced by the code in lit-gpt/eval and the latest version of lm-eval-harness are different. Here are the outputs of the two commands.

Command

python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --save_filepath pythia_160m_results.json

Result

{
    "results": {
        "winogrande": {
            "acc": 0.5185477505919495,
            "acc_stderr": 0.014042813708888378
        },
        "boolq": {
            "acc": 0.43700305810397555,
            "acc_stderr": 0.008675365793227082
        },
        "openbookqa": {
            "acc": 0.152,
            "acc_stderr": 0.01607198236791175,
            "acc_norm": 0.248,
            "acc_norm_stderr": 0.019332342821239103
        },
        "hellaswag": {
            "acc": 0.28141804421429994,
            "acc_stderr": 0.0044877188433302805,
            "acc_norm": 0.3053176658036248,
            "acc_norm_stderr": 0.004596006250433537
        },
        "piqa": {
            "acc": 0.5979325353645266,
            "acc_stderr": 0.011439867127267531,
            "acc_norm": 0.5908596300326442,
            "acc_norm_stderr": 0.011471593460443312
        }
    },
    "versions": {
        "winogrande": 0,
        "boolq": 1,
        "openbookqa": 0,
        "hellaswag": 0,
        "piqa": 0
    },
    "config": {
        "model": "pythia-160m",
        "batch_size": 16,
        "device": "cuda:0",
        "num_fewshot": 0,
        "limit": null,
        "bootstrap_iters": 100000,
        "no_cache": true
    }
}

Command

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag,openbookqa,winogrande,boolq,piqa \
    --device cuda:0 \
    --batch_size 16

Result

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.5688|±  |0.0087|
|hellaswag |Yaml   |none  |     0|acc     |0.2838|±  |0.0045|
|          |       |none  |     0|acc_norm|0.3027|±  |0.0046|
|openbookqa|Yaml   |none  |     0|acc     |0.1500|±  |0.0160|
|          |       |none  |     0|acc_norm|0.2680|±  |0.0198|
|piqa      |Yaml   |none  |     0|acc     |0.6230|±  |0.0113|
|          |       |none  |     0|acc_norm|0.6192|±  |0.0113|
|winogrande|Yaml   |none  |     0|acc     |0.5130|±  |0.0140|

As you can see, for some tasks like BoolQ and PIQA the results are quite different. I wonder what could cause such a big difference.

Best, Ajinkya

ajtejankar commented 5 months ago

I tried the Phi-2 and TinyLlama models and they had similar accuracies between the two methods, so it seems there is something off with the Pythia evaluation.

Andrei-Aksionov commented 5 months ago

Hello @ajtejankar. Around a month ago I pinned the version of lm-eval-harness (we had a problem with an update that introduced some breaking changes): https://github.com/Lightning-AI/lit-gpt/blob/5a8ec86a3977eabb416ee5d2a0eb600762212422/requirements-all.txt#L13

Try to run your tests again with the latest version:

git+https://github.com/EleutherAI/lm-evaluation-harness.git@master
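
For example, installing that revision directly (a minimal example, assuming a plain pip environment; this is just the requirements line above turned into an install command):

pip install "git+https://github.com/EleutherAI/lm-evaluation-harness.git@master"
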
ajtejankar commented 5 months ago

Hi @Andrei-Aksionov,

Thanks for the quick reply. The evaluation.md tutorial requires installing lm-eval-harness from the master branch, and since I followed it, I think all of my tests were done with the master branch. In any case, I ran the test again as per your suggestion, and the results didn't change overall.

Command

python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --batch_size 8 \
    --save_filepath pythia_160m_master_branch_results.json

Results

{
    "results": {
        "piqa": {
            "acc": 0.5875952121871599,
            "acc_stderr": 0.011485407152743142,
            "acc_norm": 0.6033732317736671,
            "acc_norm_stderr": 0.011413778810510459
        },
        "winogrande": {
            "acc": 0.5272296764009471,
            "acc_stderr": 0.014031631629827696
        },
        "boolq": {
            "acc": 0.43853211009174314,
            "acc_stderr": 0.008678720482001875
        },
        "openbookqa": {
            "acc": 0.176,
            "acc_stderr": 0.017047852020622267,
            "acc_norm": 0.256,
            "acc_norm_stderr": 0.01953692357474761
        },
        "hellaswag": {
            "acc": 0.2810197171878112,
            "acc_stderr": 0.00448578446857668,
            "acc_norm": 0.3042222664807807,
            "acc_norm_stderr": 0.0045913698532765316
        }
    },
    "versions": {
        "piqa": 0,
        "winogrande": 0,
        "boolq": 1,
        "openbookqa": 0,
        "hellaswag": 0
    },
    "config": {
        "model": "pythia-160m",
        "batch_size": 8,
        "device": "cuda:0",
        "num_fewshot": 0,
        "limit": null,
        "bootstrap_iters": 100000,
        "no_cache": true
    }
}
Andrei-Aksionov commented 5 months ago

The evaluation.md tutorial requires installing lm-eval-harness from the master branch, and since I followed it, I think all of my tests were done with the master branch.

It was my mistake: when I pinned the version of lm-eval-harness, I forgot to update the tutorial.

Anyway, the difference is indeed noticeable. It's a bit strange that it only shows up for one model.

I tried the Phi-2 and TinyLlama models and they had similar accuracies between the two methods.

Have you tried only Pythia-160m, or the whole family of Pythia models? If not, could you also, for good measure, try to evaluate something similar in size to Phi-2 and TinyLlama? Maybe Pythia-1.4b?

Andrei-Aksionov commented 5 months ago

I tried running Pythia-160m and Pythia-1.4b myself and also noticed a difference in the output for the 160m version, though it's not the same as what you got (different package versions, maybe?). Everything was run with the latest code for both lm-eval and lit-gpt and the latest packages.

I used the same commands:

# Lit-GPT
python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/[model] \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --save_filepath [model]_results.json

# lm-eval
lm_eval --model hf --model_args pretrained=EleutherAI/[model] \
    --tasks hellaswag,openbookqa,winogrande,boolq,piqa \
    --device cuda:0 \
    --batch_size 16

1. Pythia-160m.

Lit-GPT:

{
    "results": {
        "piqa": {
            "acc": 0.5941240478781284,
            "acc_stderr": 0.011457256809261778,
            "acc_norm": 0.5930359085963003,
            "acc_norm_stderr": 0.011462093919190168
        },
        "openbookqa": {
            "acc": 0.162,
            "acc_stderr": 0.016494123566423526,
            "acc_norm": 0.266,
            "acc_norm_stderr": 0.019780559675655493
        },
        "hellaswag": {
            "acc": 0.28291177056363276,
            "acc_stderr": 0.004494934025462341,
            "acc_norm": 0.30262895837482573,
            "acc_norm_stderr": 0.004584571102598111
        },
        "winogrande": {
            "acc": 0.5406471981057617,
            "acc_stderr": 0.014005973823825141
        },
        "boolq": {
            "acc": 0.43730886850152906,
            "acc_stderr": 0.008676043429497423
        }
    },
}

lm-eval:

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.3835|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.2504|±  |0.0043|
|          |       |none  |     0|acc_norm|0.2507|±  |0.0043|
|openbookqa|Yaml   |none  |     0|acc     |0.2080|±  |0.0182|
|          |       |none  |     0|acc_norm|0.2420|±  |0.0192|
|piqa      |Yaml   |none  |     0|acc     |0.5359|±  |0.0116|
|          |       |none  |     0|acc_norm|0.5299|±  |0.0116|
|winogrande|Yaml   |none  |     0|acc     |0.4862|±  |0.0140|

2. Pythia-1.4b.

Lit-GPT:

{
    "results": {
        "openbookqa": {
            "acc": 0.214,
            "acc_stderr": 0.018359797502387025,
            "acc_norm": 0.33,
            "acc_norm_stderr": 0.021049612166134792
        },
        "boolq": {
            "acc": 0.6376146788990825,
            "acc_stderr": 0.00840730865586405
        },
        "hellaswag": {
            "acc": 0.40400318661621193,
            "acc_stderr": 0.004896952378506925,
            "acc_norm": 0.5202150965943039,
            "acc_norm_stderr": 0.004985701593897998
        },
        "piqa": {
            "acc": 0.7078346028291621,
            "acc_stderr": 0.010610252174513658,
            "acc_norm": 0.70620239390642,
            "acc_norm_stderr": 0.010627574080514818
        },
        "winogrande": {
            "acc": 0.5659037095501184,
            "acc_stderr": 0.013929882555694063
        }
    },
}

lm-eval:

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.6287|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.4036|±  |0.0049|
|          |       |none  |     0|acc_norm|0.5199|±  |0.0050|
|openbookqa|Yaml   |none  |     0|acc     |0.2200|±  |0.0185|
|          |       |none  |     0|acc_norm|0.3280|±  |0.0210|
|piqa      |Yaml   |none  |     0|acc     |0.7073|±  |0.0106|
|          |       |none  |     0|acc_norm|0.7116|±  |0.0106|
|winogrande|Yaml   |none  |     0|acc     |0.5730|±  |0.0139|
ajtejankar commented 5 months ago

Hi @Andrei-Aksionov,

I ran the 14M, 70M, 410M, and 1.4B models in addition to the 160M model, and it seems there is something wrong with the smaller models (160M and below): the results of the larger models are consistent between Lit-GPT and lm-eval. Detailed results are below. I used exactly the same commands as above, just with different model names, and added some code for better formatting of the results.
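
The formatting code is roughly along these lines (a minimal sketch, not the exact script I used; the file name matches the --save_filepath argument from the commands above, and only the acc/acc_norm values are kept, rounded to two decimals):

# Condense the saved lm_eval_harness JSON into a compact Task / Metric / Value table.
import json

with open("pythia_160m_results.json") as f:  # path passed via --save_filepath
    results = json.load(f)["results"]

print(f"{'Task':<12}{'Metric':<10}{'Value':>6}")
for task in sorted(results):
    for metric, value in results[task].items():
        if metric.endswith("_stderr"):
            continue  # keep only acc / acc_norm, drop the stderr entries
        print(f"{task:<12}{metric:<10}{value:>6.2f}")
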

Pythia-14M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.38|
|hellaswag |acc     | 0.26|
|hellaswag |acc_norm| 0.26|
|openbookqa|acc     | 0.19|
|openbookqa|acc_norm| 0.28|
|piqa      |acc     | 0.54|
|piqa      |acc_norm| 0.54|
|winogrande|acc     | 0.48|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.3798|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.2610|±  |0.0044|
|          |       |none  |     0|acc_norm|0.2590|±  |0.0044|
|openbookqa|Yaml   |none  |     0|acc     |0.1320|±  |0.0152|
|          |       |none  |     0|acc_norm|0.2760|±  |0.0200|
|piqa      |Yaml   |none  |     0|acc     |0.5571|±  |0.0116|
|          |       |none  |     0|acc_norm|0.5571|±  |0.0116|
|winogrande|Yaml   |none  |     0|acc     |0.5020|±  |0.0141|

Pythia-70M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.41|
|hellaswag |acc     | 0.26|
|hellaswag |acc_norm| 0.27|
|openbookqa|acc     | 0.17|
|openbookqa|acc_norm| 0.26|
|piqa      |acc     | 0.56|
|piqa      |acc_norm| 0.56|
|winogrande|acc     | 0.49|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.5232|±  |0.0087|
|hellaswag |Yaml   |none  |     0|acc     |0.2661|±  |0.0044|
|          |       |none  |     0|acc_norm|0.2749|±  |0.0045|
|openbookqa|Yaml   |none  |     0|acc     |0.1280|±  |0.0150|
|          |       |none  |     0|acc_norm|0.2480|±  |0.0193|
|piqa      |Yaml   |none  |     0|acc     |0.5947|±  |0.0115|
|          |       |none  |     0|acc_norm|0.5909|±  |0.0115|
|winogrande|Yaml   |none  |     0|acc     |0.5272|±  |0.0140|

Pythia-160M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.44|
|hellaswag |acc     | 0.28|
|hellaswag |acc_norm| 0.30|
|openbookqa|acc     | 0.18|
|openbookqa|acc_norm| 0.26|
|piqa      |acc     | 0.59|
|piqa      |acc_norm| 0.60|
|winogrande|acc     | 0.53|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.5688|±  |0.0087|
|hellaswag |Yaml   |none  |     0|acc     |0.2838|±  |0.0045|
|          |       |none  |     0|acc_norm|0.3027|±  |0.0046|
|openbookqa|Yaml   |none  |     0|acc     |0.1500|±  |0.0160|
|          |       |none  |     0|acc_norm|0.2680|±  |0.0198|
|piqa      |Yaml   |none  |     0|acc     |0.6230|±  |0.0113|
|          |       |none  |     0|acc_norm|0.6192|±  |0.0113|
|winogrande|Yaml   |none  |     0|acc     |0.5130|±  |0.0140|

Pythia-410M

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.59|
|hellaswag |acc     | 0.34|
|hellaswag |acc_norm| 0.40|
|openbookqa|acc     | 0.18|
|openbookqa|acc_norm| 0.29|
|piqa      |acc     | 0.67|
|piqa      |acc_norm| 0.67|
|winogrande|acc     | 0.53|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.6089|±  |0.0085|
|hellaswag |Yaml   |none  |     0|acc     |0.3373|±  |0.0047|
|          |       |none  |     0|acc_norm|0.4057|±  |0.0049|
|openbookqa|Yaml   |none  |     0|acc     |0.1800|±  |0.0172|
|          |       |none  |     0|acc_norm|0.2940|±  |0.0204|
|piqa      |Yaml   |none  |     0|acc     |0.6692|±  |0.0110|
|          |       |none  |     0|acc_norm|0.6692|±  |0.0110|
|winogrande|Yaml   |none  |     0|acc     |0.5375|±  |0.0140|

Pythia-1.4B

1. Lit-GPT

|Task      |Metric  |Value|
|----------|--------|----:|
|boolq     |acc     | 0.63|
|hellaswag |acc     | 0.40|
|hellaswag |acc_norm| 0.52|
|openbookqa|acc     | 0.22|
|openbookqa|acc_norm| 0.34|
|piqa      |acc     | 0.71|
|piqa      |acc_norm| 0.71|
|winogrande|acc     | 0.57|

2. lm-eval

|  Tasks   |Version|Filter|n-shot| Metric |Value |   |Stderr|
|----------|-------|------|-----:|--------|-----:|---|-----:|
|boolq     |Yaml   |none  |     0|acc     |0.6315|±  |0.0084|
|hellaswag |Yaml   |none  |     0|acc     |0.4045|±  |0.0049|
|          |       |none  |     0|acc_norm|0.5204|±  |0.0050|
|openbookqa|Yaml   |none  |     0|acc     |0.2220|±  |0.0186|
|          |       |none  |     0|acc_norm|0.3320|±  |0.0211|
|piqa      |Yaml   |none  |     0|acc     |0.7084|±  |0.0106|
|          |       |none  |     0|acc_norm|0.7095|±  |0.0106|
|winogrande|Yaml   |none  |     0|acc     |0.5738|±  |0.0139|
Andrei-Aksionov commented 5 months ago

Hey @ajtejankar, thanks for such a thorough report! It looks like there is a problem with the smaller versions of the Pythia model, though I don't know which side gets it wrong: HF or Lit-GPT 😄.

I'll take a closer look at the code in Lit-GPT vs Hugging Face Transformers. But since the larger models are the priority for this repo, I can't say when that will happen.

Or, if you want to dig in and contribute, that would be awesome.

ajtejankar commented 5 months ago

Hi @Andrei-Aksionov,

Sure, no worries. I am definitely planning to take a look. I don't think it should be too hard.

Thanks for the help!

carmocca commented 5 months ago

In https://github.com/Lightning-AI/lit-gpt/blob/main/tests/test_model.py#L18-L85 you'll find a test for the pythia model config comparing lit-gpt and huggingface.

Note that numerical differences are expected in 16-bit precision: https://github.com/Lightning-AI/lit-gpt/blob/main/tests/test_model.py#L31-L32. It would be interesting to rerun those tables using 32-bit precision.
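
Something along these lines should force full precision on both sides (a hedged sketch: I'm assuming eval/lm_eval_harness.py accepts a --precision argument like the other Lit-GPT scripts, and that the harness's hf backend accepts a dtype entry in --model_args; the exact flag names and value strings may differ):

# Lit-GPT (assumed --precision flag)
python eval/lm_eval_harness.py \
    --checkpoint_dir checkpoints/EleutherAI/pythia-160m \
    --precision 32-true \
    --eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
    --save_filepath pythia_160m_fp32_results.json

# lm-eval (dtype passed through --model_args)
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \
    --tasks hellaswag,openbookqa,winogrande,boolq,piqa \
    --device cuda:0 \
    --batch_size 16
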

Andrei-Aksionov commented 5 months ago

It would be interesting to rerun those tables using 32-bit precision

If I set the precision to float32 for the tests on GPU, they pass successfully. I also tried the same with the "full-size" config for the Pythia models from 14m to 1b. All tests pass with float32.

With float16, the bigger the model, the larger the percentage of non-matching tensors and the larger the max abs difference. This doesn't line up with the results obtained above, where the larger the model, the more similar the results are between lm-eval and lit-gpt.
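
As a rough illustration of the precision effect alone (not the actual test in tests/test_model.py, which compares Lit-GPT against the HF implementation with copied weights), here is a minimal sketch that measures how far float16 logits drift from float32 for the same HF checkpoint:

# Compare float16 vs float32 logits for the same Hugging Face Pythia checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt").to("cuda")

model_fp32 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32).to("cuda").eval()
model_fp16 = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda").eval()

with torch.inference_mode():
    logits_fp32 = model_fp32(**inputs).logits
    logits_fp16 = model_fp16(**inputs).logits.float()

diff = (logits_fp32 - logits_fp16).abs()
print(f"max abs diff: {diff.max().item():.4f}, mean abs diff: {diff.mean().item():.6f}")

Running this for several Pythia sizes would show how much of the gap is attributable to 16-bit numerics alone.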