ajtejankar opened this issue 5 months ago
I tried the Phi-2 and TinyLlama models and they had similar accuracies between the two methods. So, it seems there is something off with the Pythia evaluation.
Hello @ajtejankar
Around a month ago I pinned the version of lm-eval-harness
(we had a problem with an update that had some breaking changes):
https://github.com/Lightning-AI/lit-gpt/blob/5a8ec86a3977eabb416ee5d2a0eb600762212422/requirements-all.txt#L13
Try to run your tests again with the latest version:
git+https://github.com/EleutherAI/lm-evaluation-harness.git@master
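For example (assuming a standard pip environment; the exact command is illustrative and may need adjusting to your setup), the reinstall could look like:

```bash
# Hypothetical: reinstall lm-eval-harness from master, replacing the pinned version
pip install --force-reinstall "git+https://github.com/EleutherAI/lm-evaluation-harness.git@master"
```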
Hi @Andrei-Aksionov,
Thanks for the quick reply. The evaluation.md
tutorial requires installing lm-eval-harness
from the master branch, and since I followed it, I think all of my tests were done with the master branch. In any case, I ran the test again as per your suggestion, and the results didn't change overall.
Command
python eval/lm_eval_harness.py \
--checkpoint_dir checkpoints/EleutherAI/pythia-160m \
--eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
--batch_size 8 \
--save_filepath pythia_160m_master_branch_results.json
Results
{
"results": {
"piqa": {
"acc": 0.5875952121871599,
"acc_stderr": 0.011485407152743142,
"acc_norm": 0.6033732317736671,
"acc_norm_stderr": 0.011413778810510459
},
"winogrande": {
"acc": 0.5272296764009471,
"acc_stderr": 0.014031631629827696
},
"boolq": {
"acc": 0.43853211009174314,
"acc_stderr": 0.008678720482001875
},
"openbookqa": {
"acc": 0.176,
"acc_stderr": 0.017047852020622267,
"acc_norm": 0.256,
"acc_norm_stderr": 0.01953692357474761
},
"hellaswag": {
"acc": 0.2810197171878112,
"acc_stderr": 0.00448578446857668,
"acc_norm": 0.3042222664807807,
"acc_norm_stderr": 0.0045913698532765316
}
},
"versions": {
"piqa": 0,
"winogrande": 0,
"boolq": 1,
"openbookqa": 0,
"hellaswag": 0
},
"config": {
"model": "pythia-160m",
"batch_size": 8,
"device": "cuda:0",
"num_fewshot": 0,
"limit": null,
"bootstrap_iters": 100000,
"no_cache": true
}
}
> The evaluation.md tutorial requires installing lm-eval-harness from the master branch, and since I followed it, I think all of my tests were done with the master branch.
It was my mistake. When I pinned the version of lm-eval-harness, I forgot to update the tutorial.
Anyway, the difference is indeed noticeable. It's a bit strange that the difference shows up for only one model.
> I tried the Phi-2 and TinyLlama models and they had similar accuracies between the two methods.
Have you tried only Pythia-160m, or the whole family of Pythia models?
If not, could you also, for good measure, try to evaluate something similar in size to Phi-2 and TinyLlama? Maybe Pythia-1.4b?
I did try running Pythia-160m and Pythia-1.4b myself.
I also noticed a difference in output for the 160m version, though it's not the same as what you got (different package versions, maybe?). Everything ran with the latest code for both lm-eval and lit-gpt and the latest packages.
The commands used were the same:
# Lit-GPT
python eval/lm_eval_harness.py \
--checkpoint_dir checkpoints/EleutherAI/[model] \
--eval_tasks "[hellaswag,openbookqa,winogrande,boolq,piqa]" \
--save_filepath [model]_results.json
# lm-eval
lm_eval --model hf --model_args pretrained=EleutherAI/[model] \
--tasks hellaswag,openbookqa,winogrande,boolq,piqa \
--device cuda:0 \
--batch_size 16
Pythia-160m
Lit-GPT:
{
"results": {
"piqa": {
"acc": 0.5941240478781284,
"acc_stderr": 0.011457256809261778,
"acc_norm": 0.5930359085963003,
"acc_norm_stderr": 0.011462093919190168
},
"openbookqa": {
"acc": 0.162,
"acc_stderr": 0.016494123566423526,
"acc_norm": 0.266,
"acc_norm_stderr": 0.019780559675655493
},
"hellaswag": {
"acc": 0.28291177056363276,
"acc_stderr": 0.004494934025462341,
"acc_norm": 0.30262895837482573,
"acc_norm_stderr": 0.004584571102598111
},
"winogrande": {
"acc": 0.5406471981057617,
"acc_stderr": 0.014005973823825141
},
"boolq": {
"acc": 0.43730886850152906,
"acc_stderr": 0.008676043429497423
}
},
}
lm-eval:
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.3835 | ± | 0.0085 |
| hellaswag | Yaml | none | 0 | acc | 0.2504 | ± | 0.0043 |
| | | none | 0 | acc_norm | 0.2507 | ± | 0.0043 |
| openbookqa | Yaml | none | 0 | acc | 0.2080 | ± | 0.0182 |
| | | none | 0 | acc_norm | 0.2420 | ± | 0.0192 |
| piqa | Yaml | none | 0 | acc | 0.5359 | ± | 0.0116 |
| | | none | 0 | acc_norm | 0.5299 | ± | 0.0116 |
| winogrande | Yaml | none | 0 | acc | 0.4862 | ± | 0.0140 |
Pythia-1.4b
Lit-GPT:
{
"results": {
"openbookqa": {
"acc": 0.214,
"acc_stderr": 0.018359797502387025,
"acc_norm": 0.33,
"acc_norm_stderr": 0.021049612166134792
},
"boolq": {
"acc": 0.6376146788990825,
"acc_stderr": 0.00840730865586405
},
"hellaswag": {
"acc": 0.40400318661621193,
"acc_stderr": 0.004896952378506925,
"acc_norm": 0.5202150965943039,
"acc_norm_stderr": 0.004985701593897998
},
"piqa": {
"acc": 0.7078346028291621,
"acc_stderr": 0.010610252174513658,
"acc_norm": 0.70620239390642,
"acc_norm_stderr": 0.010627574080514818
},
"winogrande": {
"acc": 0.5659037095501184,
"acc_stderr": 0.013929882555694063
}
},
}
lm-eval:
| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.6287 | ± | 0.0085 |
| hellaswag | Yaml | none | 0 | acc | 0.4036 | ± | 0.0049 |
| | | none | 0 | acc_norm | 0.5199 | ± | 0.0050 |
| openbookqa | Yaml | none | 0 | acc | 0.2200 | ± | 0.0185 |
| | | none | 0 | acc_norm | 0.3280 | ± | 0.0210 |
| piqa | Yaml | none | 0 | acc | 0.7073 | ± | 0.0106 |
| | | none | 0 | acc_norm | 0.7116 | ± | 0.0106 |
| winogrande | Yaml | none | 0 | acc | 0.5730 | ± | 0.0139 |
Hi @Andrei-Aksionov,
I ran the 14M, 70M, 410M, and 1.4B models in addition to the 160M model, and it seems there is something wrong with models smaller than 160M. The results of the larger models are consistent between Lit-GPT and lm-eval. Detailed results are below, one Lit-GPT/lm-eval pair per model. I used exactly the same commands as above, only with different model names, and added some code for better formatting of the results.
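The formatting code isn't included here; a minimal sketch of the kind of helper that could produce the tables below (hypothetical, assuming the results JSON layout saved by eval/lm_eval_harness.py shown above) is:

```python
# Hypothetical helper: convert a saved results JSON into a compact markdown table.
import json
import sys

def to_markdown(path: str) -> str:
    with open(path) as f:
        results = json.load(f)["results"]
    rows = ["| Task | Metric | Value |", "|---|---|---|"]
    for task in sorted(results):
        for metric, value in sorted(results[task].items()):
            if metric.endswith("_stderr"):  # keep the table short
                continue
            rows.append(f"| {task} | {metric} | {value:.2f} |")
    return "\n".join(rows)

if __name__ == "__main__":
    print(to_markdown(sys.argv[1]))
```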
Pythia-14m

1. Lit-GPT

| Task | Metric | Value |
|---|---|---|
| boolq | acc | 0.38 |
| hellaswag | acc | 0.26 |
| hellaswag | acc_norm | 0.26 |
| openbookqa | acc | 0.19 |
| openbookqa | acc_norm | 0.28 |
| piqa | acc | 0.54 |
| piqa | acc_norm | 0.54 |
| winogrande | acc | 0.48 |
2. lm-eval

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.3798 | ± | 0.0085 |
| hellaswag | Yaml | none | 0 | acc | 0.2610 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2590 | ± | 0.0044 |
| openbookqa | Yaml | none | 0 | acc | 0.1320 | ± | 0.0152 |
| | | none | 0 | acc_norm | 0.2760 | ± | 0.0200 |
| piqa | Yaml | none | 0 | acc | 0.5571 | ± | 0.0116 |
| | | none | 0 | acc_norm | 0.5571 | ± | 0.0116 |
| winogrande | Yaml | none | 0 | acc | 0.5020 | ± | 0.0141 |
Pythia-70m

1. Lit-GPT

| Task | Metric | Value |
|---|---|---|
| boolq | acc | 0.41 |
| hellaswag | acc | 0.26 |
| hellaswag | acc_norm | 0.27 |
| openbookqa | acc | 0.17 |
| openbookqa | acc_norm | 0.26 |
| piqa | acc | 0.56 |
| piqa | acc_norm | 0.56 |
| winogrande | acc | 0.49 |
2. lm-eval

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.5232 | ± | 0.0087 |
| hellaswag | Yaml | none | 0 | acc | 0.2661 | ± | 0.0044 |
| | | none | 0 | acc_norm | 0.2749 | ± | 0.0045 |
| openbookqa | Yaml | none | 0 | acc | 0.1280 | ± | 0.0150 |
| | | none | 0 | acc_norm | 0.2480 | ± | 0.0193 |
| piqa | Yaml | none | 0 | acc | 0.5947 | ± | 0.0115 |
| | | none | 0 | acc_norm | 0.5909 | ± | 0.0115 |
| winogrande | Yaml | none | 0 | acc | 0.5272 | ± | 0.0140 |
Pythia-160m

1. Lit-GPT

| Task | Metric | Value |
|---|---|---|
| boolq | acc | 0.44 |
| hellaswag | acc | 0.28 |
| hellaswag | acc_norm | 0.30 |
| openbookqa | acc | 0.18 |
| openbookqa | acc_norm | 0.26 |
| piqa | acc | 0.59 |
| piqa | acc_norm | 0.60 |
| winogrande | acc | 0.53 |
2. lm-eval

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.5688 | ± | 0.0087 |
| hellaswag | Yaml | none | 0 | acc | 0.2838 | ± | 0.0045 |
| | | none | 0 | acc_norm | 0.3027 | ± | 0.0046 |
| openbookqa | Yaml | none | 0 | acc | 0.1500 | ± | 0.0160 |
| | | none | 0 | acc_norm | 0.2680 | ± | 0.0198 |
| piqa | Yaml | none | 0 | acc | 0.6230 | ± | 0.0113 |
| | | none | 0 | acc_norm | 0.6192 | ± | 0.0113 |
| winogrande | Yaml | none | 0 | acc | 0.5130 | ± | 0.0140 |
Pythia-410m

1. Lit-GPT

| Task | Metric | Value |
|---|---|---|
| boolq | acc | 0.59 |
| hellaswag | acc | 0.34 |
| hellaswag | acc_norm | 0.40 |
| openbookqa | acc | 0.18 |
| openbookqa | acc_norm | 0.29 |
| piqa | acc | 0.67 |
| piqa | acc_norm | 0.67 |
| winogrande | acc | 0.53 |
2. lm-eval

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.6089 | ± | 0.0085 |
| hellaswag | Yaml | none | 0 | acc | 0.3373 | ± | 0.0047 |
| | | none | 0 | acc_norm | 0.4057 | ± | 0.0049 |
| openbookqa | Yaml | none | 0 | acc | 0.1800 | ± | 0.0172 |
| | | none | 0 | acc_norm | 0.2940 | ± | 0.0204 |
| piqa | Yaml | none | 0 | acc | 0.6692 | ± | 0.0110 |
| | | none | 0 | acc_norm | 0.6692 | ± | 0.0110 |
| winogrande | Yaml | none | 0 | acc | 0.5375 | ± | 0.0140 |
Pythia-1.4b

1. Lit-GPT

| Task | Metric | Value |
|---|---|---|
| boolq | acc | 0.63 |
| hellaswag | acc | 0.40 |
| hellaswag | acc_norm | 0.52 |
| openbookqa | acc | 0.22 |
| openbookqa | acc_norm | 0.34 |
| piqa | acc | 0.71 |
| piqa | acc_norm | 0.71 |
| winogrande | acc | 0.57 |
2. lm-eval

| Tasks | Version | Filter | n-shot | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|---|
| boolq | Yaml | none | 0 | acc | 0.6315 | ± | 0.0084 |
| hellaswag | Yaml | none | 0 | acc | 0.4045 | ± | 0.0049 |
| | | none | 0 | acc_norm | 0.5204 | ± | 0.0050 |
| openbookqa | Yaml | none | 0 | acc | 0.2220 | ± | 0.0186 |
| | | none | 0 | acc_norm | 0.3320 | ± | 0.0211 |
| piqa | Yaml | none | 0 | acc | 0.7084 | ± | 0.0106 |
| | | none | 0 | acc_norm | 0.7095 | ± | 0.0106 |
| winogrande | Yaml | none | 0 | acc | 0.5738 | ± | 0.0139 |
Hey @ajtejankar, thanks for such a detailed report! It looks like there is a problem with the smaller versions of the Pythia model. Though I don't know which one gets it wrong: HF or Lit-GPT 😄.
I'll take a closer look at the code in Lit-GPT vs Hugging Face Transformers. But since the larger models are the priority for this repo, I cannot say when that will happen.
Or, if you want to dig in and contribute, that would be awesome.
Hi @Andrei-Aksionov,
Sure, no worries. I am definitely planning to take a look. I don't think it should be too hard.
Thanks for the help!
In https://github.com/Lightning-AI/lit-gpt/blob/main/tests/test_model.py#L18-L85 you'll find a test for the Pythia model config comparing lit-gpt and huggingface.
Note that numerical differences are expected in 16-bit precision: https://github.com/Lightning-AI/lit-gpt/blob/main/tests/test_model.py#L31-L32. It would be interesting to rerun those tables using 32-bit precision.
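For reference, those parity tests can be run directly with pytest from the lit-gpt repo root (assuming the repo's test dependencies are installed):

```bash
# Run the model-parity test file; the float16 comparisons are most meaningful on a GPU
pytest tests/test_model.py -v
```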
> It would be interesting to rerun those tables using 32-bit precision
If the precision is set to float32 for the tests on GPU, they pass successfully.
I also tried the same with the "full-size" configs for Pythia models from 14m to 1b. All tests pass with float32.
With float16, the bigger the model, the larger the percentage of non-matching tensors and the larger the max abs difference. That doesn't line up with the results obtained above, where the larger the model, the more similar the results are between lm-eval and lit-gpt.
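To make the float32 vs float16 comparison concrete, here is a rough sketch of such a check (not the repo's actual test; it assumes the HF checkpoint has already been downloaded and converted with scripts/convert_hf_checkpoint.py, and that the lit_gpt package is importable):

```python
# Rough sketch: compare the max absolute logit difference between Lit-GPT and
# Hugging Face for a Pythia checkpoint under float32 and float16.
import torch
from transformers import AutoModelForCausalLM
from lit_gpt import GPT, Config

def max_abs_logit_diff(name: str, ckpt_dir: str, dtype: torch.dtype, device: str = "cuda") -> float:
    hf_model = AutoModelForCausalLM.from_pretrained(f"EleutherAI/{name}", torch_dtype=dtype).to(device).eval()
    lit_model = GPT(Config.from_name(name)).to(device=device, dtype=dtype).eval()
    # lit_model.pth is the file produced by scripts/convert_hf_checkpoint.py
    lit_model.load_state_dict(torch.load(f"{ckpt_dir}/lit_model.pth", map_location=device))
    # Same random prompt for both models; ids stay within the GPT-NeoX vocabulary
    x = torch.randint(0, 50254, (1, 16), device=device)
    with torch.no_grad():
        return (lit_model(x) - hf_model(input_ids=x).logits).abs().max().item()

for dtype in (torch.float32, torch.float16):
    diff = max_abs_logit_diff("pythia-160m", "checkpoints/EleutherAI/pythia-160m", dtype)
    print(f"{dtype}: max abs logit diff = {diff:.6f}")
```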
Hi,
I was trying to evaluate the Pythia-160M model against some tasks in lm-eval-harness and noticed that the results produced by the code in lit-gpt/eval and the latest version of lm-eval-harness are different. Here are the outputs of the two commands.
Command
Result
Command
Result
As you can see, for some tasks like BoolQ and PIQA the results are quite different. I wonder what could cause such a big difference.
Best, Ajinkya