EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Benchmark tests giving different results #1269

Closed TheFloatingString closed 5 months ago

TheFloatingString commented 6 months ago

The scores we get using lm_eval for an HF model today are different from the scores we got earlier this month:

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m --tasks lambada_openai --device cuda:0 --batch_size 8 --verbosity DEBUG

We now get a perplexity of NaN and an accuracy of 0.0058.

In comparison, on around January 4th, we got a perplexity of 130.9624 and an accuracy of 0.2272.

I tried rolling back to the v0.4.0 commit from December 2023 and still get a perplexity of NaN and an accuracy of 0.0058.

Are there any likely causes for this divergence in benchmark results, and are there any fixes I could help look into?

haileyschoelkopf commented 6 months ago

Hi! Do you observe divergences on models other than Pythia-70m, say on gpt2 or EleutherAI/pythia-70m-v0? Additionally, could you share your accelerate config, in particular any settings around mixed precision?

I don't know of any changes between now and Jan 4 that might cause this, though quite a while ago we switched to using dtype="auto" for HF models, meaning fp32 is no longer the default for some models.
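For illustration, here's a minimal sketch (plain transformers, not harness code) of the difference between the two loading modes; which dtype "auto" resolves to depends on the torch_dtype recorded in the checkpoint's config:

```python
# Sketch: compare the dtype picked by "auto" with an explicit fp32 load.
# This mirrors what dtype="auto" vs. dtype=float32 means when loading an HF model.
import torch
from transformers import AutoModelForCausalLM

model_auto = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", torch_dtype="auto"  # dtype taken from the checkpoint config
)
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", torch_dtype=torch.float32  # force full precision
)
print("auto resolves to:", model_auto.dtype)
print("explicit fp32:   ", model_fp32.dtype)
```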

This should be solvable for Pythia-70m by using --model_args dtype=float32. We've observed in the past that the smaller Pythia models seem to be especially sensitive to precision, occasionally producing NaNs in 16-bit precision; they also have quite large hidden-state norms, and we're still not certain of the root cause. You could try running with mixed precision (using AMP / autocast, which we don't currently support natively) and see whether you can again achieve the same or similar performance.
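If you want to try the autocast route, here's a rough sketch (not something lm-eval does natively; the prompt is just a placeholder) for checking whether a forward pass goes to NaN under 16-bit autocast:

```python
# Rough sketch of the AMP/autocast check described above: run one forward pass in fp32
# and one under torch.autocast(fp16), then compare logits and look for NaNs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-70m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32).to("cuda:0").eval()

inputs = tok("The quick brown fox jumps over the lazy", return_tensors="pt").to("cuda:0")

with torch.no_grad():
    fp32_logits = model(**inputs).logits
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        amp_logits = model(**inputs).logits

print("NaNs in fp32 logits:", torch.isnan(fp32_logits).any().item())
print("NaNs in amp logits: ", torch.isnan(amp_logits).any().item())
print("max abs difference: ", (fp32_logits - amp_logits.float()).abs().max().item())
```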

TheFloatingString commented 6 months ago

Thank you for your help!

I re-ran some preliminary evaluations and found that:

Some additional questions that we had were:

Here's a cleaned log of the new lm_eval runs:

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m --tasks lambada_openai --device cuda:0 --batch_size 8
hf (pretrained=EleutherAI/pythia-70m), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  |Value |   |Stderr|
|--------------|-------|------|-----:|----------|-----:|---|-----:|
|lambada_openai|Yaml   |none  |     0|perplexity|   NaN|±  |   NaN|
|              |       |none  |     0|acc       |0.0056|±  | 0.001|

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m,dtype=float32 --tasks lambada_openai --device cuda:0 --batch_size 8
hf (pretrained=EleutherAI/pythia-70m,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  | Value  |   |Stderr|
|--------------|-------|------|-----:|----------|-------:|---|-----:|
|lambada_openai|Yaml   |none  |     0|perplexity|130.9651|±  |5.5012|
|              |       |none  |     0|acc       |  0.2272|±  |0.0058|

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m-v0 --tasks lambada_openai --device cuda:0 --batch_size 8
hf (pretrained=EleutherAI/pythia-70m-v0), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  | Value |   |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml   |none  |     0|perplexity|97.4842|±  | 3.995|
|              |       |none  |     0|acc       | 0.2414|±  | 0.006|

lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m-v0,dtype=float32 --tasks lambada_openai --device cuda:0 --batch_size 8
hf (pretrained=EleutherAI/pythia-70m-v0,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  | Value |   |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml   |none  |     0|perplexity|91.6107|±  |3.7984|
|              |       |none  |     0|acc       | 0.2587|±  |0.0061|

lm_eval --model hf --model_args pretrained=gpt2,dtype=float32 --tasks lambada_openai --device cuda:0 --batch_size 8
hf (pretrained=gpt2,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  | Value |   |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml   |none  |     0|perplexity|40.0554|±  |1.4787|
|              |       |none  |     0|acc       | 0.3256|±  |0.0065|

lm_eval --model hf --model_args pretrained=gpt2 --tasks lambada_openai --device cuda:0 --batch_size 8
hf (pretrained=gpt2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  | Value |   |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml   |none  |     0|perplexity|40.0554|±  |1.4787|
|              |       |none  |     0|acc       | 0.3256|±  |0.0065|
haileyschoelkopf commented 6 months ago

You can run accelerate config on the command line to set up accelerate's config. If you've set it up before, it should also be visible at ~/.cache/huggingface/accelerate/default_config.yaml, if I recall correctly.
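If it's easier to check programmatically on the cloud machine, here's a quick stdlib-only sketch (the path below is just accelerate's default config location):

```python
# Print the accelerate default config, if one exists, to check the mixed-precision setting.
from pathlib import Path

config_path = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
if config_path.exists():
    print(config_path.read_text())
else:
    print("No accelerate config found; `accelerate config` has not been run on this machine.")
```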

I'd recommend using fp32 for the Pythia models consistently throughout your research paper, and noting in a footnote or appendix the discrepancies that arise when not using 32-bit precision. If you also provide a link to a codebase, you can include the exact output files, the exact lm-eval commands, and the codebase commit used when running your experiments.
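As a rough sketch of that bookkeeping via the Python API (the simple_evaluate call below reflects the v0.4.x interface and the output filename is just a placeholder; double-check both against your setup):

```python
# Hypothetical reproducibility sketch: run the evaluation from Python and save the scores
# together with the installed lm_eval version and the exact model_args used.
import json
from importlib.metadata import version

import lm_eval

model_args = "pretrained=EleutherAI/pythia-70m,dtype=float32"
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=model_args,
    tasks=["lambada_openai"],
    batch_size=8,
    device="cuda:0",
)

record = {
    "lm_eval_version": version("lm_eval"),  # distribution name may differ for source installs
    "model_args": model_args,
    "results": results["results"],
}
with open("lambada_pythia70m_fp32.json", "w") as f:
    json.dump(record, f, indent=2)
```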

TheFloatingString commented 6 months ago

Sounds good, thanks so much for the info!

We ran some of our lm_eval experiments on a cloud-based GPU, so we'll have to do some searching to figure out where the default_config.yaml is. I don't believe we ran accelerate config for our previous runs, so we'll also look into that command to standardize the parameters across our tests.

Our group is planning to re-run our lm_eval test in the next couple of days; would it be possible to keep this issue open for just a bit longer, in case we observe any other discrepancies in our results?

haileyschoelkopf commented 6 months ago

Yup, feel free to ping if you continue to observe any weirdness.

haileyschoelkopf commented 5 months ago

Closing for ease of tracking open GitHub issues, but please feel free to ping me on Discord if you encounter any extra issues or inconsistencies!