Closed: TheFloatingString closed this issue 5 months ago
Hi! Do you observe divergences on models other than Pythia-70m, say on `gpt2` or `EleutherAI/pythia-70m-v0`? Additionally, could you share your config from `accelerate`, in particular any settings around mixed precision?
I don't know of any changes between now and Jan 4 that might cause this, though quite a while ago we switched to using `dtype="auto"` for HF models, meaning fp32 is no longer the default for some models. This should be solvable for Pythia-70m by passing `--model_args dtype=float32`. We've observed in the past that the smaller Pythia models seem to be especially sensitive to precision, occasionally producing NaNs in 16-bit precision; they also have quite large hidden-state norms, and we're still not certain of the root cause. You could try running with mixed precision (using AMP / autocast, which we don't currently support natively) and seeing whether you can again achieve the same or similar performance.
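The precision sensitivity mentioned above can be illustrated with a toy example. This is a hedged sketch, not lm-eval code: it only shows how values on the scale of large hidden-state norms overflow float16 (whose maximum is about 65504), and how an overflowed intermediate can then turn into a NaN.

```python
import numpy as np

# Two "activations" that are individually representable in float16...
h = np.array([300.0, 400.0], dtype=np.float16)

# ...but whose squares (90000, 160000) exceed float16's max (~65504),
# so the elementwise product overflows to inf.
sq_norm16 = (h * h).sum()                      # inf in float16

# The same computation is exact in float32.
sq_norm32 = (h.astype(np.float32) ** 2).sum()  # 250000.0

# Downstream arithmetic on an overflowed value produces NaN,
# e.g. inf - inf (as can happen inside normalization or softmax).
nan_result = sq_norm16 - sq_norm16
```

This is only an analogy for why 16-bit evaluation of models with large hidden-state norms can yield NaN perplexities while fp32 stays finite.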
Thank you for your help!
I re-ran some preliminary results, and found that:

- `EleutherAI/pythia-70m`: specifying `dtype=float32` allows us to reproduce our `lambada_openai` scores up to the second decimal point from our Jan. 4th tests
- `EleutherAI/pythia-70m-v0`: specifying `dtype=float32` slightly improves model performance, relative to not specifying the dtype
- `gpt2`: specifying `dtype=float32` does not affect the resulting scores

Some additional questions that we had were:

- Where can we find our `accelerate` config?
- Should we specify `dtype=float32`, or are there other configurations we should follow when presenting model results for research publications?

Here's a cleaned log of the new `lm_eval` runs:
```
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m --tasks lambada_openai --device cuda:0 --batch_size 8
```
hf (pretrained=EleutherAI/pythia-70m), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|--------------|-------|------|-----:|----------|-----:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity| NaN|± | NaN|
| | |none | 0|acc |0.0056|± | 0.001|
```
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m,dtype=float32 --tasks lambada_openai --device cuda:0 --batch_size 8
```
hf (pretrained=EleutherAI/pythia-70m,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|-------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|130.9651|± |5.5012|
| | |none | 0|acc | 0.2272|± |0.0058|
```
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m-v0 --tasks lambada_openai --device cuda:0 --batch_size 8
```
hf (pretrained=EleutherAI/pythia-70m-v0), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|97.4842|± | 3.995|
| | |none | 0|acc | 0.2414|± | 0.006|
```
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m-v0,dtype=float32 --tasks lambada_openai --device cuda:0 --batch_size 8
```
hf (pretrained=EleutherAI/pythia-70m-v0,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|91.6107|± |3.7984|
| | |none | 0|acc | 0.2587|± |0.0061|
```
lm_eval --model hf --model_args pretrained=gpt2,dtype=float32 --tasks lambada_openai --device cuda:0 --batch_size 8
```
hf (pretrained=gpt2,dtype=float32), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|40.0554|± |1.4787|
| | |none | 0|acc | 0.3256|± |0.0065|
```
lm_eval --model hf --model_args pretrained=gpt2 --tasks lambada_openai --device cuda:0 --batch_size 8
```
hf (pretrained=gpt2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|-------|------|-----:|----------|------:|---|-----:|
|lambada_openai|Yaml |none | 0|perplexity|40.0554|± |1.4787|
| | |none | 0|acc | 0.3256|± |0.0065|
You can run `accelerate config` in the command line to set up accelerate's config. It should also be visible at `~/.cache/huggingface/accelerate/default_config.yaml` if you've set it up before, if I recall correctly.
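As a quick check, the path above can be inspected programmatically. A minimal sketch, assuming the default cache location (it can differ if `HF_HOME` or `XDG_CACHE_HOME` is customized):

```python
from pathlib import Path

# Default location of accelerate's saved config, as noted above.
config_path = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"

if config_path.exists():
    # Print the config so it can be pasted into a bug report or paper appendix.
    print(config_path.read_text())
else:
    print(f"No accelerate config found at {config_path}; run `accelerate config` to create one.")
```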
I'd recommend using fp32 for the Pythia models consistently throughout your research paper, and noting in a footnote or appendix the discrepancies that arise when not using 32-bit precision. If you also provide a link to a codebase, you can include the exact output files, the exact lm-eval commands run, and the codebase commit used for your experiments.
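One lightweight way to follow this suggestion is to save the exact command, code commit, and scores next to each run. A hedged sketch; `record_run` and the output file name are illustrative helpers, not part of lm-eval:

```python
import json
import subprocess

def record_run(command, results, out_path="run_record.json"):
    """Write a small JSON record of one evaluation run for reproducibility."""
    try:
        # Record the current git commit of the codebase, if available.
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip() or "unknown"
    except OSError:
        commit = "unknown"
    record = {"command": command, "commit": commit, "results": results}
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```

Checking such a record into the paper's artifact repository makes it easy to tie each reported number back to a specific command and commit.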
Sounds good, thanks so much for the info!
We ran some of our `lm_eval` results on a cloud-based GPU, so we'll have to do some searching to figure out where the `default_config.yaml` is. In previous runs, I don't believe we ran `accelerate config`, so we'll also look into that command to standardize the params for the tests we ran.

Our group is planning to re-run our `lm_eval` tests in the next couple of days; would it be possible to keep this issue open for just a bit longer, in case we observe any other discrepancies in our results?
Yup, feel free to ping if you continue to observe any weirdness.
Closing for ease of tracking open GitHub issues, but please feel free to ping me on Discord if you encounter any further issues or inconsistencies!
The scores we get using `lm_eval` for an HF model today are different from the scores we got earlier this month:

```
lm_eval --model hf --model_args pretrained=EleutherAI/pythia-70m --tasks lambada_openai --device cuda:0 --batch_size 8 --verbosity DEBUG
```

We now get a perplexity of `NaN` and an accuracy of `0.0058`.

In comparison, on around January 4th, we got a perplexity of `130.9624` and an accuracy of `0.2272`.

I tried rolling back to the `v0.4.0` commit from December 2023, and still get a perplexity of `NaN` and an accuracy of `0.0058`.

Are there any probable causes behind this divergence in benchmarks, and would there be any fixes that I could help look into?