EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Same accuracy and std with different seeds #2089

Closed: yogi9879 closed this issue 1 month ago

yogi9879 commented 1 month ago

We are trying to check the stability of the model using different seeds, but we get the same accuracy and std even when the seeds differ:

```
lm_eval --model hf --model_args pretrained=<model_path> --tasks mmlu_abstract_algebra --batch_size <batch_size> --seed <some_xyz_number>
```

haileyschoelkopf commented 1 month ago

Hi!

What settings are you running under? If you are running 0-shot, for example, and using multiple-choice scoring or greedy generation without random sampling, then I'd expect the results to change little, if at all, across seeds.
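One way to see seed-dependent variance, assuming a harness version where `--seed` also seeds few-shot example sampling, is to enable few-shot prompting so that the sampled in-context examples change between runs. A rough sketch (the `--num_fewshot` value and batch size here are illustrative, not recommendations):

```bash
# Illustrative only: with few-shot sampling enabled, different seeds can draw
# different in-context examples, so accuracy may vary between runs.
for s in 0 1 2; do
  lm_eval --model hf \
    --model_args pretrained=<model_path> \
    --tasks mmlu_abstract_algebra \
    --num_fewshot 5 \
    --batch_size 8 \
    --seed "$s"
done
```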

(Stderrs are calculated via bootstrapping, and so depend solely on the per-example metric values from a single run; they won't capture any notion of variance over prompts, and aren't obtained by running the evaluation multiple times.)
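For reference, a minimal sketch of the bootstrap idea (not the harness's actual implementation): resample the per-example scores from a single run with replacement, and take the standard deviation of the resampled means. Because the input is the fixed set of scores from one run, the estimate cannot reflect run-to-run variance:

```python
import random
import statistics

def bootstrap_stderr(per_example_scores, iters=1000, seed=1234):
    # Resample the observed scores with replacement `iters` times and
    # report the standard deviation of the resampled means.
    rng = random.Random(seed)
    n = len(per_example_scores)
    means = [
        statistics.fmean(rng.choices(per_example_scores, k=n))
        for _ in range(iters)
    ]
    return statistics.stdev(means)

# e.g. 100 binary accuracy outcomes from a single evaluation run
scores = [1, 0, 1, 1, 0] * 20
print(bootstrap_stderr(scores))  # stderr of the accuracy estimate
```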

haileyschoelkopf commented 1 month ago

Closing since this is likely expected behavior, but please feel free to reopen with more details if not!