Closed — yogi9879 closed this issue 1 month ago
Hi!
What are the settings you're running under? If you are running 0-shot, for example, and using multiple-choice or greedy generation without random sampling, then I'd expect results to not change much if at all.
(Stderrs are calculated via bootstrapping, and so are solely dependent on the resulting metric value--they won't capture any notion of variance over prompts, and aren't obtained by running multiple times)
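To illustrate, here is a minimal sketch of bootstrapping (not lm-eval's actual implementation): the stderr is computed by resampling the per-sample scores with replacement, so identical per-sample results always yield an identical stderr, regardless of any generation seed.

```python
import random

def bootstrap_stderr(per_sample, iters=1000, seed=1234):
    """Bootstrap stderr of the mean: resample the per-sample scores
    with replacement and take the std of the resampled means."""
    rng = random.Random(seed)  # fixed internal seed -> reproducible stderr
    n = len(per_sample)
    means = [sum(rng.choices(per_sample, k=n)) / n for _ in range(iters)]
    mu = sum(means) / len(means)
    return (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5

# Identical per-sample results -> identical accuracy and stderr,
# no matter what seed was used for generation.
scores = [1, 0, 1, 1, 0, 1, 0, 1]
print(bootstrap_stderr(scores))
```

Because the resampling uses its own fixed RNG, running this twice on the same scores prints the same number, which matches what you're seeing.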
Closing since this is likely expected behavior, but please feel free to reopen with more details if not!
We are trying to check the stability of the model using different seeds, but we get the same accuracy and stderr even when the seeds differ.
lm_eval --model hf --model_args pretrained=<model_path> --tasks mmlu_abstract_algebra --batch_size <batch_size> --seed <some_xyz_number>
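The seed-insensitivity above is expected if decoding is greedy. A toy sketch (hypothetical names, plain Python rather than the harness's code) of why argmax decoding never touches the RNG while sampling does:

```python
import math
import random

logits = [2.0, 0.5, 1.0]

def greedy_pick(logits):
    # Argmax: deterministic, consumes no randomness at all.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, rng):
    # Softmax sampling: the chosen index depends on the RNG state.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

for seed in (0, 1, 2):
    rng = random.Random(seed)
    print(seed, greedy_pick(logits), sample_pick(logits, rng))
```

The greedy column is identical for every seed; only the sampled column can vary. Multiple-choice (loglikelihood) tasks like mmlu_abstract_algebra behave like the greedy case.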