Closed: xh-yuan closed this issue 1 month ago
It is somewhat hard for me to tell what the problem is, as evaluation is sensitive to a variety of factors (e.g., the vLLM version, the CUDA version, and the generation config). I have attached the UltraEval version I used (ultraeval-07f99f7e.zip). The evaluation command is:
pip install .; python data_process.py; bash scripts/run_paper.sh --model_size 7b --port
I also notice that max_new_tokens is set to 10 in your generation config, which might cause problems.
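As a hedged illustration only (the actual config is in the attached zip and is not shown here; the field names below assume a HuggingFace-style generation config, which may differ from yours), a very small max_new_tokens can truncate the model's answer on MMLU, while a larger budget avoids that:

```yaml
# Hypothetical generation config sketch; field names follow a
# HuggingFace-style generation_config.json and are assumptions,
# not the attached file's contents.
do_sample: false        # greedy decoding for deterministic evaluation
temperature: 0.0
max_new_tokens: 256     # 10 can cut the answer off before the choice letter;
                        # leave enough room for the full response
```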
Thanks for the code! The result has been reproduced.
I followed the description for prosparse-7B and tested accuracy on MMLU with UltraEval. The MMLU average accuracy I get is 41.69, but the paper reports 45.21.
Here is one sample eval configuration:
generation_config:
prosparse-7B configuration: