allenai / open-instruct

Apache License 2.0

Question about the evaluation of GSM8K #101

Closed: SihengLi99 closed this issue 5 months ago

SihengLi99 commented 6 months ago

Hi,

Thanks for this great work! I have a question about the evaluation of GSM8K. Why are only 200 randomly sampled examples used for the evaluation, as in ./scripts/eval/gsm.sh? I could not find an explanation in the paper. Is there a reason for using only 200 examples?

Thanks!

yizhongw commented 5 months ago

Yes, we used 200 examples mainly for efficiency. The GSM eval used to be slow before we incorporated vllm. We found that the results from 200 examples are quite consistent with the results from the full test set.
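
To give a rough sense of how much noise a 200-example subsample adds, here is a minimal sketch (not code from the repository). It draws a reproducible random subset and estimates the standard error of the resulting accuracy under a simple binomial model, with an optional finite population correction. The function names, the seed, and the use of 1319 as the GSM8K test set size are assumptions for illustration; the actual sampling logic in ./scripts/eval/gsm.sh may differ.

```python
import math
import random


def subsample(examples, n, seed=42):
    """Draw n examples without replacement using a fixed seed (hypothetical helper)."""
    rng = random.Random(seed)
    if n >= len(examples):
        return list(examples)
    return rng.sample(examples, n)


def accuracy_standard_error(acc, n, population=None):
    """Binomial standard error of an accuracy estimate from n sampled examples.

    If `population` is given, apply the finite population correction, which
    shrinks the error because sampling is done without replacement.
    """
    se = math.sqrt(acc * (1.0 - acc) / n)
    if population is not None and population > 1:
        se *= math.sqrt((population - n) / (population - 1))
    return se


if __name__ == "__main__":
    test_set = list(range(1319))  # stand-in for the GSM8K test examples (assumed size)
    subset = subsample(test_set, 200)
    print(len(subset))
    # Worst case is acc = 0.5, where the binomial variance is largest.
    print(round(accuracy_standard_error(0.5, 200), 4))
    print(round(accuracy_standard_error(0.5, 200, population=len(test_set)), 4))
```

In the worst case (accuracy near 50%), the uncorrected standard error with 200 examples is about 3.5 percentage points, so small differences between models on the subsample should be read with that margin in mind.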