Thanks for this great work!
I have a question about the evaluation of GSM8K.
Why are only 200 examples randomly sampled for the evaluation, as in ./scripts/eval/gsm.sh?
I could not find an explanation in the paper.
Is there a reason why only 200 examples are used?
Yes, we used 200 examples mainly for efficiency. The GSM eval used to be slow before we incorporated vLLM. We found that the results from 200 examples are quite consistent with the results from using all of them.
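For anyone who wants to gauge how much noise a 200-example subset adds, here is a minimal sketch. It is not the code from ./scripts/eval/gsm.sh: the function names, the seed, and the binomial approximation are all assumptions made for illustration.

```python
import math
import random


def subsample(examples, n=200, seed=42):
    """Draw a reproducible random subset of n examples (hypothetical helper)."""
    examples = list(examples)
    if n >= len(examples):
        return examples
    # A dedicated Random instance keeps the draw independent of global state.
    rng = random.Random(seed)
    return rng.sample(examples, n)


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n sampled examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)
```

Under this approximation the worst case is an accuracy of 0.5, where 200 examples give a standard error of about 3.5 percentage points. Differences between models smaller than that are hard to separate from sampling noise, so running the full set is safer for close comparisons.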