Is GSM100 evaluated using 8-shot or 16-shot?

OpenLMLab / LEval

[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark

GNU General Public License v3.0


Is GSM100 evaluated using 8-shot or 16-shot? #9

Closed zhimin-z closed 10 months ago

zhimin-z commented 10 months ago

ChenxinAn-fdu commented 10 months ago

Hi zhimin! Thank you for the question. GSM100 uses 16 examples: 8 are taken from chain-of-thought-hub and the remaining 8 were written by us. We verified our 8 examples on turbo-16k, and the results show that using more examples improves turbo from 80 to 84.
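As a rough illustration of the 16-shot setup described above, the sketch below concatenates two pools of 8 demonstrations into one prompt. It is not the L-Eval code: the function name `build_few_shot_prompt`, the `Question:`/`Answer:` template, and the placeholder demonstrations are all assumptions made for this example.

```python
def build_few_shot_prompt(pool_a, pool_b, question):
    """Join two pools of (question, answer) demonstrations and append the target question.

    Hypothetical helper for illustration only; the prompt template is an assumption.
    """
    demos = list(pool_a) + list(pool_b)
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    # The target question comes last, with the answer left open for the model.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Placeholder demonstrations, NOT the real GSM100 examples.
hub_examples = [(f"hub item {i}", f"hub solution {i}") for i in range(8)]
own_examples = [(f"own item {i}", f"own solution {i}") for i in range(8)]

prompt = build_few_shot_prompt(hub_examples, own_examples, "What is 2 + 3?")
num_demos = prompt.count("Question:") - 1  # subtract the target question
print(num_demos)
```

Dropping one of the two pools in this sketch gives the 8-shot variant, which is the comparison mentioned above.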