Closed Muennighoff closed 1 year ago
Looks good! Thanks for adding all these benchmarks! 🚀 When you're done, it would be great if you could write a section in the docs explaining what each benchmark does and which parameters work best for running them, especially since they're quite different from the other benchmarks.
Edit: regarding the HumanEval post-processing, you're right, and removing just the last block may not work as well as keeping only the first block. The stopping criteria only halts generation once all prompts in the batch have reached a stop word, so some prompts can end up with more than one stop word in their output (I'm making a PR to fix it).
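A minimal sketch of the "keep the first block only" idea: truncate each completion at the earliest occurrence of any stop word, which handles the batched-stopping case where several stop words leak into one output. The stop-word list here is illustrative, not the harness's actual configuration.

```python
# Illustrative stop words; the real harness defines its own per-task list.
STOP_WORDS = ["\nclass", "\ndef", "\n#", "\nprint"]

def first_block(completion: str, stop_words=STOP_WORDS) -> str:
    """Keep only the text before the earliest stop-word occurrence.

    Because batched generation can leave more than one stop word in a
    completion, we truncate at the minimum index over all stop words.
    """
    cut = len(completion)
    for sw in stop_words:
        idx = completion.find(sw)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```

With two trailing stop words, e.g. `"    return x\ndef foo():\nclass Bar:"`, this keeps just `"    return x"` instead of everything up to the last block.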
Thanks for taking a look! Looking forward to those other PRs getting merged, then we can merge them into this PR.
See https://huggingface.co/datasets/bigcode/evaluation/tree/main for results
Regarding the HumanEval change: the prior splitting was a bug imo, since it does not work when the input contains a `|` character.
Closed in favor of: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/120