bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Commit / Edit / Diff models & their evaluation #47

Closed: Muennighoff closed this 1 year ago

Muennighoff commented 1 year ago

See https://huggingface.co/datasets/bigcode/evaluation/tree/main for results

Regarding the HumanEval change: the prior splitting was a bug, in my opinion, since it breaks when a stop word contains | (the unescaped pipe is treated as a regex alternation):

>>> import re
>>> x = """    return True <|endoftext|> hello def kk <|>|> """
>>> re.split("(%s)" % "|".join(["<|endoftext|>"]), x)
['    return True ', '<', '|', 'endoftext', '|', '>', ' hello def kk ', '<', '|', '>', '|', '>', ' ']
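For reference, escaping the stop words before building the pattern avoids this; a minimal sketch, not necessarily the exact fix adopted in the harness:

>>> stop_words = ["<|endoftext|>"]
>>> pattern = "(%s)" % "|".join(re.escape(w) for w in stop_words)
>>> re.split(pattern, x)
['    return True ', '<|endoftext|>', ' hello def kk <|>|> ']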

Closed in favor of: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/120

Muennighoff commented 1 year ago

Looks good! Thanks for adding all these benchmarks! 🚀 When you're done, it would be great if you could write a section in the docs explaining what each benchmark does and which parameters work best for running it, especially since these benchmarks are quite different from the others.

Edit: regarding the HumanEval post-processing, you're right; removing just the last block might not be as good as keeping only the first block. The stopping criterion only halts generation once every prompt in the batch has reached a stop word, so some prompts can be left with more than one stop word (I'm making a PR to fix it); see the sketch below.
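For illustration, keeping only the first block could be done roughly like this (a sketch only; first_block and stop_words are hypothetical names, not the harness API):

import re

def first_block(text, stop_words):
    # Truncate the generation at the first occurrence of any stop word.
    # Stop words are escaped so characters like | are matched literally.
    pattern = "|".join(re.escape(w) for w in stop_words)
    return re.split(pattern, text, maxsplit=1)[0]

print(first_block("    return True <|endoftext|> hello def kk <|>|> ", ["<|endoftext|>"]))
# prints: "    return True "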

Thanks for taking a look! Looking forward to those other PRs getting merged; then we can merge them into this PR.