bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Commit / Edit / Diff models & their evaluation #47

Closed: Muennighoff closed this 1 year ago

Muennighoff commented 1 year ago

See https://huggingface.co/datasets/bigcode/evaluation/tree/main for results

Regarding the HumanEval change: the prior splitting was a bug, in my opinion, since it breaks when a stop word contains | (the unescaped pipe is treated as a regex alternation):

>>> import re
>>> x = """    return True <|endoftext|> hello def kk <|>|> """
>>> re.split("(%s)" % "|".join(["<|endoftext|>"]), x)
['    return True ', '<', '|', 'endoftext', '|', '>', ' hello def kk ', '<', '|', '>', '|', '>', ' ']
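For reference, escaping the stop words before building the pattern avoids this; a minimal sketch, not necessarily the exact fix adopted in the harness:

>>> stop_words = ["<|endoftext|>"]
>>> pattern = "(%s)" % "|".join(re.escape(w) for w in stop_words)
>>> re.split(pattern, x)
['    return True ', '<|endoftext|>', ' hello def kk <|>|> ']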

Closed in favor of: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/120

Muennighoff commented 1 year ago

Looks good! Thanks for adding all these benchmarks! 🚀 When you're done, it would be great if you could write a section in the docs explaining what each benchmark does and which parameters work best for running it, especially since these benchmarks are quite different from the others.

Edit: regarding the HumanEval post-processing, you're right; removing just the last block might not be as good as keeping only the first block. The stopping criterion only halts generation once every prompt in the batch has reached a stop word, so some prompts can be left with more than one stop word (I'm making a PR to fix it); see the sketch below.
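For illustration, keeping only the first block could be done roughly like this (a sketch only; first_block and stop_words are hypothetical names, not the harness API):

import re

def first_block(text, stop_words):
    # Truncate the generation at the first occurrence of any stop word.
    # Stop words are escaped so characters like | are matched literally.
    pattern = "|".join(re.escape(w) for w in stop_words)
    return re.split(pattern, text, maxsplit=1)[0]

print(first_block("    return True <|endoftext|> hello def kk <|>|> ", ["<|endoftext|>"]))
# prints: "    return True "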

Thanks for taking a look! Looking forward to those other PRs getting merged; then we can merge them into this PR.