bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

HumanEval post-processing #46

Closed RaymondLi0 closed 1 year ago

RaymondLi0 commented 1 year ago

For the HumanEval task, we remove the last block, based on the stop tokens: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/tasks/humaneval.py#L70

If no stop token is found in the generation (for example, if by chance the generation ends exactly at the function's last return statement, or before it), then remove_last_block removes the entire generation and returns an empty string.
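For illustration, here is a minimal sketch of that failure mode. The stop-token list and the function name are hypothetical simplifications; the actual implementation is at the linked line in humaneval.py.

```python
import re

# Illustrative stop tokens only; the real list is defined in humaneval.py.
STOP_TOKENS = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def remove_last_block_sketch(text):
    """Simplified sketch: split on the stop tokens and drop the last block."""
    parts = re.split("(%s)" % "|".join(STOP_TOKENS), text)
    # Dropping the last two entries removes the final stop token and everything
    # after it; if no stop token matched, there is only one entry and the
    # whole generation is dropped.
    return "".join(parts[:-2])

# A generation that happens to end exactly at the function's return statement:
completion = "    return sorted(numbers)\n"
print(repr(remove_last_block_sketch(completion)))  # '' -- the whole solution is lost
```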

It seems to me that we should instead remove anything after the first block, i.e. truncate at the first match with one of the stop tokens, if there is one.
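A minimal sketch of that alternative, assuming the same illustrative stop tokens (the name truncate_at_first_stop is hypothetical): cut at the first stop-token match and otherwise leave the generation untouched.

```python
import re

STOP_TOKENS = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]  # illustrative only

def truncate_at_first_stop(text):
    """Keep everything up to the first stop token; return the text unchanged if none is found."""
    match = re.search("|".join(STOP_TOKENS), text)
    return text[: match.start()] if match else text

# The same edge case now keeps the solution instead of returning an empty string:
completion = "    return sorted(numbers)\n"
print(repr(truncate_at_first_stop(completion)))  # '    return sorted(numbers)\n'
```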

If this issue makes sense, happy to create a PR for that.

loubnabnl commented 1 year ago

Actually, with the way we do the generation, there is always an eof_string in the output before generation stops because of this function, so remove_last_block always keeps the solution and only removes some excess.

What might happen is that some intermediate print/comment/function is left in between, but I think that shouldn't impact the evaluation (and it shouldn't happen either, since we stop at the first occurrence). But I agree that keep_first_block, like we do in MBPP, seems cleaner. Feel free to open a PR.

loubnabnl commented 1 year ago

Closing the issue, as this was fixed in https://github.com/bigcode-project/bigcode-evaluation-harness/pull/63