bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

bug in post processing for mbpp task #121

Closed: weiliang-zeng closed this 11 months ago

weiliang-zeng commented 11 months ago

Hi, I found a bug in the post-processing for the mbpp task. This bug degrades the performance of StarCoder on the mbpp dataset.

The first_block function is supposed to extract the first code block by scanning for the stop_words. However, tokenizer.eos_token is appended to stop_words in the parallel_generations function (generation.py), which adds <|endoftext|> to the list.

The token <|endoftext|> contains |, which re.split interprets as regex alternation, so the joined pattern gains <, endoftext, and > as separate stop alternatives and first_block truncates the generation at the first < or > it sees, which is wrong. I found that some of the errors on the mbpp task are caused purely by this. If this is fixed, the StarCoder performance is better than reported.
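A small repro of the truncation (the stop word list here is a shortened stand-in, not the harness's exact list):

```python
import re

# Shortened stand-in for the harness's stop word list, with the EOS token appended.
stop_words = ["\nclass", "\ndef", "<|endoftext|>"]
pattern = "|".join(stop_words)

# Because "|" inside "<|endoftext|>" is not escaped, the pattern's alternatives
# become: "\nclass", "\ndef", "<", "endoftext", ">".
code = "def add(a, b):\n    return a + b if a > 0 else b"

# The split fires on the ">" in "a > 0" and cuts the function mid-expression.
print(re.split(pattern, code)[0].rstrip())
# -> def add(a, b):
#        return a + b if a
```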

    def first_block(string, stop_words):
        """Split off first block of code by scanning for class, def etc. on newlines."""
        # BUG: stop words are not escaped, so the "|" characters inside
        # "<|endoftext|>" are treated as regex alternation by re.split.
        return re.split("|".join(stop_words), string)[0].rstrip()

The fix is straightforward: you can reuse the _stop_at_stop_token function from humaneval.py. After this is fixed, the StarCoder performance on the mbpp task reported in the paper will need updating.
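For reference, a minimal sketch of one possible patch, keeping the regex split but escaping each stop word literally (this is just an illustration, not the repo's actual fix):

```python
import re

def first_block(string, stop_words):
    """Split off the first block of code by scanning for the stop words."""
    # re.escape neutralizes regex metacharacters, so "<|endoftext|>" is
    # matched as a literal token instead of the alternation "<|endoftext|>".
    pattern = "|".join(re.escape(word) for word in stop_words)
    return re.split(pattern, string)[0].rstrip()
```

With this, a generation like `"def f():\n    return 1 > 0<|endoftext|>junk"` is cut at the literal `<|endoftext|>` rather than at the `>` in the comparison.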

loubnabnl commented 11 months ago

Thanks for the catch! Would you like to open a PR to update it? If not, I'm happy to do it.

(The paper used MBPP from MultiPL-E which didn't have this bug)

weiliang-zeng commented 11 months ago

I see, thanks for the clarification. So MBPP from MultiPL-E modifies the original MBPP.

By the way, I didn't see MultiPL-E support for MBPP in this repo yet. Is that correct?

Regarding the bug fix here, please go ahead. Thanks for the great work on this repo.

loubnabnl commented 11 months ago

Issue fixed in https://github.com/bigcode-project/bigcode-evaluation-harness/pull/124. Regarding MultiPL-E: no, MBPP isn't added yet.