weiliang-zeng closed this issue 11 months ago
Thanks for the catch! Would you like to open a PR to update it? If not, I'm happy to do it.
(The paper used MBPP from MultiPL-E, which didn't have this bug.)
I see. Thanks for the clarification. MBPP from MultiPL-E has modified the original MBPP.
Btw, I didn't see MultiPL-E in this repo supporting MBPP yet. Is that correct?
Regarding the bug fix here, please go ahead. Thanks for the great work on this repo.
Issue fixed in https://github.com/bigcode-project/bigcode-evaluation-harness/pull/124. Regarding MultiPL-E: no, MBPP isn't added yet.
Hi, I found a bug in the post-processing for the mbpp task. It degrades the performance of StarCoder on the mbpp dataset.

The `first_block` function is supposed to extract the code block by scanning for the stop words. However, `stop_words` is appended with `tokenizer.eos_token` in the `parallel_generations` function (`generation.py`), so the list includes `<|endoftext|>`. That token contains `|`, which the regular expression interprets as alternation, so the code in `first_block` ends up splitting on `>` as well, which is wrong. I found that some of the errors on the mbpp task are caused simply by this; with it fixed, StarCoder's performance is better than before.

The fix is straightforward: you can use the `_stop_at_stop_token` function from `humaneval.py`. Once this is fixed, StarCoder's performance on the mbpp task in the paper needs to be updated.
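To make the failure mode concrete, here is a minimal sketch. It is not the harness's actual code: the joined-regex pattern and the `stop_at_stop_token` helper below are illustrative, modeled on the description above and on the string-scanning approach in `humaneval.py`.

```python
import re

stop_words = ["\nclass", "\nprint", "<|endoftext|>"]
generation = "def compare(a, b):\n    return a > b\nprint(compare(1, 2))"

# Buggy behavior: joining stop words into one regex without escaping.
# The "|" characters inside "<|endoftext|>" become alternation operators,
# so the pattern also matches the bare characters "<" and ">".
pattern = "|".join(stop_words)
print(repr(re.split(pattern, generation)[0]))
# truncates at the ">" in "a > b" instead of at "\nprint"

# Fix: truncate at the earliest stop word with plain string search
# (the approach of _stop_at_stop_token), so no character is special.
def stop_at_stop_token(decoded, stop_tokens):
    cut = len(decoded)
    for tok in stop_tokens:
        idx = decoded.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return decoded[:cut]

print(repr(stop_at_stop_token(generation, stop_words)))
# keeps the full function body, "def compare(a, b):\n    return a > b"
```

The string-scan version also handles other eos tokens with regex metacharacters (e.g. `</s>`) without any escaping.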