icoderzqliu closed this issue 5 months ago.
The default is 512 (which works fine for HumanEval), but some tasks need more; try setting it to 1024 for MBPP. Regarding the impact on the results: if the benchmark has long prompts, you want a higher max_length so there is room left for generation, otherwise the solutions won't be complete.
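For example, here is a quick sketch of how the prompt eats into that budget (this uses codellama/CodeLlama-7b-hf just as an example tokenizer and an MBPP-style prompt, not the harness's exact template):

```python
from transformers import AutoTokenizer

# max_length_generation bounds prompt + completion, so whatever the prompt
# uses is no longer available for the generated solution.
MAX_LENGTH_GENERATION = 512

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# An MBPP-style prompt: task description plus a test assertion in a docstring.
prompt = (
    '"""\n'
    "Write a function to find the similar elements from the given two tuple lists.\n"
    "assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == (4, 5)\n"
    '"""\n'
)

prompt_tokens = len(tokenizer(prompt)["input_ids"])
room_for_solution = MAX_LENGTH_GENERATION - prompt_tokens
print(f"prompt uses {prompt_tokens} tokens, leaving {room_for_solution} for the solution")
```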
@loubnabnl I was facing the same issue for multiple-java and multiple-cpp while trying to reproduce the leaderboard scores for codellama-7b using the steps given in the leaderboard README here: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/leaderboard#2--generation
Is it supposed to be 1024 for multiple-cpp and multiple-java as well?
I was confused because the leaderboard "About" section mentions that:
> All models were evaluated with the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main) with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50.
Hi, sorry for the confusion. If this happens, try 1024; some tokenizers produce more tokens than others, which takes up more of the context. I will update the "About" section of the leaderboard.
It seems MBPP has a prompt of about 1700 tokens with some tokenizers. After this PR https://github.com/bigcode-project/bigcode-evaluation-harness/pull/244 you should be able to run the evaluation with a smaller max_length, but you might get lower scores since the solutions for some long prompts won't be generated.
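You can get a rough idea for a given tokenizer with something like the sketch below; it approximates the prompt as the task description plus the reference tests wrapped in a docstring, which may not match the harness's exact template:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("mbpp", split="test")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")

lengths = []
for doc in ds:
    # Approximate prompt: description + test assertions in a docstring.
    prompt = '"""\n{}\n{}\n"""\n'.format(doc["text"], "\n".join(doc["test_list"]))
    lengths.append(len(tokenizer(prompt)["input_ids"]))

print("longest approximate MBPP prompt:", max(lengths), "tokens")
# If the longest prompt is close to max_length_generation, there is little or
# no room left for the model to generate a complete solution.
```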
You mentioned in the README that max_length_generation=512 is enough for tasks like HumanEval and MBPP, but when I tested phi-1.5 and deepseek-coder-1.3b-base on the MBPP task, the following problems occurred at max_length_generation=512.
How should this parameter be set so that my results line up with the reported ones? And does the choice of this parameter have a significant impact on the results?