bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

max_length_generation parameter #207

Closed · icoderzqliu closed this issue 1 week ago

icoderzqliu commented 3 months ago

You mentioned in the README that max_length_generation=512 is enough for tasks like HumanEval and MBPP, but when I tested phi-1.5 and deepseek-coder-1.3b-base on the MBPP task, the following error occurred with max_length_generation=512.

ValueError: Input length of input_ids is 512, but `max_length` is set to 512. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
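For reference, the error reproduces outside the harness with a plain `generate` call (a minimal sketch, not harness code; the model is one of those mentioned above and the prompt is a stand-in for a long MBPP prompt):

```python
# Minimal reproduction sketch (not harness code).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/deepseek-coder-1.3b-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

long_prompt = "..."  # stand-in for an MBPP prompt that tokenizes to >= 512 tokens
ids = tok(long_prompt, return_tensors="pt").input_ids

# `max_length` counts the prompt tokens too, so once ids.shape[1] >= 512
# there is no room left to generate and transformers raises the error above.
model.generate(ids, max_length=512)
```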

How should this parameter be set so that the test results are comparable? Does this setting have a significant impact on the results?

loubnabnl commented 3 months ago

The default is 512, which works fine for HumanEval, but some tasks need more; try setting it to 1024 for MBPP. Regarding the impact on the results: if the benchmark has long prompts, you want a higher max_length to leave room for generation, otherwise the solutions won't be complete.
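For MBPP that looks something like the command below (a sketch based on the flags documented in the harness README; everything other than `--max_length_generation 1024` is just the README's example values):

```bash
accelerate launch main.py \
  --model deepseek-ai/deepseek-coder-1.3b-base \
  --tasks mbpp \
  --max_length_generation 1024 \
  --temperature 0.1 \
  --n_samples 15 \
  --batch_size 10 \
  --allow_code_execution
```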

toptechie156 commented 2 months ago

@loubnabnl I was facing the same issue for multiple-java and multiple-cpp while trying to reproduce the leaderboard score for codellama-7b using the steps given in the leaderboard README here: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/leaderboard#2--generation

Is it supposed to be 1024 for multiple-cpp and multiple-java as well?

I was confused because in the leaderboard About section it is mentioned that

All models were evaluated with the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main) with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50.

loubnabnl commented 2 months ago

Hi, sorry for the confusion. If this happens, try 1024; some tokenizers generate more tokens than others, which takes up more of the context. I will update the "About" section of the leaderboard.
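The effect is easy to check: tokenize the same prompt with two tokenizers and compare lengths (a quick sketch; the prompt is a stand-in and the model names are just examples from this thread):

```python
# Quick sketch: the same prompt tokenizes to different lengths per tokenizer.
from transformers import AutoTokenizer

prompt = "..."  # stand-in for a long benchmark prompt
for name in ["codellama/CodeLlama-7b-hf", "microsoft/phi-1_5"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, len(tok(prompt).input_ids))
```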

loubnabnl commented 1 week ago

It seems MBPP has a prompt that reaches 1700 tokens with some tokenizers. After this PR https://github.com/bigcode-project/bigcode-evaluation-harness/pull/244 you should be able to run the evaluation with a smaller max_length, but you might get lower scores, as the solutions to some long prompts won't be generated.
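To see the trade-off concretely, a back-of-the-envelope check using the 1700-token figure above (my sketch, not harness code):

```python
# How much room each max_length_generation leaves for the completion
# when the prompt alone takes 1700 tokens.
prompt_len = 1700
for max_length in (512, 1024, 2048):
    budget = max(max_length - prompt_len, 0)
    print(f"max_length={max_length}: {budget} tokens left for the solution")
# 512 and 1024 leave 0 tokens: the run no longer errors out after the PR,
# but those solutions can't be generated, which is why scores drop.
```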