Closed huybery closed 1 year ago
Thanks for the feedback! You're using StarCoder and not StarCoderBase, right? Can you share your execution command? This should reproduce the result:
accelerate launch main.py \
--model bigcode/starcoder \
--max_length_generation 512 \
--tasks humaneval \
--n_samples 50 \
--batch_size 50 \
--temperature 0.2 \
--precision bf16 \
--allow_code_execution \
--use_auth_token
EDIT: I just ran
accelerate launch main.py --model bigcode/starcoder --max_length_generation 512 --tasks humaneval --n_samples 200 --batch_size 100 --temperature 0.2 --precision bf16 --allow_code_execution --use_auth_token
And it returns
{
"humaneval": {
"pass@1": 0.3357317073170732,
"pass@10": 0.4896174189684954,
"pass@100": 0.6159876811517714
},
"config": {
"model": "bigcode/starcoder",
"temperature": 0.2,
"n_samples": 200
}
}
which matches the paper's result.
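For reference, the pass@1/pass@10/pass@100 numbers above are typically computed with the unbiased estimator from the Codex paper, applied per problem and averaged. A minimal sketch (the function name and the sample counts in the example are illustrative, not taken from the harness):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = total samples per problem, c = correct samples."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # Compute 1 - prod_{i=0}^{k-1} (n - c - i) / (n - i) to avoid huge binomials.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With n=200 samples and c=67 correct, pass@1 reduces to c/n = 0.335.
print(pass_at_k(200, 67, 1))
```

The harness reports the mean of this estimate over all 164 HumanEval problems, which is why `n_samples` must be at least as large as the biggest `k` you want to report.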
It works! Thanks for your patient response. I have another question: what should I do if I want to evaluate HumanEval using StarCoder's FIM mode? It seems there is no corresponding CLI flag to activate it.
I found that there are related operations in:
Here it branches based on the type of prompt, which means I would need to make changes to the get_prompt
function in
Am I right?
You will need to add another task (see the guide) for the FIM HumanEval dataset. In particular, you would need to change the prompt and the post-processing of the solution; check this issue https://github.com/bigcode-project/bigcode-evaluation-harness/issues/69 for pointers about the FIM evaluation we did for SantaCoder in another codebase. We do support FIM mode in generation, though, through the code you shared, and it is currently used in the DS-1000 tasks.
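To make the prompt change concrete, here is a minimal sketch of building a PSM-style (prefix-suffix-middle) FIM prompt. The sentinel strings follow the ones described for StarCoder, but you should verify them against the model tokenizer's special tokens before relying on this:

```python
# Assumed FIM sentinel tokens for StarCoder (verify against the tokenizer).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def make_fim_prompt(prefix: str, suffix: str) -> str:
    """Build a PSM-format infilling prompt: the model generates the
    'middle' text that joins prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

# Hypothetical usage: infill a function body given its signature and a caller.
prompt = make_fim_prompt(
    "def add(a, b):\n    ",
    "\n\nprint(add(1, 2))",
)
```

A FIM HumanEval task would then build prompts like this in its `get_prompt`, and its post-processing would stitch the generated middle back between the prefix and suffix before executing the tests.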
Great, I see. Thanks for your help again !
Thank you for providing an excellent evaluation toolkit! It is very convenient and flexible.
But when I used the evaluation tool to evaluate HumanEval performance on StarCoder, I obtained the following results.
They are lower than the paper's result (pass@1 = 33.6). Did I miss anything crucial? All parameters are defaults.