bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Reproducing the HumanEval performance of StarCoder #82

Closed: huybery closed this issue 1 year ago

huybery commented 1 year ago

Thank you for providing an excellent evaluation toolkit! It is very convenient and flexible.

However, when I used the evaluation tool to measure HumanEval performance on StarCoder, I obtained the following results.

{
  "humaneval": {
    "pass@1": 0.3011280487804878,
    "pass@10": 0.41708568124396794,
    "pass@100": 0.5175640419344132
  },
  "config": {
    "model": "../ckpt/starcoder",
    "temperature": 0.2,
    "n_samples": 200
  }
}

This is lower than the paper's result of pass@1 = 33.6. Did I miss anything crucial? All parameters are at their defaults.

loubnabnl commented 1 year ago

Thanks for the feedback! You're using StarCoder and not StarCoderBase, right? Can you share your execution command? This should reproduce the result:

accelerate launch  main.py \
  --model bigcode/starcoder \
  --max_length_generation 512 \
  --tasks humaneval \
  --n_samples 50 \
  --batch_size 50 \
  --temperature 0.2 \
  --precision bf16 \
  --allow_code_execution \
  --use_auth_token

EDIT: I just ran

accelerate launch  main.py \
  --model bigcode/starcoder \
  --max_length_generation 512 \
  --tasks humaneval \
  --n_samples 200 \
  --batch_size 100 \
  --temperature 0.2 \
  --precision bf16 \
  --allow_code_execution \
  --use_auth_token

and it returns:

{
  "humaneval": {
    "pass@1": 0.3357317073170732,
    "pass@10": 0.4896174189684954,
    "pass@100": 0.6159876811517714
  },
  "config": {
    "model": "bigcode/starcoder",
    "temperature": 0.2,
    "n_samples": 200
  }
}

which matches the paper's result.
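
For reference, pass@k here is the unbiased estimator from the Codex paper, computed from the n_samples generations per problem, which is why n_samples and temperature influence the numbers. A minimal sketch of that estimator (illustrative only, not the harness's exact code; the function name is just for this example):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n = total samples per problem, c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. with n_samples=200, a problem where 67 of 200 generations pass
# contributes pass_at_k(200, 67, 1) = 67/200 = 0.335 to the pass@1 average.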

huybery commented 1 year ago

It works! Thanks for your patient response. I have another question: what should I do if I want to evaluate HumanEval using StarCoder's FIM mode? It seems there is no corresponding CLI flag to activate it.

I found related handling in:

https://github.com/bigcode-project/bigcode-evaluation-harness/blob/e2072f2e444bfc8f726326cf38050ca9e721fa94/lm_eval/utils.py#L46

The behavior here depends on the type of the prompt, which means I need to make changes to the get_prompt function in

https://github.com/bigcode-project/bigcode-evaluation-harness/blob/e2072f2e444bfc8f726326cf38050ca9e721fa94/lm_eval/tasks/humaneval.py#L46

Am I right?

loubnabnl commented 1 year ago

You will need to add a new task (see the guide) for a FIM HumanEval dataset. In particular, you would need to change the prompt and the post-processing of the solution; check this issue https://github.com/bigcode-project/bigcode-evaluation-harness/issues/69 for pointers about the FIM evaluation we did for SantaCoder in another codebase. We do support FIM mode in generation, though, through the code you shared, and it is currently used in the DS-1000 tasks.
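
For anyone reading along, a rough sketch of what the get_prompt of such a new task could return for StarCoder. The prefix/suffix split of each problem and the doc field names are hypothetical; only <fim_prefix>, <fim_suffix>, and <fim_middle> are StarCoder's documented FIM special tokens (prefix-suffix-middle order):

# Hypothetical get_prompt for a FIM-style HumanEval task (a sketch, not the harness API).
# Assumes each doc has already been split into the code before and after the region to fill.
def get_prompt(self, doc):
    prefix = doc["prefix"]  # hypothetical field: code before the infill region
    suffix = doc["suffix"]  # hypothetical field: code after the infill region
    # StarCoder PSM format: the model generates the middle after <fim_middle>.
    return "<fim_prefix>" + prefix + "<fim_suffix>" + suffix + "<fim_middle>"

The task's post-processing of the solution would then need to strip the FIM tokens and reassemble prefix + generated middle + suffix before running the unit tests.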

huybery commented 1 year ago

Great, I see. Thanks again for your help!