bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

Performance of WizardCoder on HumanEvalFixDocs #18

Closed: awasthiabhijeet closed this issue 1 year ago

awasthiabhijeet commented 1 year ago

Hi @Muennighoff,

Thank you for releasing many useful resources.

QQ: Do you know the accuracy of WizardCoder-15.5B on HumanEvalFixDocs (i.e., where does WizardCoder stand in Table 12 of your paper)?

Muennighoff commented 1 year ago

We didn't run that. I think it would be somewhere between OctoCoder & GPT-4. You can run it easily like below tho:

accelerate launch main.py \
--model WizardCoder-15.5B \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt wizardcoder \
--save_generations_path generations_humanevalfixdocspython_wizardcoder.json \
--metric_output_path evaluation_humanevalfixdocspython_wizardcoder.json \
--max_length_generation 2048 \
--precision bf16

awasthiabhijeet commented 1 year ago

Thanks!

awasthiabhijeet commented 1 year ago

I observe an accuracy of 51.2 with WizardCoder. (Thus lower than OctoCoder, and also lower than WizardCoder's performance on HumanEval)

PS: I'm reporting with greedy decoding and pass@1 score.

As a sanity check, would it be possible for you to confirm if you are observing the same performance? CC: @Muennighoff

awasthiabhijeet commented 1 year ago

With greedy decoding, StarCoder gives a pass@1 of 61.6.

Muennighoff commented 1 year ago

Using --temperature 0.2 --n_samples 20 would likely increase the score a bit.

For StarCoder, which prompt are you using? I'm surprised it is that high. I would be curious to know what you get for --temperature 0.2 --n_samples 20.
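
For context on why greedy decoding and sampling can give different pass@1 numbers, here is a minimal sketch of how pass@k is commonly estimated from n samples per problem. It assumes the harness uses the standard unbiased estimator from the Codex paper, 1 - C(n-c, k) / C(n, k), where c is the number of samples that pass the tests. The function names are illustrative and not taken from the repository.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for three problems, 20 samples each (not real results).
example_counts = [20, 10, 0]
print(mean_pass_at_k(example_counts, n=20, k=1))
print(mean_pass_at_k(example_counts, n=20, k=10))
```

With k=1 the estimate reduces to c/n, so pass@1 at --n_samples 20 is the average fraction of passing samples per problem, whereas greedy decoding scores a single deterministic completion per problem as pass or fail.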

awasthiabhijeet commented 1 year ago

I get 61.6 when using the starcodercommit prompt:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample False \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16

Will try to run with 20 samples and temperature 0.2.

CC: @Muennighoff

awasthiabhijeet commented 1 year ago

With 20 samples and T=0.2, I observe the following result (pass@1 of 58.9, compared to 43.5 reported in the paper):

{
  "humanevalfixdocs-python": {
    "pass@1": 0.589329268292683,
    "pass@10": 0.6989868047455075
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

Would it be possible for you to re-compute these numbers just to be sure?
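
To make the comparison explicit, here is a small sketch that converts the scores from the metrics output above into percentages and computes the gap to the paper's number. The JSON is inlined with only the score block (the config is omitted), and the helper name `summarize` is hypothetical; in practice one would load the file given by --metric_output_path instead.

```python
import json

# Scores copied from the metrics output in this comment; config omitted.
metrics_json = """
{
  "humanevalfixdocs-python": {
    "pass@1": 0.589329268292683,
    "pass@10": 0.6989868047455075
  }
}
"""

# pass@1 quoted in this thread as the value reported in the paper.
PAPER_PASS_AT_1 = 43.5


def summarize(raw: str, task: str) -> dict[str, float]:
    """Return the task's scores as percentages rounded to one decimal."""
    scores = json.loads(raw)[task]
    return {name: round(value * 100, 1) for name, value in scores.items()}


summary = summarize(metrics_json, "humanevalfixdocs-python")
gap = round(summary["pass@1"] - PAPER_PASS_AT_1, 1)
print(summary)
print(f"pass@1 gap vs. paper: {gap} points")
```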

Muennighoff commented 1 year ago

Discussion moved to https://github.com/bigcode-project/octopack/issues/21