bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

Performance of StarCoder on HumanEvalFixDocs #21

Open awasthiabhijeet opened 11 months ago

awasthiabhijeet commented 11 months ago

With StarCoder, I am observing a pass@1 score of 58.9 instead of 43.5 as reported in the OctoCoder paper.

Script used:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 2048 \
--precision fp16

Results:

{
  "humanevalfixdocs-python": {
    "pass@1": 0.589329268292683,
    "pass@10": 0.6989868047455075
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_sample_prompt.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
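For reference, the pass@1 and pass@10 numbers in these reports come from the unbiased pass@k estimator introduced with HumanEval: with n_samples=20 completions per problem, the per-problem score is computed from the number of correct samples and then averaged over all problems. A minimal sketch (the function name is mine, not the harness's):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e. the probability that at least one of k samples drawn
    from n generated samples is correct, given c correct ones."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k
        # samples must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 of 20 samples correct -> per-problem pass@1 of 0.5;
# the reported score averages this quantity over all problems.
```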

CC: @Muennighoff

Muennighoff commented 11 months ago

A few things are different in the command we ran: we used --precision bf16 instead of fp16, --max_length_generation 1800, and --batch_size 5. All of these can slightly affect the score, though I would be surprised if by this much. You can verify the 43.5 we got here https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json & the generations here https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/generations_humanevalfixdocspy_starcoder_temp02.json. If you want, you can compare those generations directly to yours to see where the discrepancies are.
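For intuition on why fp16 vs bf16 alone can nudge scores: the two formats round the same real value differently (fp16 has a 10-bit mantissa; bf16 keeps float32's 8-bit exponent but only a 7-bit mantissa), so logits can diverge slightly between runs and, under sampling, so can the generated tokens. A rough stdlib-only illustration of the rounding difference, not harness code:

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision (10-bit mantissa).
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x: float) -> float:
    # Emulate bfloat16 by zeroing the low 16 bits of the
    # float32 representation (leaving a 7-bit mantissa).
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

# The same value rounds differently in the two formats, so
# fp16 and bf16 forward passes are not bit-identical.
print(to_fp16(0.1), to_bf16(0.1))
```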

Overall, yes, the commit format on the pretrained StarCoder works really well. On the regular HumanEvalFix, StarCoder + commit format also outperforms OctoCoder; see the table below from Appendix G. The drawback of the commit format is that it does not work well for code synthesis or explanation.
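For context, the --prompt starcodercommit option wraps the task in the git-commit format StarCoder saw during pretraining, using its <commit_before>/<commit_msg>/<commit_after> special tokens. A hypothetical sketch of such a template (the exact wiring in the evaluation harness may differ):

```python
def commit_prompt(buggy_code: str, instruction: str) -> str:
    """Sketch of a commit-style prompt. The special tokens come
    from StarCoder's pretraining on git commits; the model is
    expected to continue after <commit_after> with fixed code."""
    return (
        "<commit_before>" + buggy_code
        + "<commit_msg>" + instruction
        + "<commit_after>"
    )

# Hypothetical usage with a toy buggy function:
prompt = commit_prompt("def add(a, b):\n    return a - b\n",
                       "Fix the bug in add")
```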

[Screenshot: table from Appendix G comparing StarCoder + commit format with OctoCoder on HumanEvalFix]
awasthiabhijeet commented 11 months ago

On the regular HumanEvalFix, StarCoder + Commit Format also outperforms OctoCoder, see the below Table from Appendix G.

This is helpful, thanks! I feel this deserves a mention in Table 2 itself then :)

Could you also share the script you use to obtain https://huggingface.co/datasets/bigcode/evaluation/blob/main/starcoder/humanevalfixdocs/commit_format/evaluation_humanevalfixdocspy_starcoder_temp02.json? I can try re-running it in the exact same config that you used.

Thanks!

Muennighoff commented 11 months ago

Sure it would be:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 1800 \
--precision bf16
awasthiabhijeet commented 11 months ago

Sure it would be:

accelerate launch main.py \
--model $MODEL_DIR \
--tasks humanevalfixdocs-python \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt starcodercommit \
--save_generations_path $MODEL_DIR/generations_humanevalfixdocspython_starcodercommit_prompt.json \
--metric_output_path $MODEL_DIR/evaluation_humanevalfixdocspython_starcodercommit_prompt.json \
--max_length_generation 1800 \
--precision bf16

With this script, I observe a pass@1 score of 60.1.

{
  "humanevalfixdocs-python": {
    "pass@1": 0.6009146341463415,
    "pass@10": 0.6974812593960444
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 5,
    "max_length_generation": 1800,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "starcoder/evaluation_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
    "save_generations": true,
    "save_generations_path": "starcoder/generations_humanevalfixdocspython_starcodercommit_prompt_bf16.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
awasthiabhijeet commented 11 months ago

CC: @Muennighoff

Muennighoff commented 11 months ago

You're right, it seems the result in the paper is too low. I reran it & got the below:

{
  "humanevalfixdocs-python": {
    "pass@1": 0.5878048780487805,
    "pass@10": 0.6939082542089792
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 20,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "starcoder",
    "modeltype": "causal",
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalfixdocs-python",
    "instruction_tokens": null,
    "batch_size": 5,
    "max_length_generation": 1800,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_humanevalfixdocspython_starcoder_temp02_commit.json",
    "save_generations": true,
    "save_generations_path": "generations_humanevalfixdocspython_starcoder_temp02_commit.json",
    "save_references": false,
    "prompt": "starcodercommit",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

I will update the paper soon. Thanks a lot for noting this!

awasthiabhijeet commented 11 months ago

Thanks @Muennighoff :)

Muennighoff commented 11 months ago

Attached is how the new section will look, including the updated results. Thanks again!

[Screenshot: updated paper section with the revised HumanEvalFixDocs results]
awasthiabhijeet commented 11 months ago

Thanks @Muennighoff, this is very helpful! :)