bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

add codellama prompt to humanevalsynthesize #130

Closed: loubnabnl closed this 10 months ago

awasthiabhijeet commented 9 months ago

Hi @loubnabnl,

What performance do you get with the CodeLlama models on HumanEvalSynthesize?

Surprisingly, with CodeLlama-7B-Instruct and CodeLlama-13B-Instruct, I observe higher pass@1 scores on HumanEvalSynthesize than the CodeLlama paper reports.

This is the pass@1 score that the CodeLlama paper reports for the Instruct models: [image]

I observe pass@1 scores of 47% and 50.6% with the 7B and 13B Instruct models, respectively.
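For readers checking the arithmetic, here is a minimal sketch of how a pass@1 value like the 13B result reported in this comment can arise. It uses the standard unbiased pass@k estimator and assumes the benchmark has 164 problems (the size of the original HumanEval set) with one generation per problem. The names `pass_at_k`, `NUM_PROBLEMS`, and `reported` are illustrative and are not taken from the harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Assumption: 164 problems, as in the original HumanEval set.
NUM_PROBLEMS = 164
# pass@1 value reported for the 13B model in this thread.
reported = 0.5060975609756098

# With one sample per problem, pass@1 is just the fraction of problems solved.
solved = round(reported * NUM_PROBLEMS)
per_problem = (
    [pass_at_k(n=1, c=1, k=1)] * solved
    + [pass_at_k(n=1, c=0, k=1)] * (NUM_PROBLEMS - solved)
)
mean_pass_at_1 = sum(per_problem) / NUM_PROBLEMS

print(f"{solved} of {NUM_PROBLEMS} problems solved, pass@1 = {mean_pass_at_1:.4f}")
```

Under these assumptions the reported score corresponds to a whole number of solved problems, so a change of one problem moves pass@1 by about 0.6 percentage points.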

Could this be due to better post-processing in this library? (I am assuming HumanEval and HumanEvalSynthesize are the same.)

Here is my output for the 13B model.

{
  "humanevalsynthesize-python": {
    "pass@1": 0.5060975609756098
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "ckpt_copy",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalsynthesize-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "ckpt_copy/evaluation_humanevalsynthesize-python_codellama.json",
    "save_generations": true,
    "save_generations_path": "ckpt_copy/generations_humanevalsynthesize-python_codellama.json",
    "save_references": false,
    "prompt": "codellama",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
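When comparing this run against other reported numbers, it may help to state which decoding mode the config implies. The sketch below reads an excerpt of the config above and summarizes it, assuming the usual Hugging Face `generate` behavior in which `do_sample: false` selects greedy decoding and the sampling parameters (`temperature`, `top_p`) have no effect. The helper `describe_decoding` is hypothetical and not part of the harness.

```python
import json

# Excerpt of the "config" block from the output above.
config_json = """
{
  "do_sample": false,
  "temperature": 0.2,
  "top_k": 0,
  "top_p": 0.95,
  "n_samples": 1
}
"""

def describe_decoding(config: dict) -> str:
    """Summarize the effective decoding mode (hypothetical helper)."""
    if config.get("do_sample"):
        return (
            f"sampling: temperature={config['temperature']}, "
            f"top_p={config['top_p']}, n_samples={config['n_samples']}"
        )
    # Assumption: with do_sample false, temperature and top_p are not used.
    return f"greedy: n_samples={config['n_samples']}, sampling parameters unused"

config = json.loads(config_json)
print(describe_decoding(config))
```

For this config the summary is greedy decoding with a single sample per problem, so the `temperature` and `top_p` values listed are not what determined the generations.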

CC: @Muennighoff

loubnabnl commented 8 months ago

Answered here: https://github.com/bigcode-project/bigcode-evaluation-harness/issues/142#issuecomment-1772969343