bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Reproduction Issues of Code Llama #167

Closed · VoiceBeer closed this issue 10 months ago

VoiceBeer commented 10 months ago

Hi, thanks for the work.

I just tried to reproduce the results from the Code Llama paper. Here's the log file:

{
  "humaneval": {
    "pass@1": 0.018292682926829267
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/mnt/models/CodeLlama-7b-Python",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 512,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "/root/ch/bigcode-evaluation-harness/outputs/metrics/humaneval_codellama_7b_python.json",
    "save_generations": true,
    "save_generations_path": "/root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json",
    "save_references": false,
    "prompt": "prompt",
    "max_memory_per_gpu": "auto",
    "check_references": false
  }
}

The result is far from the 38.4% reported in the original paper. In issue #158 I noticed that humanevalsynthesize-python should be used for the instruction model, but there I also got only 1.8%.

Would anyone be able to help reproduce the Code Llama results?
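
For reference, a command along the following lines produces a config like the one above. This is only a sketch reconstructed from the config keys in the log (the flag names mirror those keys and the paths are the ones from the log), so double-check the flags against main.py --help and adjust to your setup; note do_sample was false, i.e. greedy decoding.

# sketch: reconstructed from the config above, not the exact invocation
accelerate launch main.py \
  --model /mnt/models/CodeLlama-7b-Python \
  --tasks humaneval \
  --max_length_generation 512 \
  --n_samples 1 \
  --batch_size 1 \
  --precision bf16 \
  --trust_remote_code \
  --allow_code_execution \
  --save_generations \
  --save_generations_path /root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json \
  --metric_output_path /root/ch/bigcode-evaluation-harness/outputs/metrics/humaneval_codellama_7b_python.json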

VoiceBeer commented 10 months ago

Quite strange, I got 40.2% with the same script using another conda environment.

{
  "humaneval": {
    "pass@1": 0.4024390243902439
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/mnt/models/CodeLlama-7b-Python",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": true,
    "trust_remote_code": true,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 512,
    "precision": "bf16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "/root/ch/bigcode-evaluation-harness/outputs/metrics/humaneval_codellama_7b_python.json",
    "save_generations": true,
    "save_generations_path": "/root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json",
    "save_references": false,
    "prompt": "prompt",
    "max_memory_per_gpu": "auto",
    "check_references": false
  }
}

What happened? The former environment was created from requirements.txt with python=3.7.10, while the latter uses Python 3.10.0.

loubnabnl commented 10 months ago

I haven't tried execution with Python 3.7, but it could be that execution of the generated solutions failed in that env. Try evaluating the generations you got with Python 3.10 in your Python 3.7 env using evaluation-only mode, by providing this flag (see the example command after the link below):

--load_generations_path /root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json

https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main#evaluation-only
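
A full evaluation-only run would look roughly like this; this is a sketch assuming the same accelerate launch main.py entry point used for generation, where no new generation happens and the saved generations are simply executed and scored in the current environment:

# sketch: evaluation-only mode, re-scores previously saved generations
accelerate launch main.py \
  --model /mnt/models/CodeLlama-7b-Python \
  --tasks humaneval \
  --n_samples 1 \
  --allow_code_execution \
  --load_generations_path /root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json

If the Python 3.7 env scores these same generations near 0% while the Python 3.10 env gives ~40%, that points to code execution failing in the 3.7 environment rather than to the generations themselves.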

VoiceBeer commented 10 months ago

Thanks @loubnabnl, I've figured it out :>