bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Check pass/fail count for humaneval #211

Closed toptechie156 closed 2 months ago

toptechie156 commented 3 months ago

I'm running HumanEval for the codellama/CodeLlama-7b-Instruct-hf model using the following command:

accelerate launch main.py --model codellama/CodeLlama-7b-Instruct-hf --max_length_generation 512 --tasks humaneval --temperature 0.2 --n_samples 1 --batch_size 1 --precision fp16 --load_in_4bit --allow_code_execution --save_generations --save_references --limit 5

Currently the only result I can access is the final score, which is written to the file evaluation_results.json:

{
  "humaneval": {
    "pass@1": 0.8
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "codellama/CodeLlama-7b-Instruct-hf",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 512,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": true,
    "left_padding": false,
    "limit": 5,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": true,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
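As a rough sketch of what this aggregate file can and cannot tell you: with n_samples set to 1, each problem contributes either 0 or 1 to pass@1, so the score times the number of problems gives the pass count. The JSON below is a trimmed copy of the file above; the variable names are only illustrative.

```python
import json

# Trimmed copy of the evaluation_results.json shown above.
results_text = """
{
  "humaneval": {"pass@1": 0.8},
  "config": {"n_samples": 1, "limit": 5}
}
"""
results = json.loads(results_text)

pass_at_1 = results["humaneval"]["pass@1"]
n_problems = results["config"]["limit"]

# With one sample per problem, pass@1 is the fraction of problems solved,
# so multiplying by the problem count recovers the number of passes.
n_passed = round(pass_at_1 * n_problems)
n_failed = n_problems - n_passed

print(f"{n_passed} passed, {n_failed} failed out of {n_problems}")
```

This only recovers the counts (4 passed, 1 failed here); it does not say which problem failed or why.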

I need to see which problems my model solved correctly (unit tests passed) and which problems failed their test cases, along with the test output for the failures.

loubnabnl commented 3 months ago

Hi, you will need to change this line so that it saves the per-problem details in addition to the score; right now the details go into an unused variable: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/094c7cc197d13a53c19303865e2056f1c7488ac1/bigcode_eval/tasks/humaneval.py#L98 For example:

results, details = compute_code_eval(..)
details.to_json("details.json")

See doc: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/094c7cc197d13a53c19303865e2056f1c7488ac1/bigcode_eval/tasks/custom_metrics/code_eval.py#L78
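A minimal sketch of one way to save and inspect those details, under an assumption that is worth verifying against the linked code_eval.py: that the second return value is a dict mapping each problem index to a list of (completion_id, result) pairs, where result is a dict with "passed" and "result" keys. If it is a plain dict like that, it would have no to_json method, so the sketch uses json.dump instead. The helper summarize_details and the sample data are hypothetical, not part of the harness.

```python
import json
import os
import tempfile
from collections import defaultdict


def summarize_details(details, path="details.json"):
    """Write per-problem results to `path` and return (passed, failed) problem ids."""
    passed, failed = [], []
    serializable = {}
    for task_id in sorted(details):
        # Assumed shape: each entry is a (completion_id, result_dict) pair.
        pairs = sorted(details[task_id], key=lambda pair: pair[0])
        runs = [result for _, result in pairs]
        serializable[str(task_id)] = runs
        # A problem counts as passed if any of its samples passed.
        if any(run["passed"] for run in runs):
            passed.append(task_id)
        else:
            failed.append(task_id)
    with open(path, "w") as f:
        json.dump(serializable, f, indent=2)
    return passed, failed


# Hypothetical stand-in for the second return value of compute_code_eval.
details = defaultdict(list)
details[0].append((0, {"task_id": 0, "passed": True, "result": "passed", "completion_id": 0}))
details[1].append((0, {"task_id": 1, "passed": False, "result": "failed: AssertionError", "completion_id": 0}))

out_path = os.path.join(tempfile.mkdtemp(), "details.json")
passed, failed = summarize_details(details, out_path)

print("passed:", passed)
print("failed:", failed)
for task_id in failed:
    for _, run in details[task_id]:
        print(task_id, run["result"])
```

Since the problem ids here are positions in the evaluated subset, they should line up with the entries in generations.json and references.json saved by --save_generations and --save_references, assuming those files keep the same order.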