Quite strange: I got 40.2% with the same script using a different conda environment.
{
"humaneval": {
"pass@1": 0.4024390243902439
},
"config": {
"prefix": "",
"do_sample": false,
"temperature": 0.2,
"top_k": 0,
"top_p": 0.95,
"n_samples": 1,
"eos": "<|endoftext|>",
"seed": 0,
"model": "/mnt/models/CodeLlama-7b-Python",
"modeltype": "causal",
"peft_model": null,
"revision": null,
"use_auth_token": true,
"trust_remote_code": true,
"tasks": "humaneval",
"instruction_tokens": null,
"batch_size": 1,
"max_length_generation": 512,
"precision": "bf16",
"load_in_8bit": false,
"load_in_4bit": false,
"limit": null,
"limit_start": 0,
"postprocess": true,
"allow_code_execution": true,
"generation_only": false,
"load_generations_path": null,
"load_data_path": null,
"metric_output_path": "/root/ch/bigcode-evaluation-harness/outputs/metrics/humaneval_codellama_7b_python.json",
"save_generations": true,
"save_generations_path": "/root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json",
"save_references": false,
"prompt": "prompt",
"max_memory_per_gpu": "auto",
"check_references": false
}
}
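For reference, that config corresponds roughly to an invocation like the following (a sketch; the flag spellings are taken from the harness README and may differ across versions):

# near-greedy generation (temperature 0.2, n_samples 1) with execution enabled
accelerate launch main.py \
    --model /mnt/models/CodeLlama-7b-Python \
    --tasks humaneval \
    --temperature 0.2 \
    --top_p 0.95 \
    --n_samples 1 \
    --batch_size 1 \
    --max_length_generation 512 \
    --precision bf16 \
    --allow_code_execution \
    --save_generations \
    --save_generations_path /root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json \
    --metric_output_path /root/ch/bigcode-evaluation-harness/outputs/metrics/humaneval_codellama_7b_python.json \
    --use_auth_token \
    --trust_remote_code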
What happened? The former environment was created from requirements.txt with Python 3.7.10, while the latter uses Python 3.10.0.
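For context, each environment was created along these lines (the env names here are hypothetical):

# Python 3.7 env, installed from the repo's requirements.txt
conda create -n harness-py37 python=3.7.10
conda activate harness-py37
pip install -r requirements.txt

# Python 3.10 env, same requirements
conda create -n harness-py310 python=3.10.0
conda activate harness-py310
pip install -r requirements.txt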
I haven't tried execution with Python 3.7, but it could be that executing the generated solutions failed in that env. Try evaluating the generations you got with Python 3.10 in your Python 3.7 env using evaluation-only mode, by providing this flag:
--load_generations_path /root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json
https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main#evaluation-only
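Something like this (a sketch following the README's evaluation-only example, which still passes --model):

# skip generation; execute and score previously saved generations
accelerate launch main.py \
    --model /mnt/models/CodeLlama-7b-Python \
    --tasks humaneval \
    --n_samples 1 \
    --allow_code_execution \
    --load_generations_path /root/ch/bigcode-evaluation-harness/outputs/generations/humaneval_codellama_7b_python.json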
Thanks @loubnabnl, I've figured it out :>
Hi, thanks for the work.
I just tried to reproduce the results from the CodeLlama paper. Here's the log file:
The result is far from the 38.4% reported in the original paper. In issue #158, I noticed that for the instruction model, humanevalsynthesize-python should be used, but with that task (along the lines sketched below) I only got 1.8% as well.
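This is roughly the invocation (a sketch: the model path is my local one, and the --prompt template value for CodeLlama-Instruct is an assumption on my part, which may itself be the problem):

# HumanEvalPack synthesis task for instruction-tuned models
accelerate launch main.py \
    --model /mnt/models/CodeLlama-7b-Instruct \
    --tasks humanevalsynthesize-python \
    --prompt instruct \
    --max_length_generation 512 \
    --precision bf16 \
    --allow_code_execution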
Would anyone be able to help reproduce the CodeLlama results?