bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

The result of llama2-13b-chat (pass@1 18) is worse than the paper's (pass@1 37) #233

Closed: moyi-qwq closed this issue 1 month ago

moyi-qwq commented 1 month ago

{ "humaneval": { "pass@1": 0.18658536585365854, "pass@10": 0.2073170731707317 }, "config": { "prefix": "", "do_sample": true, "temperature": 0.1, "top_k": 0, "top_p": 0.95, "n_samples": 10, "eos": "<|endoftext|>", "seed": 0, "model": "huggingface/hub/models--meta-llama--Llama-2-13b-chat-hf/snapshots/a2cb7a712bb6e5e736ca7f8cd98167f81a0b5bd8", "modeltype": "causal", "peft_model": null, "revision": null, "use_auth_token": false, "trust_remote_code": false, "tasks": "humaneval", "instruction_tokens": null, "batch_size": 10, "max_length_generation": 512, "precision": "fp16", "load_in_8bit": false, "load_in_4bit": false, "left_padding": false, "limit": null, "limit_start": 0, "save_every_k_tasks": -1, "postprocess": true, "allow_code_execution": true, "generation_only": false, "load_generations_path": null, "load_data_path": null, "metric_output_path": "evaluation_results.json", "save_generations": true, "load_generations_intermediate_paths": null, "save_generations_path": "generations.json", "save_references": false, "save_references_path": "references.json", "prompt": "prompt", "max_memory_per_gpu": null, "check_references": false } }

loubnabnl commented 1 month ago

Hi, in which section did you find the 37 score? I only found a number for Llama2-13B-base:

[screenshot: reported HumanEval score for Llama2-13B-base]

Instruct models usually need a chat-formatted version of the benchmark for optimal performance; we use humanevalsynthesize-python for this. You can find an example here: https://github.com/bigcode-project/bigcode-evaluation-harness/issues/158#issuecomment-1792691690 (you might need to change the prompt template; I don't think the existing ones work with llama2).
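
For example, a run along these lines (a sketch only: "--prompt instruct" is an assumption naming one of the built-in templates, and as noted above a Llama-2-specific chat template may still need to be added to the task):

```bash
# Sketch: evaluate the chat model on the chat-formatted task, reusing the
# sampling settings from the run above. "--prompt instruct" is assumed here;
# Llama 2's [INST] format likely has to be added manually (see linked issue).
# A larger --max_length_generation may also be needed for longer chat prompts.
accelerate launch main.py \
  --model meta-llama/Llama-2-13b-chat-hf \
  --tasks humanevalsynthesize-python \
  --prompt instruct \
  --do_sample True \
  --temperature 0.1 \
  --top_p 0.95 \
  --n_samples 10 \
  --batch_size 10 \
  --max_length_generation 512 \
  --precision fp16 \
  --allow_code_execution \
  --save_generations
```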

moyi-qwq commented 1 month ago

I made a mistake and confused the GSM8K results with the HumanEval results; I'm very sorry. The score for llama2-13b-chat can be found in the last panel of Figure 3 of https://arxiv.org/abs/2402.05120, and the footnote on page four clarifies that it is llama2-chat rather than the base model. According to that graph, the score is roughly 16%. Your framework scored around 18%, which is even better. That's impressive!