bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Llama3-8b pass@1 results are worse than reported #228

Open shuaiwang2022 opened 2 months ago

shuaiwang2022 commented 2 months ago

image

loubnabnl commented 2 months ago

I think the Llama 3 blog post evaluates the instruct model, not the base model. Your pass@1 is close to the numbers that the community is reporting for llama3-8B-base. See: https://twitter.com/huybery/status/1781172838361334015

shuaiwang2022 commented 2 months ago

Despite using Llama-8b-base, I scored 31.71 on HumanEval, which is lower than their 33.50.
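For a sense of scale: HumanEval has 164 problems, so with one sample per problem each solved problem moves pass@1 by roughly 0.61 points. The sketch below converts the two scores into implied problem counts. It assumes a single sample per problem, and `solved_problems` is a made-up helper name for illustration, not part of the harness.

```python
HUMANEVAL_PROBLEMS = 164  # number of problems in the HumanEval test set


def solved_problems(pass_at_1_percent: float,
                    n_problems: int = HUMANEVAL_PROBLEMS) -> float:
    """Convert a pass@1 percentage into the implied number of solved problems."""
    return pass_at_1_percent / 100 * n_problems


mine = solved_problems(31.71)      # close to 52 problems
reported = solved_problems(33.50)  # close to 55 problems
gap = reported - mine              # a difference of about 3 problems

print(f"mine={mine:.2f} reported={reported:.2f} gap={gap:.2f}")
```

Under that assumption the gap between 31.71 and 33.50 is about three problems out of 164, which is small enough that decoding settings or post-processing could plausibly account for it.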

shuaiwang2022 commented 2 months ago

I scored 57.32 on Llama-3-8B-Instruct.

image

TheEmancipator commented 2 months ago

n_samples looks too small to give a reliable result. Try again with n_samples >= 50.

loubnabnl commented 2 months ago

Indeed, if you're using n_samples=1, set the generation to greedy with do_sample=False. To use sampling (which generalizes better with a high number of samples), set n_samples to 50. Note that there might still be evaluation differences if the results come from different setups or frameworks.
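To illustrate why the number of samples matters, here is a minimal sketch of the standard unbiased pass@k estimator introduced with HumanEval: for a problem with `n` generated samples of which `c` pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names are chosen for illustration, and whether the harness computes the metric in exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # which makes the estimate exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# With n=1 (greedy), each problem contributes either 0 or 1.
print(pass_at_k(1, 1, 1), pass_at_k(1, 0, 1))
# With n=10 and 3 correct samples, pass@1 is the fraction of correct samples.
print(pass_at_k(10, 3, 1))
```

With `n_samples=1` every problem contributes all-or-nothing, so a single sampled (non-greedy) generation per problem gives a noisy estimate. More samples per problem average that noise out.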

moyi-qwq commented 1 month ago

I also scored 57.32 on llama3-8b-instruct, which differs from the 62.2% reported in the llama3 blog. This discrepancy could be due to the prompt template I used. Since generating 50 samples would be slow, I opted for 10 instead.

    Evaluating generations...
    {
      "humaneval": {
        "pass@1": 0.573170731707317,
        "pass@10": 0.6524390243902439
      },
      "config": {
        "prefix": "",
        "do_sample": true,
        "temperature": 0.1,
        "top_k": 0,
        "top_p": 0.95,
        "n_samples": 10,
        "eos": "<|endoftext|>",
        "seed": 0,
        "model": "/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298",
        "modeltype": "causal",
        "peft_model": null,
        "revision": null,
        "use_auth_token": false,
        "trust_remote_code": false,
        "tasks": "humaneval",
        "instruction_tokens": null,
        "batch_size": 10,
        "max_length_generation": 512,
        "precision": "fp32",
        "load_in_8bit": false,
        "load_in_4bit": false,
        "left_padding": false,
        "limit": null,
        "limit_start": 0,
        "save_every_k_tasks": -1,
        "postprocess": true,
        "allow_code_execution": true,
        "generation_only": false,
        "load_generations_path": null,
        "load_data_path": null,
        "metric_output_path": "evaluation_results.json",
        "save_generations": true,
        "load_generations_intermediate_paths": null,
        "save_generations_path": "generations.json",
        "save_references": false,
        "save_references_path": "references.json",
        "prompt": "prompt",
        "max_memory_per_gpu": null,
        "check_references": false
      }
    }
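When comparing numbers across runs, it can help to diff the `config` blocks to see which generation settings actually differ. The sketch below is a hypothetical helper, not part of the harness. The first dictionary copies a few values from the config above, and the second is an invented greedy run used purely for illustration.

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# A subset of the settings from the run above.
run_sampling = {
    "do_sample": True,
    "temperature": 0.1,
    "n_samples": 10,
    "max_length_generation": 512,
}

# Hypothetical greedy run, made up for this example.
run_greedy = {
    "do_sample": False,
    "temperature": 0.1,
    "n_samples": 1,
    "max_length_generation": 512,
}

changed = diff_configs(run_sampling, run_greedy)
for key, (left, right) in changed.items():
    print(f"{key}: {left!r} -> {right!r}")
```

Settings outside the config, such as the prompt template, would not show up in a diff like this and need to be checked separately.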