bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

The results are different from those of the codellama paper #142

Closed ALLISWELL8 closed 5 months ago

ALLISWELL8 commented 1 year ago

I evaluated CodeLlama-7b-hf on HumanEval.

I got pass@1 = 0.2557317073170731, while the paper reports pass@1 = 0.335. My settings were temperature 0.8 and n_samples 100.

loubnabnl commented 1 year ago

Hi, to measure pass@1 you need to use a lower temperature, usually 0.2 (or greedy decoding with n_samples=1, which is what the CodeLlama authors used in their paper). With temperature 0.2 and n_samples=50 we get 29.9% pass@1, as shown in the Code leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard. The remaining discrepancy could be due to them using different post-processing or inference settings.
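For context, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: for each problem, generate n samples, count the c that pass the tests, and average 1 - C(n-c, k)/C(n, k) over all problems. For k=1 this reduces to c/n, the mean per-sample success rate, which is why sampling at a high temperature such as 0.8 tends to lower pass@1. Below is a minimal sketch of that estimator; the function names and the toy counts are illustrative and are not taken from the harness or from real results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy numbers (not real results): 4 problems, 100 samples each.
toy_counts = [100, 50, 10, 0]
print(mean_pass_at_k(toy_counts, n=100, k=1))   # equals the mean of c/n
print(mean_pass_at_k(toy_counts, n=100, k=10))
```

With k=1 the score only rewards samples that are individually correct, so a temperature that adds diversity helps pass@10 or pass@100 far more than it helps pass@1.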

ALLISWELL8 commented 1 year ago

Okay, thank you very much. I have another question: I want to evaluate CodeLlama-34b, and my server has three A100 cards. How should I set the parameters to run a model that needs more than 80GB of GPU memory?

loubnabnl commented 1 year ago

You need to configure accelerate to use the 3 GPUs by running accelerate config. If you want it to shard the model across GPUs and automatically determine how much memory is needed, you can pass the flag --max_memory_per_gpu 'auto'. You can also set the max memory to a specific value, e.g. 50GB.

Below is an example of how we evaluated Falcon-180B on 8 A100s (note: when running accelerate config, don't select mixed precision; we add it using the --precision flag in the command):

python main.py \
    --model $org/$model \
    --tasks $task \
    --max_length_generation 512 \
    --batch_size 1 \
    --n_samples 1 \
    --do_sample False \
    --precision bf16 \
    --max_memory_per_gpu 'auto' \
    --allow_code_execution \
    --trust_remote_code \
    --save_generations \
    --use_auth_token \
    --generation_only \
    --save_generations_path $out_path/generations_${task}_${model}.json

But I think CodeLlama 34B fits on one A100 80GB in bf16 when doing greedy evaluation (i.e. do_sample=False, n_samples=1).
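The single-GPU estimate can be sanity-checked with back-of-the-envelope arithmetic: bf16 stores each parameter in 2 bytes, so the weights alone need roughly 2 bytes times the parameter count. The sketch below assumes a nominal 34e9 parameters and ignores activations, the KV cache, and framework overhead, so treat the result as a lower bound rather than a measurement.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


n_params = 34e9  # nominal size of a "34B" model (assumption)
need = weight_memory_gb(n_params, "bf16")
print(f"bf16 weights: ~{need:.0f} GB")
print("fits on one 80 GB card (weights only):", need < 80)
print(f"per GPU if split evenly over 3 cards: ~{need / 3:.1f} GB")
```

About 68 GB of weights leaves limited headroom on an 80 GB card, which is consistent with it working for greedy decoding at batch size 1 but possibly not for larger batches or long generations.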

ALLISWELL8 commented 1 year ago

Thank you very much. Today I ran the THUDM/chatglm2-6b model with this project and it failed with an error: assert self.padding_side == "left" AssertionError. If you have time, please take a look. Thank you again.

------------------ Original message ------------------ From: "bigcode-project/bigcode-evaluation-harness"; Sent: Thursday, October 12, 2023, 4:26 PM; Subject: Re: [bigcode-project/bigcode-evaluation-harness] The results are different from those of the codellama paper (Issue #142)


awasthiabhijeet commented 1 year ago

Hi @loubnabnl ,

Surprisingly, with CodeLlama-7B-Instruct and CodeLlama-13B-Instruct, I observe better numbers on HumanEvalSynthesize than what is reported in the CodeLlama paper.

This is the pass@1 score that the CodeLlama paper reports for the Instruct models: [image]

I observe pass@1 scores of 47 and 50.6 with 7B and 13B Instruct models.

Could this be due to better post-processing in this library? (I am assuming HumanEval and HumanEvalSynthesize are the same.)

Here is my output for the 13B model.

{
  "humanevalsynthesize-python": {
    "pass@1": 0.5060975609756098
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.2,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "ckpt_copy",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": true,
    "tasks": "humanevalsynthesize-python",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "limit": null,
    "limit_start": 0,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "ckpt_copy/evaluation_humanevalsynthesize-python_codellama.json",
    "save_generations": true,
    "save_generations_path": "ckpt_copy/generations_humanevalsynthesize-python_codellama.json",
    "save_references": false,
    "prompt": "codellama",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
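With greedy decoding and n_samples=1, pass@1 is simply the fraction of problems whose single completion passes the tests. Assuming this task has the same 164 problems as the original HumanEval, the reported value can be converted back to a problem count as a quick consistency check. The variable names below are illustrative.

```python
import json

# Trimmed copy of the metrics shown above.
metrics = json.loads('{"humanevalsynthesize-python": {"pass@1": 0.5060975609756098}}')

N_PROBLEMS = 164  # size of HumanEval; assumed to match this task
score = metrics["humanevalsynthesize-python"]["pass@1"]
solved = round(score * N_PROBLEMS)

print(f"pass@1 = {score:.1%}, i.e. {solved} of {N_PROBLEMS} problems solved")
# The score should be an exact multiple of 1/N_PROBLEMS for a greedy run.
assert abs(solved / N_PROBLEMS - score) < 1e-12
```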
loubnabnl commented 1 year ago

Hello, the CodeLlama paper doesn't use HumanEvalSynthesize; they use base HumanEval even for the instruct models, which explains the gap.