bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Cannot Reproduce SantaCoder pass@1 on HumanEval #138

Closed: ZhangzihanGit closed this issue 1 year ago

ZhangzihanGit commented 1 year ago

Hi,

I tried to reproduce the pass@1 result of the SantaCoder model on HumanEval using this test suite. However, pass@1 is always 0.

I manually checked the model generations and found that the model always generates repeated nonsense tokens. For example, for the first test input:

"from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n

and the model generates:

._._._.. is..Get..Get..)._._...._.).Can._........ is. is_._.._)...Get._ can._)_._ can_. can_ can_. is. is_). is.Get..___._ can_ is_._.Get can. is.GetCan_._._ can. can_ can. can can is. default_. is default_ can. == can. default_ can_ can_ can can can.Get). can. default_Get is_ can_ is. can. can_ default_ can. is can. is default_ default_ can. can_ default_ can_ can. can. is default_ can is default can default_ get_ default_ can can is default_ can can_ can. can. can default_Get can. can default can default_Get. can is default_ can default_Get can default_ can is default_ default_ can default_ default_ default_ is can can can is_ default_ can_ default is default_ default_ can default_GetGet, get_ default_Get). can can_ default_Get) can default_ can default_ can_ can. is default_ default_ default_ get_ can. default can can can_ default can default_ default_ default_ default_  default_ default_ default_ default_  default can can can can_ default_ default_ default_ default_ default_  can default_ default_ default_  can_ can_ default_ default_ can  default_  default_ default_ default_ can_ is default_  default_ can_ default_ get default_Get
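A quick way to spot this kind of output without reading every sample is to measure how repetitive each generation is. The following is only a minimal sketch: it assumes the saved generations file is a JSON list of tasks, each holding a list of sample strings, and the helper names and the 0.3 cutoff are hypothetical choices for illustration, not part of the harness.

```python
import json
from pathlib import Path


def distinct_token_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are unique (1.0 = no repeats)."""
    tokens = text.split()
    if not tokens:
        # Treat empty output as maximally suspicious.
        return 0.0
    return len(set(tokens)) / len(tokens)


def flag_degenerate(generations, threshold=0.3):
    """Return (task_index, sample_index) pairs whose text looks repetitive.

    Assumption: `generations` is a list of tasks, each a list of sample strings.
    The default threshold is an arbitrary illustrative cutoff.
    """
    flagged = []
    for task_idx, samples in enumerate(generations):
        for sample_idx, text in enumerate(samples):
            if distinct_token_ratio(text) < threshold:
                flagged.append((task_idx, sample_idx))
    return flagged


if __name__ == "__main__":
    # File name taken from --save_generations_path in the script in this issue.
    path = Path("generations_py.json")
    if path.exists():
        generations = json.loads(path.read_text())
        print(flag_degenerate(generations))
```

If the saved samples include the prompt, its varied tokens raise the ratio, so the threshold may need tuning before the check is useful.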

My execution script is:

accelerate launch main.py \
        --model bigcode/santacoder \
        --max_length_generation 512 \
        --tasks humaneval \
        --precision bf16 \
        --temperature 0.2 \
        --top_k 0 \
        --top_p 0.95 \
        --n_samples 20 \
        --batch_size 20 \
        --seed 10 \
        --generation_only \
        --save_generations \
        --save_generations_path generations_py.json \
        --use_auth_token

I have also tested other models, such as codellama/CodeLlama-7b-hf and meta-llama/Llama-2-7b-hf, using the exact same script (changing only the model name), and I can reproduce pass@1 scores similar to those reported in the papers.

Can you please help with this issue?

Thank you!

ZhangzihanGit commented 1 year ago

OK, I forgot to add --trust_remote_code. After adding this flag, the pass@1 score is reasonable.
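For anyone hitting the same problem, the corrected invocation can be sketched as a small Python wrapper that assembles the command from the script above plus the missing flag. The function name build_command is hypothetical and every flag is copied from this issue. The likely reason the flag matters is that the SantaCoder repository on the Hub ships its own modeling code, which is only used when remote code is trusted, but treat that explanation as an assumption rather than a confirmed diagnosis.

```python
import shlex


def build_command(model="bigcode/santacoder", trust_remote_code=True):
    """Assemble the harness command from this issue as an argument list."""
    cmd = [
        "accelerate", "launch", "main.py",
        "--model", model,
        "--max_length_generation", "512",
        "--tasks", "humaneval",
        "--precision", "bf16",
        "--temperature", "0.2",
        "--top_k", "0",
        "--top_p", "0.95",
        "--n_samples", "20",
        "--batch_size", "20",
        "--seed", "10",
        "--generation_only",
        "--save_generations",
        "--save_generations_path", "generations_py.json",
        "--use_auth_token",
    ]
    if trust_remote_code:
        # The flag that was missing from the original script.
        cmd.append("--trust_remote_code")
    return cmd


if __name__ == "__main__":
    # Print the command for copy-pasting; it is not executed here.
    print(shlex.join(build_command()))
```

Printing rather than running the command keeps the sketch safe to try; the resulting line can be pasted into a shell from the harness directory.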

ZhangzihanGit commented 1 year ago

Closing this issue.