I tried to reproduce the pass@1 result of the SantaCoder model on HumanEval using this test suite. However, pass@1 is always 0.
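For context, here is a minimal sketch of how pass@k is commonly computed, assuming the suite uses the standard unbiased estimator introduced with HumanEval. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and not taken from the suite.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: the k in pass@k
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this estimator, pass@1 for a problem reduces to c / n, so a benchmark-wide score of exactly 0 means that no sample passed for any problem, which is consistent with every generation being degenerate.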
I manually checked the model generations and found that the model always generates repeated nonsense tokens. For example, for the first test input:
"from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n given threshold.\n >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n False\n >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n True\n
and the model generates:
._._._.. is..Get..Get..)._._...._.).Can._........ is. is_._.._)...Get._ can._)_._ can_. can_ can_. is. is_). is.Get..___._ can_ is_._.Get can. is.GetCan_._._ can. can_ can. can can is. default_. is default_ can. == can. default_ can_ can_ can can can.Get). can. default_Get is_ can_ is. can. can_ default_ can. is can. is default_ default_ can. can_ default_ can_ can. can. is default_ can is default can default_ get_ default_ can can is default_ can can_ can. can. can default_Get can. can default can default_Get. can is default_ can default_Get can default_ can is default_ default_ can default_ default_ default_ is can can can is_ default_ can_ default is default_ default_ can default_GetGet, get_ default_Get). can can_ default_Get) can default_ can default_ can_ can. is default_ default_ default_ get_ can. default can can can_ default can default_ default_ default_ default_ default_ default_ default_ default_ default can can can can_ default_ default_ default_ default_ default_ can default_ default_ default_ can_ can_ default_ default_ can default_ default_ default_ default_ can_ is default_ default_ can_ default_ get default_Get
I have also tested other models, such as codellama/CodeLlama-7b-hf and meta-llama/Llama-2-7b-hf, using the exact same script (changing only the model name), and I can reproduce pass@1 scores similar to those reported in the papers.
My execution script is:
Can you please help with this issue?
Thank you!