bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Using the humanevalpack to test the ChatGLM3 model results in an abnormal score. #251

Open burger-pb opened 4 months ago

burger-pb commented 4 months ago

Hi, when I tried to test the ChatGLM3 model using the humanevalfixdocs-python task from humanevalpack, I got an abnormal score of 0. The command I used is as follows.

```
accelerate launch main.py \
  --model THUDM/chatglm3-6b \
  --left_padding \
  --tasks humanevalfixdocs-python \
  --max_length_generation 2048 \
  --prompt chatglm3 \
  --trust_remote_code \
  --temperature 0.7 \
  --do_sample True \
  --n_samples 1 \
  --batch_size 64 \
  --precision bf16 \
  --allow_code_execution \
  --save_generations
```
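To narrow the problem down outside the harness, a minimal sanity check like the sketch below could show whether chatglm3-6b generates anything past a prompt in this format at all. It assumes the checkpoint loads via AutoModelForCausalLM with trust_remote_code (as the harness does) and that a bf16-capable GPU is available; the prompt contents are just placeholders, not the harness's actual task prompt.

```python
# Sketch: generate once outside the harness with the same <|user|>/<|assistant|> format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

# Toy "fix this function" prompt in the same format as the --prompt chatglm3 branch.
prompt = "<|user|>Fix the bug in this function:\ndef add(a, b):\n    return a - b\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Print only the newly generated tokens; if this is empty, the issue is the
# prompt/stopping behaviour rather than the evaluation itself.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```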

The prompt I used is as follows.

```python
elif self.prompt == "chatglm3":
    prompt = f"<|user|>{inp}<|assistant|>{prompt_base}"
else:
    raise ValueError(f"The --prompt argument {self.prompt} wasn't provided or isn't supported")
```
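One thing worth checking (a sketch with placeholder inputs) is whether `<|user|>` and `<|assistant|>` in this hand-built string are actually tokenized as ChatGLM3's single special role tokens rather than split into ordinary subword pieces, and how the string compares with whatever chat-building helper the custom tokenizer ships; the official conversation format may also expect newlines after the role tags.

```python
# Sketch: inspect how the hand-built prompt is tokenized vs. the tokenizer's own helper.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

inp = "Fix the bug in this function."        # placeholder for the task instruction
prompt_base = "def add(a, b):\n    return"   # placeholder for the code context
prompt = f"<|user|>{inp}<|assistant|>{prompt_base}"

print(tokenizer.tokenize(prompt))

# If the custom tokenizer provides a chat-building helper (some chatglm3-6b revisions do),
# compare its token sequence with the string built above.
build_chat_input = getattr(tokenizer, "build_chat_input", None)
if build_chat_input is not None:
    ids = build_chat_input(inp, history=[], role="user")["input_ids"][0]
    print(tokenizer.convert_ids_to_tokens(ids.tolist()))
```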

The issue I encountered is that the model did not add anything to the generations. The output for the first question is shown below; it contains only the original prompt, with no response from the model.

```
from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"
```
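For completeness, here is a small sketch to confirm from the saved file that the generations really contain nothing beyond the prompts. It assumes the harness's default output path generations.json; adjust the filename if --save_generations_path was set.

```python
# Sketch: check the lengths and tails of the saved generations.
import json

with open("generations.json") as f:
    generations = json.load(f)  # list of lists: one inner list per task, n_samples entries each

for i, gens in enumerate(generations[:5]):
    for g in gens:
        # If the tail is still the original docstring/prompt, the model added nothing.
        print(f"task {i}: {len(g)} chars, tail: {g[-120:]!r}")
```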