bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Evaluation of the task humanevalexplaindescribe #136

Closed wang-weishi closed 1 year ago

wang-weishi commented 1 year ago

I have the model's generations and I am planning to evaluate them. However, I observed this error message: ValueError: ExplainDescribe should be run with --generation_only. May I know how to use the evaluation-only mode for the HumanEval tasks?

python main.py --tasks humanevalexplaindescribe-python --allow_code_execution --load_generations_path generations_humanevalexplaindescribepython_octocoder.json --model bigcode/octocoder --load_data_path generations_humanevalexplaindescribepython_octocoder.json --n_samples=1

wang-weishi commented 1 year ago

One more question. Are llama and llama 2 supported by this framework? Thank you so much for your help and patience.

loubnabnl commented 1 year ago

Hi, yes, llama and llama-2 are supported in the framework.

The evaluation-only setup is triggered by the --load_generations_path flag for all tasks except HumanEvalExplainDescribe, where you generate text explanations, so there isn't any code to execute. Those generations should be used in a second step with humanevalexplainsynthesize-python to generate code from the explanations and assess their quality. See the docs for more details on the HumanEvalExplain task.

Below are the commands for running OctoCoder on this task (in the second step we load the generations with --load_data_path rather than --load_generations_path, since they are extra data used for generation, not the final generations we want to execute).

# step 1: generate explanations
accelerate launch main.py \
--model bigcode/octocoder  \
--tasks humanevalexplaindescribe-python \
--generation_only \
--do_sample True \
--temperature 0.2 \
--n_samples 20 \
--batch_size 5 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--save_generations_path generations_humanevalexplaindescribepython_octocoder.json \
--max_length_generation 2048 \
--precision bf16
# step 2: generate code from explanations and execute it
accelerate launch main.py \
--model bigcode/octocoder  \
--tasks humanevalexplainsynthesize-python \
--do_sample True \
--temperature 0.2 \
--n_samples 1 \
--batch_size 1 \
--allow_code_execution \
--save_generations \
--trust_remote_code \
--prompt octocoder \
--load_data_path generations_humanevalexplaindescribepython_octocoder.json \
--save_generations_path generations_humanevalexplainsynthesizepython_octocoder.json \
--metric_output_path evaluation_humanevalexplainpython_octocoder.json \
--max_length_generation 2048 \
--precision bf16

cc @Muennighoff

wang-weishi commented 1 year ago

Thank you for the prompt response, and thanks again for the excellent contribution to the research community.

wang-weishi commented 1 year ago

@loubnabnl, thanks for the evaluation steps. I faced this error when loading the dataset. May I know how to solve this issue? Thank you.

def get_dataset(self):
    """Returns dataset for the task or an iterable of any object, that get_prompt can handle"""
    dataset = []
    for description, sample in zip(self.descriptions, self.dataset["test"]):
        for description_candidate in description:
            dataset.append({"description": description_candidate} | sample)
    return dataset

Traceback (most recent call last):
  File "main.py", line 328, in <module>
    main()
  File "main.py", line 314, in main
    results[task] = evaluator.evaluate(task)
  File "/home/weishi/bigcode-evaluation-harness/lm_eval/evaluator.py", line 76, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/weishi/bigcode-evaluation-harness/lm_eval/evaluator.py", line 43, in generate_text
    dataset = task.get_dataset()
  File "/home/weishi/bigcode-evaluation-harness/lm_eval/tasks/humanevalpack.py", line 622, in get_dataset
    dataset.append({"description": description_candidate} | sample)
TypeError: unsupported operand type(s) for |: 'dict' and 'dict'

Muennighoff commented 1 year ago

What is your Python version? This line of the code needs Python >= 3.9. You can probably also simply change the line so it does not use the | operator.
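
For illustration, here is a minimal sketch of that change, assuming description_candidate and sample are the same variables as in the get_dataset method quoted above: dict unpacking gives the same merge result and also works on Python versions older than 3.9.

# Sketch: equivalent of {"description": description_candidate} | sample
# without the Python 3.9+ dict-merge operator; keys in sample still take precedence.
dataset.append({**{"description": description_candidate}, **sample})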