bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Phi 1.5 evaluation problem #150

Closed. Anindyadeep closed this issue 8 months ago.

Anindyadeep commented 8 months ago

System info

Google Colab (free version)

The command I used:

!accelerate launch  main.py \
  --model microsoft/phi-1_5 \
  --tasks humanevalsynthesize-rust \
  --limit 10 \
  --temperature 0 \
  --do_sample True \
  --n_samples 100 \
  --batch_size 1 \
  --allow_code_execution \
  --save_generations \
  --trust_remote_code

I got this error:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2023-10-18 18:05:56.298975: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Selected Tasks: ['humanevalsynthesize-rust']
Loading model in fp32
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading pytorch_model.bin: 100% 2.84G/2.84G [00:17<00:00, 163MB/s]
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading (…)neration_config.json: 100% 69.0/69.0 [00:00<00:00, 404kB/s]
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py:655: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
Downloading (…)okenizer_config.json: 100% 237/237 [00:00<00:00, 1.48MB/s]
Downloading (…)olve/main/vocab.json: 100% 798k/798k [00:00<00:00, 14.6MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 2.32MB/s]
Downloading (…)/main/tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 16.9MB/s]
Downloading (…)in/added_tokens.json: 100% 1.08k/1.08k [00:00<00:00, 5.61MB/s]
Downloading (…)cial_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 492kB/s]
Downloading builder script: 100% 6.22k/6.22k [00:00<00:00, 25.5MB/s]
Downloading readme: 100% 7.59k/7.59k [00:00<00:00, 26.3MB/s]
Downloading data: 100% 497k/497k [00:00<00:00, 9.41MB/s]
Generating test split: 164 examples [00:00, 2090.55 examples/s]
number of problems for this task is 10
  0% 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/bigcode-evaluation-harness/main.py", line 356, in <module>
    main()
  File "/content/bigcode-evaluation-harness/main.py", line 342, in main
    results[task] = evaluator.evaluate(task)
  File "/content/bigcode-evaluation-harness/lm_eval/evaluator.py", line 76, in evaluate
    generations, references = self.generate_text(task_name)
  File "/content/bigcode-evaluation-harness/lm_eval/evaluator.py", line 55, in generate_text
    generations = parallel_generations(
  File "/content/bigcode-evaluation-harness/lm_eval/generation.py", line 127, in parallel_generations
    generations = complete_code(
  File "/content/bigcode-evaluation-harness/lm_eval/utils.py", line 243, in complete_code
    for step, batch in tqdm(
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 560, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 523, in _fetch_batches
    batches.append(next(iterator))
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/content/bigcode-evaluation-harness/lm_eval/utils.py", line 53, in __iter__
    prompt_contents = self.task.get_prompt(self.dataset[sample])
  File "/content/bigcode-evaluation-harness/lm_eval/tasks/humanevalpack.py", line 657, in get_prompt
    return super().get_prompt(prompt_base, instruction)
  File "/content/bigcode-evaluation-harness/lm_eval/tasks/humanevalpack.py", line 229, in get_prompt
    raise NotImplementedError
NotImplementedError
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'main.py', '--model', 'microsoft/phi-1_5', '--tasks', 'humanevalsynthesize-rust', '--limit', '10', '--temperature', '0', '--do_sample', 'True', '--n_samples', '100', '--batch_size', '1', '--allow_code_execution', '--save_generations', '--trust_remote_code']' returned non-zero exit status 1.

This looks like a NotImplementedError, so is this evaluation not available for Phi models?

loubnabnl commented 8 months ago

Hello, the harness supports all decoder transformer models. The error is because you didn't provide a --prompt argument in your command; I'll update the code to show a more informative message.

HumanEvalSynthesize requires a --prompt type (see docs) that specifies how the instruction should be formatted. For example, you can use --prompt instruct, which concatenates the English instruction with the function signature (see).
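
For intuition, here is a rough Python sketch of how a prompt type drives prompt construction in this kind of instruction task; the function and prompt-type names are illustrative rather than the harness's exact code, but it shows why omitting --prompt surfaces as the NotImplementedError in the traceback above:

# Simplified, illustrative sketch (not the harness's actual implementation) of how
# a --prompt type selects the prompt format for an instruction-based task.
def get_prompt(prompt_base: str, instruction: str, prompt_type: str) -> str:
    """Combine the task instruction and the function signature into a model prompt."""
    if prompt_type == "instruct":
        # --prompt instruct: English instruction followed by the function signature.
        return f"{instruction}\n{prompt_base}"
    # An unrecognized (or missing) prompt type cannot be formatted,
    # which is what shows up as the NotImplementedError above.
    raise NotImplementedError(f"Unknown prompt type: {prompt_type!r}")


# Hypothetical usage with a Rust-style problem:
print(get_prompt(
    prompt_base="fn add(a: i32, b: i32) -> i32 {",
    instruction="Write a Rust function that returns the sum of two integers.",
    prompt_type="instruct",
))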

Also, are you trying to do greedy decoding, since you set the temperature to 0? In that case I'd advise passing --do_sample False instead, and you won't need 100 samples per problem: greedy decoding doesn't sample, so all generations for a prompt would be identical (there's a short note on the sampling settings after the command below). You can try this for evaluation on the first 10 problems in greedy mode:

accelerate launch  main.py \
  --model microsoft/phi-1_5 \
  --tasks humanevalsynthesize-rust \
  --limit 10 \
  --do_sample False \
  --prompt instruct \
  --n_samples 1 \
  --batch_size 1 \
  --allow_code_execution \
  --save_generations \
  --trust_remote_code \
  --max_length_generation 2048
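
To illustrate the sampling point with plain transformers (just a sketch, not the harness's internal generation loop, and the prompt is a placeholder): greedy decoding always takes the highest-probability token, so one sample per problem is enough, whereas sampling with a temperature is stochastic and needs many samples for pass@k.

# Sketch of greedy vs. sampled generation with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)

inputs = tokenizer("fn add(a: i32, b: i32) -> i32 {", return_tensors="pt")

# Greedy decoding: deterministic, corresponds to --do_sample False --n_samples 1.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64)

# Sampling: stochastic, corresponds to --do_sample True with a temperature and many --n_samples.
sampled = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=64)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
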
Anindyadeep commented 8 months ago

Thanks, it worked. I was new to the harness when I opened this issue, but applying those suggestions fixed the problem. Thanks!