bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

support for batch size > 1 for single problem generations (n_samples=1) #23

Open Muennighoff opened 1 year ago

Muennighoff commented 1 year ago

The command below fails with `--batch_size 16`; it works when setting batch_size to 1 🧐

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|██████████| 3/3 [00:00<00:00, 840.09it/s]
100%|██████████| 3/3 [00:00<00:00, 949.22it/s]
number of problems for this task is 14918
0it [00:06, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 83, in complete_code
    for step, batch in tqdm(enumerate(dataloader)):
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/data_loader.py", line 491, in __iter__
    observed_batch_size = find_batch_size(batch)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/accelerate/utils/operations.py", line 177, in find_batch_size
    raise TypeError(f"Can only find the batch size of tensors but got {type(data)}.")
TypeError: Can only find the batch size of tensors but got <class 'NoneType'>.
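For context, the `TypeError` is raised by accelerate's `find_batch_size` (see the traceback path above), which can size tensors and nested containers of tensors but not `None` — so a batch that arrives as `None` fails immediately. A minimal standalone illustration, not the harness code:

```python
# Minimal illustration of the TypeError above: accelerate's find_batch_size
# can size tensors (and nested containers of tensors), but a None batch
# raises exactly the error shown in the traceback.
import torch
from accelerate.utils.operations import find_batch_size

print(find_batch_size(torch.zeros(16, 8)))  # -> 16
find_batch_size(None)  # TypeError: Can only find the batch size of tensors ...
```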

Probably related:

(bigcode) niklas@hf-dgx-01:~/bigcode-evaluation-harness$ accelerate launch main.py --model bigcode/christmas-models --revision fim --tasks codexglue_code_to_text-python --limit 8 --max_length_generation 512 --do_sample False --n_samples 100 --batch_size 16
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_cpu_threads_per_process` was set to `64` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Selected Tasks: ['codexglue_code_to_text-python']
Loading the model and tokenizer
100%|██████████| 3/3 [00:00<00:00, 782.52it/s]
100%|██████████| 3/3 [00:00<00:00, 901.68it/s]
number of problems for this task is 8
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 188, in <module>
    main()
  File "/home/niklas/bigcode-evaluation-harness/main.py", line 175, in main
    results[task] = evaluator.evaluate(task)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 62, in evaluate
    generations, references = self.generate_text(task_name)
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/evaluator.py", line 45, in generate_text
    generations = parallel_generations(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/generation.py", line 82, in parallel_generations
    generations = complete_code(
  File "/home/niklas/bigcode-evaluation-harness/lm_eval/utils.py", line 87, in complete_code
    generated_tokens = accelerator.unwrap_model(model).generate(
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/niklas/miniconda3/envs/bigcode/lib/python3.10/site-packages/transformers/generation/utils.py", line 1513, in generate
    raise ValueError(
ValueError: num_return_sequences has to be 1, but is 16 when doing greedy search.
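This second failure is raised by transformers itself rather than the harness: greedy search is deterministic, so `generate()` rejects `num_return_sequences > 1` unless sampling (or beam search) is enabled. A minimal repro sketch, with `gpt2` as a stand-in model:

```python
# Minimal repro sketch of the ValueError above: greedy decoding is
# deterministic, so asking for multiple return sequences is rejected;
# enabling sampling makes the same call valid.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("def hello():", return_tensors="pt")

model.generate(**inputs, max_new_tokens=16,
               do_sample=True, num_return_sequences=16)   # ok
model.generate(**inputs, max_new_tokens=16,
               do_sample=False, num_return_sequences=16)  # ValueError
```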
loubnabnl commented 1 year ago

We currently parallelize generations only for tasks that require multiple generations per problem, like HumanEval and MBPP. So batch_size should only be increased when the number of candidate solutions n_samples is higher than 1, which is not the case here.
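Roughly, the current scheme batches copies of a single problem's prompt, so batch_size only pays off up to n_samples. A simplified sketch with a hypothetical helper name, not the actual harness code:

```python
# Simplified sketch of the current per-problem batching (hypothetical
# helper, not the actual harness code): only copies of ONE prompt are
# batched, so with n_samples == 1 there is nothing extra to fill a batch.
def generate_for_problem(model, tok, prompt, n_samples, batch_size, **gen_kwargs):
    inputs = tok(prompt, return_tensors="pt")
    samples = []
    for start in range(0, n_samples, batch_size):
        n = min(batch_size, n_samples - start)
        out = model.generate(**inputs, do_sample=True,
                             num_return_sequences=n, **gen_kwargs)
        samples.extend(tok.batch_decode(out, skip_special_tokens=True))
    return samples  # len(samples) == n_samples
```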

Muennighoff commented 1 year ago

👍; it would make sense to also support batch_size for batching multiple problems even when n_samples is 1, no?

loubnabnl commented 1 year ago

Yes, definitely! Especially since some benchmarks can have thousands of problems.
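One possible shape for this, as a sketch under assumptions rather than a committed design: when n_samples is 1, pad different prompts together and run them through `generate()` as one batch.

```python
# Possible cross-problem batching for n_samples == 1 (a sketch, not a
# committed design): pad DIFFERENT prompts together on the left so
# generate() continues each prompt from a full batch.
def generate_single_sample(model, tok, prompts, batch_size, **gen_kwargs):
    tok.padding_side = "left"          # left padding: generation appends on the right
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token  # reuse EOS when no pad token is defined
    generations = []
    for start in range(0, len(prompts), batch_size):
        batch = tok(prompts[start:start + batch_size],
                    return_tensors="pt", padding=True)
        out = model.generate(**batch, **gen_kwargs)
        generations.extend(tok.batch_decode(out, skip_special_tokens=True))
    return generations  # one generation per problem
```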

infinitylogesh commented 1 year ago

I am interested in working on this, if nobody else is currently working on it :)

loubnabnl commented 1 year ago

Yes, feel free to work on it!