bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Fix MBPP bug with transformers 4.38+ #236

Closed: edgan8 closed this issue 5 months ago

edgan8 commented 6 months ago

Ever since transformers 4.38, the library raises an exception if max_length is set to a value that does not leave room beyond the input size. This means the harness fails when running MBPP. However, we need to be able to set the max length to replicate MBPP results effectively.
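Roughly speaking, the check added in transformers 4.38 behaves like the sketch below (a simplified illustration based on the error message, not the library's exact _validate_generated_length code):

# Simplified sketch of the length validation transformers 4.38 performs before
# generating. Illustrative only, not the actual implementation.
def validate_generated_length(input_ids_length: int, max_length: int) -> None:
    if input_ids_length >= max_length:
        raise ValueError(
            f"Input length of input_ids is {input_ids_length}, but `max_length` is set to "
            f"{max_length}. This can lead to unexpected behavior. You should consider "
            "increasing `max_length` or, better yet, setting `max_new_tokens`."
        )

So when the prompt already fills all of max_length (as in the traceback below, where input_ids is 512 tokens with max_length 512), generation fails before a single token is produced.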

For example, the following code throws an exception with transformers 4.38 but not with 4.37.2:

accelerate launch main.py \
  --model ~/models/hf-code-llama-7b-instruct \
  --tasks mbpp \
  --max_length_generation 512 \
  --allow_code_execution \
  --precision bf16 \
  --do_sample False
 96%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎    | 120/125 [25:57<01:04, 12.98s/it]
[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/main.py", line 412, in <module>
[rank2]:     main()
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/main.py", line 396, in main
[rank2]:     results[task] = evaluator.evaluate(
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/bigcode_eval/evaluator.py", line 95, in evaluate
[rank2]:     generations, references = self.generate_text(task_name, intermediate_generations=intermediate_generations)
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/bigcode_eval/evaluator.py", line 69, in generate_text
[rank2]:     generations = parallel_generations(
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/bigcode_eval/generation.py", line 141, in parallel_generations
[rank2]:     generations = complete_code(
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/bigcode_eval/utils.py", line 300, in complete_code
[rank2]:     generated_tokens = model.generate(
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/venv_bug/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank2]:     return func(*args, **kwargs)
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/venv_bug/lib/python3.10/site-packages/transformers/generation/utils.py", line 1626, in generate
[rank2]:     self._validate_generated_length(generation_config, input_ids_length, has_default_max_length)
[rank2]:   File "/home/ubuntu/bigcode-evaluation-harness/venv_bug/lib/python3.10/site-packages/transformers/generation/utils.py", line 1176, in _validate_generated_length
[rank2]:     raise ValueError(
[rank2]: ValueError: Input length of input_ids is 512, but `max_length` is set to 512. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
W0520 22:48:43.069000 140562053400384 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2843 closing signal SIGTERM
W0520 22:48:43.070000 140562053400384 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2844 closing signal SIGTERM
W0520 22:48:43.070000 140562053400384 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2846 closing signal SIGTERM
E0520 22:48:43.448000 140562053400384 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 2 (pid: 2845) of binary: /home/ubuntu/bigcode-evaluation-harness/venv_bug/bin/python
loubnabnl commented 5 months ago

Hi, I think it's better to have the generation fail so the user increases max_length to fit all the prompts, rather than silently truncating and getting lower scores. So you should use a larger max_length for MBPP, such as 1024.

aladinggit commented 5 months ago

Hi, even when I use 1024 as the length, it still fails with exactly the same error message.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch main.py \
  --model meta-llama/Llama-2-7b-hf \
  --tasks mbpp \
  --precision bf16 \
  --max_length_generation 1024 \
  --allow_code_execution

The error message:

ValueError: Input length of input_ids is 1024, but max_length is set to 1024. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.

How did the input length increase along with the max length?

edgan8 commented 5 months ago

Apparently you need max_length = 2048. This is unreasonably high, especially since some base models may not even support such a long context.
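For what it's worth, a rough way to estimate how large max_length needs to be is to tokenize the prompts and add the desired generation budget on top. A hypothetical helper (not part of the harness; the 512-token budget is an assumption):

# Hypothetical helper for estimating a safe max_length; not part of the harness.
from transformers import AutoTokenizer

def estimate_max_length(model_path: str, prompts: list[str], new_token_budget: int = 512) -> int:
    # Longest prompt in tokens, plus the room we want for generated code.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    longest_prompt = max(len(tokenizer(p)["input_ids"]) for p in prompts)
    return longest_prompt + new_token_budget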

edgan8 commented 5 months ago

@loubnabnl what do you think about this PR to catch the exception and turn it into a warning: https://github.com/bigcode-project/bigcode-evaluation-harness/pull/244
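The idea is essentially to wrap the generate call, catch the length-validation ValueError, and emit a warning instead of crashing the whole run. A minimal sketch of that idea (names are illustrative, not the PR's exact diff):

import warnings

def generate_or_warn(model, generation_inputs: dict):
    # Sketch of the approach: if transformers refuses to generate because the
    # prompt already fills max_length, warn and skip instead of aborting the
    # evaluation. `generation_inputs` is a stand-in for whatever kwargs the
    # harness passes to model.generate().
    try:
        return model.generate(**generation_inputs)
    except ValueError as e:
        if "max_length" in str(e):
            warnings.warn(f"Skipping generation for this batch: {e}")
            return None
        raise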

loubnabnl commented 5 months ago

Yes, that works; I approved the PR.

loubnabnl commented 5 months ago

Closing as the PR was merged. Thanks for flagging the issue.