bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

MBPP eval extremely slow for CodeGen2 and Replit-Code #106

Closed AadSah closed 1 year ago

AadSah commented 1 year ago

Hi, I have been trying to evaluate the CodeGen2 and Replit-Code models on the mbpp task, but the code runs extremely slowly. While the eval time for other models is around 2 hours, the ETA for these two models varies significantly and sometimes exceeds 90 hours. Any help to resolve this issue? Thanks!

loubnabnl commented 1 year ago

The Replit model seems slow because the use_cache argument in its config is set to false. You can try cloning the model and changing it to true before you run inference; I opened a PR on their repo. For CodeGen2, which model exactly are you running?
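
For reference, instead of editing the cloned config file by hand, the override can also be done at load time. This is a minimal illustrative sketch using the standard transformers AutoConfig/AutoModelForCausalLM API, not the harness's own loading code:

# Sketch: force use_cache=True when loading the Replit model, so generation
# reuses the KV cache instead of recomputing past states at every step.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("replit/replit-code-v1-3b", trust_remote_code=True)
config.use_cache = True  # the shipped config sets this to false

model = AutoModelForCausalLM.from_pretrained(
    "replit/replit-code-v1-3b",
    config=config,
    trust_remote_code=True,
)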

Also, what batch size and how many GPUs are you using? You can try increasing the batch size to speed things up.

AadSah commented 1 year ago

Hi @loubnabnl, thanks for your reply! I am running the CodeGen2-3.7B model with a batch size of 10 on a single GPU. Here is the exact command I am using:

accelerate launch main.py \
  --model Salesforce/codegen2-3_7B \
  --tasks mbpp \
  --temperature 0.1 \
  --n_samples 15 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations \
  --metric_output_path codegen2-3.7B-results.json \
  --save_generations_path codegen2-3.7B-generations.json \
  --trust_remote_code 

loubnabnl commented 1 year ago

Regarding the Replit model, you should be able to run the evaluation in ~2 h on 1 GPU in full precision. You can also try fp16 or bf16 via the --precision argument to speed things up.

As for CodeGen2, if the model's inference is slow there's not much we can do about it from the evaluation-harness perspective, since the same command runs fast for other models. You can try measuring the tokens/sec generation speed and contacting the model's authors.
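
To check whether the model itself is the bottleneck, a rough throughput measurement along these lines can help. This is a standalone transformers sketch, not harness code; the prompt and generation settings are arbitrary:

# Rough sketch: measure generation throughput (tokens/sec) for a causal LM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen2-3_7B"  # model under test
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
start = time.time()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start
new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")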

dlvp commented 1 year ago

I confirm that inference with the CodeGen2 series is extremely slow compared to other models of the same size. CodeGen2.5 (7B), on the other hand, is fast (but based on a different architecture).