bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

The codellama-7b-hf pass@1 result is worse than in the paper #220

Closed · PeiqinSun closed this issue 2 months ago

PeiqinSun commented 2 months ago

This is my command:

accelerate launch  main.py \
  --model codellama/CodeLlama-7b-hf \
  --tasks humaneval \
  --max_length_generation 512 \
  --do_sample False \
  --n_samples 1 \
  --batch_size 1 \
  --precision bf16 \
  --allow_code_execution \
  --trust_remote_code \
  --save_generations
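
For reference, with --n_samples 1 and greedy decoding, pass@1 reduces to the fraction of problems whose single completion passes all unit tests. Below is a minimal sketch of the standard unbiased pass@k estimator from the Codex/HumanEval paper (Chen et al., 2021), written independently of the harness's own evaluation code:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator (Chen et al., 2021):
    #   n = samples generated per problem, c = samples that pass, k = budget
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With --n_samples 1 and greedy decoding (n = 1, k = 1), this is 1.0 if the
# single completion passes and 0.0 otherwise, so the reported score is just
# the fraction of problems solved.
print(pass_at_k(n=1, c=1, k=1))  # 1.0
print(pass_at_k(n=1, c=0, k=1))  # 0.0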

This is my result:

[screenshot of the evaluation output: humaneval pass@1 ≈ 29.8]

So, where did I go wrong?

loubnabnl commented 2 months ago

Hi, that's also the score we got for the BigCode leaderboard, and it's still in a similar range to what's reported in the paper (29.8 vs 32.3). The difference could be due to different post-processing or inference settings.
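
One common source of such gaps is the post-processing applied to generations before execution: completions are typically truncated at stop sequences so that only the first completed function body is run against the tests. A minimal sketch of that kind of truncation, assuming a hypothetical stop-word list (not necessarily the exact list the harness uses):

def truncate_at_stop_words(completion, stop_words=("\nclass", "\ndef", "\n#", "\nif", "\nprint")):
    # Cut the generated text at the earliest stop sequence, if any, so only
    # the first completed block is scored. The stop words here are illustrative.
    cut = len(completion)
    for stop in stop_words:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

Differences in this step, together with decoding settings, are the kind of thing that can account for a gap of this size between evaluation setups.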

PeiqinSun commented 2 months ago

Thanks for your reply.