bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Llama 7B fails for Human Eval #74

Closed · mnoukhov closed this issue 1 year ago

mnoukhov commented 1 year ago

Running HumanEval with Llama 7B here gets 0 for pass@1 and pass@10, but the model achieves the expected values (pass@1 around 10%) in other repos.

To reproduce, simply run

accelerate launch  main.py \
  --model huggyllama/llama-7b \
  --max_length_generation 512 \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --allow_code_execution

which returns

"humaneval": {
    "pass@1": 0.0,
    "pass@10": 0.0
  },

I reduced n_samples from 200 to 20 so the run takes about an hour on a single A100.
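For reference, here is the unbiased pass@k estimator from the Codex paper, which I believe is also what the harness's code_eval metric computes (a sketch, not the harness's actual code). It shows that even with n_samples=20, a couple of correct completions per problem should already give a clearly nonzero pass@1:

import numpy as np

# Unbiased pass@k estimator from the Codex paper (sketch; assumed to match
# what the harness's code_eval metric computes).
def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: correct samples, k: the k in pass@k."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With n_samples=20, just 2 correct completions per problem yield:
print(pass_at_k(20, 2, 1))   # 0.10
print(pass_at_k(20, 2, 10))  # ~0.76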

Running the same HumanEval evaluation with CodeCapybara's repo gives the expected values: {'pass@1': 0.09542682926829267, 'pass@10': 0.12530930709402355}

This could be related to https://github.com/huggingface/transformers/pull/22402, although I tried explicitly setting the eos, bos, and pad token ids to the same values as CodeCapybara (see here) and didn't see a change, so it might be something else.
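For context, this is roughly the kind of override I mean; the specific ids (bos=1, eos=2, pad=0) are an assumption mirroring the usual LLaMA convention and the CodeCapybara setup, not something the harness sets itself:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Pin the special token ids explicitly (values assumed from the usual
# LLaMA convention / the CodeCapybara setup).
tokenizer.bos_token_id = 1
tokenizer.eos_token_id = 2
tokenizer.pad_token_id = 0
model.config.bos_token_id = 1
model.config.eos_token_id = 2
model.config.pad_token_id = 0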

If anyone has successfully run this here, I would appreciate some tips!

loubnabnl commented 1 year ago

Can you check what the generations look like to see if there's a failure pattern? You can save them by passing --save_generations. In CodeCapybara, did you also use huggyllama/llama-7b or the one they use in their repo, decapoda-research/llama-7b-hf?
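Once saved, something like this is enough to eyeball them (the file name generations.json is an assumption; use whatever path --save_generations wrote):

import json

# Load the saved generations: a list with one entry per HumanEval problem,
# each entry holding the n_samples completions for that problem.
with open("generations.json") as f:
    generations = json.load(f)

# Print the first completion of the first few problems to spot a pattern
# (e.g. empty strings, cut-off code, or repeated prompts).
for problem_idx, samples in enumerate(generations[:3]):
    print(f"=== problem {problem_idx} ===")
    print(samples[0])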

loubnabnl commented 1 year ago

Closing this issue as it was fixed by https://github.com/bigcode-project/bigcode-evaluation-harness/pull/81. Your command now returns:

{
  "humaneval": {
    "pass@1": 0.10518292682926832,
    "pass@10": 0.1760411160613154
  },
  "config": {
    "model": "huggyllama/llama-7b",
    "temperature": 0.2,
    "n_samples": 20
  }
}