bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

failed evaluation on GSM8K #98

Closed tangzhy closed 1 year ago

tangzhy commented 1 year ago

I tried to run your code in a docker container from ghcr.io/bigcode-project/evaluation-harness.

The exact bash command is:

accelerate launch  main.py \
  --model bigcode/starcoder \
  --use_auth_token \
  --max_length_generation 512 \
  --tasks pal-gsm8k-greedy \
  --n_samples 1 \
  --temperature 0 \
  --batch_size 1 \
  --do_sample False \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./output/starcoder_on_gsm8k.json

However, it returns the following:

Evaluating generations...
{
  "pal-gsm8k-greedy": {
    "accuracy": 0.0,
    "num_failed_execution": 1319
  },
  "config": {
    "model": "bigcode/starcoder",
    "revision": null,
    "temperature": 0.0,
    "n_samples": 1
  }
}

where every generation failed to execute (the GSM8K test set has 1,319 problems), and the saved generation contents look like:

[Screenshot of the saved generation contents]

Any solutions?

Vipitis commented 1 year ago

I believe you have to run with --use_auth_token as well as --trust_remote_code for models like StarCoder, since you need to agree to the terms to use them. I also believe it would be better for the evaluation to throw an error instead of running with these erroneous generations.
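
For reference, a common way to make a token available to --use_auth_token is to log in once with the Hugging Face CLI (assuming your account has already accepted the StarCoder license on the Hub):

huggingface-cli login   # paste a token from https://huggingface.co/settings/tokens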

tangzhy commented 1 year ago

> I believe you have to run with --use_auth_token as well as --trust_remote_code for models like StarCoder, since you need to agree to the terms to use them. I also believe it would be better for the evaluation to throw an error instead of running with these erroneous generations.

I've tried these, but the problem remains the same. I think it may result from parallel generation, and it's a reproduction failure that should be addressed by the official repo team.

tangzhy commented 1 year ago

It turns out to be a max_length issue: instead of 512, we should choose 2048 for this task.
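
For anyone hitting the same failure, the working command might look like this (an untested sketch: only the length is changed and the flags suggested above are added; all other arguments are kept as in the first report):

accelerate launch  main.py \
  --model bigcode/starcoder \
  --use_auth_token \
  --trust_remote_code \
  --max_length_generation 2048 \
  --tasks pal-gsm8k-greedy \
  --n_samples 1 \
  --temperature 0 \
  --batch_size 1 \
  --do_sample False \
  --allow_code_execution \
  --save_generations \
  --save_generations_path ./output/starcoder_on_gsm8k.json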

Many newcomers don't know how to choose an appropriate max_length; maybe the documentation should provide a sensible default for each task.

infinitylogesh commented 1 year ago

@tangzhy, thank you for your input. The prompts for GSM8K take up ~1500 tokens, so max_length has to be greater than that. We will update the docs to make this clear.
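
As a rough way to choose a safe value, one can measure the prompt length with the model's tokenizer and budget generation on top of it. A minimal sketch, assuming transformers is installed and using a placeholder string for the few-shot PAL prompt the task builds:

from transformers import AutoTokenizer

# The gated repo may require an access token (use_auth_token=True).
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

prompt = "..."  # placeholder for the few-shot PAL prompt of pal-gsm8k-greedy
prompt_tokens = len(tokenizer(prompt)["input_ids"])

max_length_generation = 2048
room_to_generate = max_length_generation - prompt_tokens
# With ~1500 prompt tokens this leaves ~500 tokens for the generated
# program; with max_length_generation=512 the prompt alone exceeds the
# budget, which is why every execution failed above.
print(prompt_tokens, room_to_generate)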