bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Getting Zeros for StarCoder on multiple-js #94

Closed · amitbcp closed this 1 year ago

amitbcp commented 1 year ago

I am running the following:

accelerate launch main.py \
  --model bigcode/starcoder \
  --max_length_generation 512 \
  --tasks multiple-js \
  --n_samples 120 \
  --batch_size 10 \
  --temperature 0.2 \
  --precision bf16 \
  --allow_code_execution --use_auth_token

The result is:

{
  "multiple-js": {
    "pass@1": 0.0,
    "pass@10": 0.0,
    "pass@100": 0.0
  },
  "config": {
    "model": "bigcode/starcoderbase",
    "temperature": 0.1,
    "n_samples": 120
  }
}

Are there any other parameters that I might be missing?

loubnabnl commented 1 year ago

Did you run the execution in a Docker container with all the dependencies installed?
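
For reference, the usual pattern is to split generation and execution: generate on your GPU machine, then run the execution step inside the container, which ships with the right toolchains (including a recent Node). A sketch using the same settings as above (the image name is illustrative; see the repo README for the exact build/pull instructions):

accelerate launch main.py \
  --model bigcode/starcoder \
  --max_length_generation 512 \
  --tasks multiple-js \
  --n_samples 120 \
  --batch_size 10 \
  --temperature 0.2 \
  --precision bf16 \
  --generation_only \
  --save_generations \
  --save_generations_path generations.json

docker run -v $(pwd)/generations.json:/app/generations.json:ro -it evaluation-harness-multiple \
  python3 main.py \
  --model bigcode/starcoder \
  --tasks multiple-js \
  --n_samples 120 \
  --load_generations_path /app/generations.json \
  --allow_code_execution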

arjunguha commented 1 year ago

I'm going to guess that this is a Node version issue. The MultiPL-E JS benchmarks rely on deepEqual, which requires a fairly recent version of Node. I think the version that ships with Ubuntu 20.04 is too old, but it works on Ubuntu 22.04.
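
A quick way to check (the NodeSource script below is just one way to get a newer Node on an older distro; inspect any script before piping it to bash):

# Ubuntu 20.04's apt package is Node 10.x, which is too old;
# 22.04 ships a newer one that works.
node --version

# Optional: install a recent Node via NodeSource on an older distro
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs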

amitbcp commented 1 year ago

I was trying it on my local setup with all the dependencies installed. Let me try using the Docker image. Can you please confirm that the command/hyperparameters shared above are correct? @loubnabnl @arjunguha

arjunguha commented 1 year ago

Correct. You may get something slightly lower than what is reported in the paper. The original MultiPL-E code (github.com/nuprl/MultiPL-E) also uses a length of 512, but interprets it as len(prompt_tokens) + 512, whereas the evaluation harness, I believe, counts the prompt within the 512 tokens. So you may need to increase it.
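
For example (650 is an arbitrary illustrative value that leaves headroom for the prompt tokens):

accelerate launch main.py \
  --model bigcode/starcoder \
  --max_length_generation 650 \
  --tasks multiple-js \
  --n_samples 120 \
  --batch_size 10 \
  --temperature 0.2 \
  --precision bf16 \
  --allow_code_execution --use_auth_token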

amitbcp commented 1 year ago

I was able to reproduce the results. Thanks!