bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Integrate MultiPL-E #44

Closed loubnabnl closed 1 year ago

loubnabnl commented 1 year ago

Integration of the MultiPL-E HumanEval version in 18 programming languages.

arjunguha commented 1 year ago

This code pulls out code that we normally run in the MultiPL-E evaluation container.

I think the easiest way to address the dependency problem is the following:

  1. Tell a user "you had better have dependencies installed!" (see the dependency-check sketch below this list).
  2. Give them a container with both the PL toolchains and the eval-harness dependencies installed, along with some instructions on how to run commands in the container.
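For point 1, here is a minimal sketch of what such an up-front check could look like. Only `multiple-py` appears elsewhere in this thread; the other task names and the task-to-tool mapping are illustrative assumptions, not the harness's actual configuration:

```python
import shutil

# Illustrative mapping of task name -> toolchain binary it needs on PATH.
# Only "multiple-py" is taken from this thread; the other entries are guesses.
REQUIRED_TOOLS = {
    "multiple-py": "python3",
    "multiple-java": "javac",
    "multiple-cpp": "g++",
    "multiple-rs": "rustc",
}

def check_toolchain(task: str) -> None:
    """Fail fast with a clear message instead of erroring mid-evaluation."""
    tool = REQUIRED_TOOLS.get(task)
    if tool and shutil.which(tool) is None:
        raise RuntimeError(
            f"Task '{task}' needs `{tool}` on PATH; install it locally or run "
            "the evaluation inside the container mentioned in point 2."
        )

if __name__ == "__main__":
    check_toolchain("multiple-java")
```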
loubnabnl commented 1 year ago

Yes, exactly! I'll upload some code and instructions for using the container.

ytzi commented 1 year ago

Re: performance issues.

I have obtained the following results for Python and Java on HumanEval:

| Language | pass@1 (temp 0.2) | pass@10 (temp 0.8) | pass@100 (temp 0.8) |
|----------|-------------------|--------------------|---------------------|
| Python   | 0.181             | 0.284              | 0.466               |
| Java     | 0.143             | 0.252              | 0.416               |

These are pretty consistent with the previously self-reported numbers (each off by less than 0.02).
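For reference, pass@k numbers like these are typically computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A small sketch in pure Python; the function name and the example counts are illustrative only:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k), where n is the
    number of samples per problem and c is the number that pass the tests."""
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts: 200 generations for one problem, 60 of them correct.
print(round(pass_at_k(200, 60, 1), 3))   # 0.3 (= c / n)
print(round(pass_at_k(200, 60, 10), 3))  # ≈ 0.974
```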

loubnabnl commented 1 year ago

After this fix, this implementation matches the original MultiPL-E on all scores, including pass@100:

{
  "multiple-py": {
    "pass@10": 0.29917045146858745,
    "pass@100": 0.4996997700167089
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.8,
    "n_samples": 200
  }
}
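As a small usage note, the dumped metrics are plain JSON and easy to post-process. A sketch, assuming the dictionary above was saved to a file named `results.json` (the file name is an assumption):

```python
import json

# Load the metrics dictionary shown above; adjust the path to wherever
# the harness wrote its output (the file name here is an assumption).
with open("results.json") as f:
    results = json.load(f)

for metric, score in results["multiple-py"].items():
    print(f"{metric}: {score:.3f}")

cfg = results["config"]
print(f"model={cfg['model']}  temperature={cfg['temperature']}  n_samples={cfg['n_samples']}")
```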

merging the PR 🥳