This pulls out the code that we normally run inside the MultiPL-E evaluation container.
I think the easiest way to address the dependency problem is the following:
Yes, exactly! I'll upload some code and instructions for using the container.
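For context, a minimal sketch of what running the execution step through the container could look like from Python; the image tag, mount layout, and CLI flags below are placeholders for illustration, not the project's actual values:

```python
import subprocess

# Hypothetical image tag; the real MultiPL-E evaluation image may be named differently.
IMAGE = "multipl-e-evaluation:latest"

def evaluate_in_container(generations_dir: str) -> None:
    """Execute the generated programs inside the container, so language
    toolchains (javac, node, ...) need not be installed on the host."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",                 # sandbox: no network access
            "-v", f"{generations_dir}:/inputs",  # mount generations into the container
            IMAGE,
            # Placeholder flags standing in for the container's real CLI.
            "--dir", "/inputs",
            "--output-dir", "/inputs",
        ],
        check=True,
    )
```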
Re: performance issues.
I have obtained the following results for Python and Java on HumanEval:
Python:
- pass@1: 0.181 (temp 0.2)
- pass@10: 0.284 (temp 0.8)
- pass@100: 0.466 (temp 0.8)

Java:
- pass@1: 0.143 (temp 0.2)
- pass@10: 0.252 (temp 0.8)
- pass@100: 0.416 (temp 0.8)
These are pretty consistent with the previously self-reported numbers (each off by less than 0.02).
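For reference, these scores use the standard unbiased pass@k estimator from the Codex paper, 1 - C(n - c, k) / C(n, k); a minimal, self-contained sketch (the function name is chosen here for illustration):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass the tests.
    Computes 1 - C(n - c, k) / C(n, k) as a numerically stable product."""
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, `pass_at_k(200, 40, 10)` estimates pass@10 for a problem where 40 of 200 completions passed.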
After this fix, this implementation matches the original MultiPL-E on all scores, including pass@100:
```json
{
  "multiple-py": {
    "pass@10": 0.29917045146858745,
    "pass@100": 0.4996997700167089
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.8,
    "n_samples": 200
  }
}
```
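Given `n_samples: 200` per problem, the pass@10 and pass@100 values above are, per the usual HumanEval methodology, averages of the per-problem estimator; a sketch reusing the `pass_at_k` function from the earlier snippet, with hypothetical correctness counts:

```python
# Hypothetical (n_samples, n_correct) pairs; real counts come from executing
# each problem's 200 generated programs against its unit tests.
per_problem = [(200, 63), (200, 12), (200, 158)]

for k in (10, 100):
    score = sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
    print(f"pass@{k}: {score:.4f}")
```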
Merging the PR 🥳
Integration of the MultiPL-E HumanEval version in 18 programming languages