bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Score discrepancy with humaneval #159

Closed zhksh closed 10 months ago

zhksh commented 11 months ago

I evaluated Phind-CodeLlama-34B-v2 following their recipe (which uses OpenAI's library): https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/blob/main/README.md#how-to-reproduce-humaneval-results and got a score of around 70. When I try to replicate this with bigcode-evaluation-harness, using either `--model Phind-CodeLlama-34B-v2 --tasks multiple-py` or `--model Phind-CodeLlama-34B-v2 --tasks humaneval` (as I understand it, those should be equivalent), I get scores of around 34.

From my understanding, all three approaches should yield similar scores. Does anyone have an idea what is going on, or am I misunderstanding something here?
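For reference, a sketch of what the harness run might look like, assuming the repo's `main.py` entry point; the generation flags beyond `--model`/`--tasks` (temperature, number of samples, batch size) are illustrative values, not settings reported in this thread:

```bash
# Sketch of the humaneval run; swap --tasks humaneval for --tasks multiple-py
# to get the MultiPL-E Python variant described above.
accelerate launch main.py \
    --model Phind/Phind-CodeLlama-34B-v2 \
    --tasks humaneval \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 10 \
    --allow_code_execution
```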

loubnabnl commented 11 months ago

Hi, we've found that Phind models need a `\n` at the end of the prompt, while the humaneval task strips the prompt by default. Try the humaneval-unstripped task instead.

If you check the BigCode Leaderboard, we got 71.95 pass@1 for this model: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard

For MultiPL-E, we explicitly made sure prompts end with a `\n`, but that change isn't in main yet.
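A sketch of the suggested fix, under the same assumptions as the command above: the humaneval-unstripped task keeps the prompt as written, so the trailing `\n` that Phind-CodeLlama models expect is preserved instead of being stripped away.

```bash
# Same run as before, only the task name changes.
accelerate launch main.py \
    --model Phind/Phind-CodeLlama-34B-v2 \
    --tasks humaneval-unstripped \
    --temperature 0.2 \
    --n_samples 20 \
    --batch_size 10 \
    --allow_code_execution
```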

zhksh commented 11 months ago

I'll check that out, thanks.