Hi, we've found that Phind models need to have a `\n` at the end of the prompt, while the humaneval task strips the prompt by default. Try the humaneval-unstripped task instead.
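For example, something along these lines should pick up the unstripped prompts (only --model and --tasks come from this thread; the launch command and the extra flag are my assumption of the usual harness invocation and may differ by harness version):

```bash
accelerate launch main.py \
  --model Phind-CodeLlama-34B-v2 \
  --tasks humaneval-unstripped \
  --allow_code_execution
```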
If you check the BigCode Leaderboard we got 71.95 pass@1 for this model: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
For MultiPL-E we explicitly made sure to have a `\n` at the end of prompts, but it's not implemented yet in main.
I'll check that out, thanks.
I evaluated Phind-CodeLlama-34B-v2 according to their recipe (using OpenAI's library), https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/blob/main/README.md#how-to-reproduce-humaneval-results, and got a score around 70. When trying to replicate with bigcode-evaluation-harness using `--model Phind-CodeLlama-34B-v2 --tasks multiple-py` and `--model Phind-CodeLlama-34B-v2 --tasks humaneval` (from my understanding those should be equivalent), I get scores around 34.
All three ways should, as far as I understand, yield similar scores. Does someone have an idea, or am I misunderstanding something here?
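For context on the stripping difference mentioned above, here is a minimal Python sketch (not harness code; the prompt is a made-up stand-in for a HumanEval problem) of what the model sees with and without the trailing newline:

```python
# Illustrative only: how stripping the prompt changes the model's starting point.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'

stripped = prompt.strip()   # the default humaneval task strips the prompt
unstripped = prompt         # humaneval-unstripped keeps the trailing "\n"

assert not stripped.endswith("\n")  # the model has to emit the newline itself
assert unstripped.endswith("\n")    # the model continues straight into the function body
```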