Can you check what the generations look like to see if there's a failure pattern? You can save them by passing `--save_generations`.
In CodeCapybara, did you also use `huggyllama/llama-7b`, or the one they use in the repo, `decapoda-research/llama-7b-hf`?
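If it helps, one quick way to see which tokenizer config you're actually getting is to print each checkpoint's special token ids; a minimal sketch (assuming both repos still load under your installed `transformers` version):

```python
from transformers import AutoTokenizer

# Compare special token ids between the two LLaMA checkpoints.
# Note: decapoda-research/llama-7b-hf ships an outdated tokenizer
# config and may fail to load on newer transformers releases.
for name in ["huggyllama/llama-7b", "decapoda-research/llama-7b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.bos_token_id, tok.eos_token_id, tok.pad_token_id)
```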
Closing this issue as it was fixed with https://github.com/bigcode-project/bigcode-evaluation-harness/pull/81. Your command now returns:
{
"humaneval": {
"pass@1": 0.10518292682926832,
"pass@10": 0.1760411160613154
},
"config": {
"model": "huggyllama/llama-7b",
"temperature": 0.2,
"n_samples": 20
}
}
Running `human_eval` with Llama 7B gets 0 for pass@1 and pass@10, but it does achieve the correct values (pass@1 ≈ 10%) in other repos. To reproduce, simply run the harness on `humaneval` with Llama 7B; it returns 0 for both metrics. I reduced `n_samples` from 200 to 20 to make this run in about 1 hour on a single A100. Running the same `human_eval` with CodeCapybara's repo gets the correct values:
{'pass@1': 0.09542682926829267, 'pass@10': 0.12530930709402355}
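For context, pass@k here is presumably the unbiased estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k) averaged over problems, where n is the number of samples per problem and c the number that pass; a minimal sketch of that calculation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), computed stably as a product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With n_samples=20, a single passing sample for a problem already
# contributes 0.05 to that problem's pass@1, so an exact 0 averaged
# over all 164 HumanEval problems means no generation passed at all,
# which points to a generation/parsing failure rather than a weak model.
print(pass_at_k(20, 1, 1))   # 0.05
print(pass_at_k(20, 1, 10))  # 0.5
```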
Could be related to https://github.com/huggingface/transformers/pull/22402, although I tried explicitly setting the eos, bos, and pad token ids the same as CodeCapybara (see here) and didn't see a change, so it might be something else.
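For reference, the override I tried looks roughly like this (the specific ids, bos=1, eos=2, pad=0, are copied from CodeCapybara's setup and are an assumption here, not something the harness sets itself):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Pin LLaMA's special token ids explicitly, mirroring CodeCapybara.
# These values (bos=1, eos=2, pad=0) are carried over from that repo.
tokenizer.bos_token_id = 1
tokenizer.eos_token_id = 2
tokenizer.pad_token_id = 0
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```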
If anyone has successfully run it here, I would appreciate some tips!