abacaj / code-eval

Run evaluation on LLMs using the HumanEval benchmark
MIT License
362 stars · 34 forks

Is llama2-7B-chat weaker than llama2-7B? #13

Open sunyuhan19981208 opened 8 months ago

sunyuhan19981208 commented 8 months ago

I got only 9.7% for llama2-7B-chat on HumanEval using your script:

{'pass@1': 0.0975609756097561}
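
For reference, HumanEval has 164 problems, so when one sample is generated per problem, pass@1 is simply the fraction of problems whose completion passes all unit tests. The reported score is consistent with 16 of the 164 problems passing; a minimal sketch of that arithmetic (the pass count is inferred from the score, not from the actual run):

```python
# pass@1 with one sample per problem is the fraction of problems
# whose generated solution passes all of its unit tests.
num_problems = 164  # size of the HumanEval benchmark
num_passed = 16     # inferred count consistent with the reported score
pass_at_1 = num_passed / num_problems
print({"pass@1": pass_at_1})  # {'pass@1': 0.0975609756097561}
```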
abacaj commented 8 months ago

Hi, I think you will have to make sure the prompt template is correct.
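
For context, llama2-7B-chat is fine-tuned with the Llama-2 chat format (`[INST] ... [/INST]`, optionally with a `<<SYS>>` system block), so raw HumanEval prompts generally need to be wrapped in that format before generation. Below is a minimal sketch of such a wrapper; the `[INST]`/`<<SYS>>` markup is the documented Llama-2 chat format, but the system prompt and instruction wording are placeholders, not the repo's actual template:

```python
def build_llama2_chat_prompt(problem_prompt: str) -> str:
    """Wrap a raw HumanEval prompt in the Llama-2 chat format.

    The [INST]/<<SYS>> markup follows the documented Llama-2 chat
    format; the system prompt and instruction text are illustrative.
    """
    system = "You are a helpful coding assistant."  # placeholder system prompt
    return (
        "[INST] <<SYS>>\n"
        f"{system}\n"
        "<</SYS>>\n\n"
        "Complete the following Python function:\n\n"
        f"{problem_prompt} [/INST]"
    )
```

Note that the BOS token (`<s>`) is omitted here, since most tokenizers prepend it automatically; whether the raw completion body or a chat-style wrapped prompt scores better is exactly the kind of difference that can explain a gap between the chat and base models on HumanEval.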