abacaj / code-eval

Run evaluation on LLMs using human-eval benchmark
MIT License

Any plans on running evals for codellama? #11

Open · ErikBjare opened this issue 10 months ago

ErikBjare commented 10 months ago

I'm keeping https://github.com/ErikBjare/are-copilots-local-yet up to date, and would love to see some codellama numbers given that it's now SOTA :)

nicoladainese96 commented 10 months ago

I would be interested in this as well. I made some attempts on my own with the Python-7B and Instruct-7B models, but if I use the same code as for Llama-2 the performance is terrible (e.g., 3% and 8% respectively). As a comparison, with the exact same code, Llama-2-chat-7b gives me 11%.
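One plausible explanation, not confirmed anywhere in this thread: the base CodeLlama and CodeLlama-Python checkpoints are plain completion models, not chat models, so wrapping the HumanEval prompt in Llama-2's [INST] chat template tends to hurt them, while feeding the raw signature-plus-docstring usually works better. A minimal sketch, assuming the HuggingFace transformers API; the model id and toy prompt below are illustrative, not this repo's evaluation code:

```python
# Minimal sketch: completion-style prompting for a base CodeLlama model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Python-hf"  # base model, no chat tuning
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# A HumanEval-style prompt: function signature plus docstring.
prompt = 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n'

# Base/Python CodeLlama completes the raw prompt directly; the Llama-2
# chat wrapper ([INST] ... [/INST]) is what chat-tuned models expect and
# tends to degrade base models.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
completion = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(prompt + completion)
```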

smart-lty commented 6 months ago

> I would be interested in this as well. I made some attempts on my own with the Python-7B and Instruct-7B models, but if I use the same code as for Llama-2 the performance is terrible (e.g., 3% and 8% respectively). As a comparison, with the exact same code, Llama-2-chat-7b gives me 11%.

I'm running into the same situation. Even when I use the instructions in "core/prompts.py", the performance for codellama-7b is 22.8% pass@1, still lower than the officially reported number by a large margin. Have you fixed this problem?
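For what it's worth, a common source of low execution-based scores is post-processing rather than the model itself: instruct-tuned models often echo the prompt, wrap the function in markdown fences, or keep talking after the body, all of which break pass@1 scoring. A hedged sketch of such an extraction step (illustrative only; this is not the repo's actual code, and the stop sequences are assumptions):

````python
import re

# Hypothetical stop sequences typical of HumanEval-style evaluation.
STOP_SEQUENCES = ["\ndef ", "\nclass ", "\nif __name__", "\nprint("]

def extract_completion(generated: str, prompt: str) -> str:
    """Pull runnable code out of a chatty response and truncate it."""
    # Drop the echoed prompt, if present.
    text = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Prefer the contents of a fenced code block, if one exists.
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    if match:
        text = match.group(1)
    # Cut at the first stop sequence so trailing text doesn't execute.
    for stop in STOP_SEQUENCES:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text
````

Inspecting a handful of raw generations for fences or an echoed prompt is usually the quickest way to tell whether a gap like this comes from prompting or from post-processing.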