Hi @loubnabnl,
What performance do you get with the CodeLlama models on HumanEvalSynthesize?
Surprisingly, with CodeLlama-7B-Instruct and CodeLlama-13B-Instruct, I observe better numbers on HumanEvalSynthesize than those reported in the CodeLlama paper.
These are the pass@1 scores the CodeLlama paper reports for the Instruct models:
![image](https://github.com/bigcode-project/bigcode-evaluation-harness/assets/10473221/ed731d8e-09dd-432c-b166-b257d4f79997)
I observe pass@1 scores of 47 and 50.6 with the 7B and 13B Instruct models, respectively.
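For context on the numbers being compared: pass@1 is the unbiased estimator from the Codex paper, averaged over problems. A minimal sketch, assuming `n` generated samples per problem of which `c` pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per problem, 5 of them passing -> pass@1 = 0.5
print(pass_at_k(n=10, c=5, k=1))
```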
Could this be due to better post-processing in this library? (I am assuming HumanEval and HumanEvalSynthesize are the same benchmark.)
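To illustrate what I mean by post-processing: a hypothetical cleanup step that pulls the first fenced code block out of an instruct model's chat-style answer before running the tests. This is only my own sketch of the kind of step that could shift scores, not necessarily what this harness actually implements:

```python
import re

# Hypothetical illustration of post-processing for chat-style models;
# not necessarily what bigcode-evaluation-harness does internally.
def extract_code(completion: str) -> str:
    """Keep the first markdown-fenced code block if present, else the raw text."""
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    return (match.group(1) if match else completion).strip()

# Example: a chat model wraps its solution in a fenced block with chatter around it.
answer = "Sure! Here is the function:\n```python\ndef add(a, b):\n    return a + b\n```\nHope this helps!"
print(extract_code(answer))  # -> the bare function definition only
```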
Here is my output for the 13B model.
CC: @Muennighoff