In your blog and paper you report an 88.4 score for the 7B instruction-tuned model on the HumanEval benchmark. However, `evaluation/eval_plus/released/results/humaneval/codeqwen_chat.txt` shows only 0.835 (i.e., 83.5, roughly 5 points lower), and I get exactly the same number when running the evaluation locally. Could you please elaborate on this discrepancy and help me reproduce the reported numbers?
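For reference, here is roughly how I produced the 0.835 locally, in case the gap comes from my setup. This is a minimal sketch against the public EvalPlus API; the model id, chat-template usage, greedy decoding, and `max_new_tokens` are my assumptions and may well differ from the settings behind the reported 88.4:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from evalplus.data import get_human_eval_plus, write_jsonl

# Assumption: this is the 7B instruction-tuned checkpoint the blog refers to.
model_id = "Qwen/CodeQwen1.5-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

samples = []
for task_id, problem in get_human_eval_plus().items():
    # Wrap the raw HumanEval prompt in the chat template, since this is the chat model.
    messages = [{"role": "user", "content": problem["prompt"]}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Greedy decoding; I did not tune sampling parameters.
    out = model.generate(inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
    # "solution" = full function body as emitted by the chat model.
    samples.append({"task_id": task_id, "solution": completion})

write_jsonl("samples.jsonl", samples)
```

I then sanitized and scored with the standard EvalPlus CLI (`evalplus.sanitize --samples samples.jsonl`, then `evalplus.evaluate --dataset humaneval --samples samples-sanitized.jsonl`), which reports the same 0.835 pass@1. If the reported 88.4 used different prompting, decoding, or post-processing, could you share those details?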