archiki opened this issue 3 months ago
Thanks for your interest! We are working on it right now and we will release the evaluation code soon.
Thanks @cgq15! In the meantime, could you share the generation configs used for the different task types? The numbers I have been able to reproduce are not consistent with Table 3 of your paper (see below). Since I am using the same prompts as listed in the paper, I suspect the discrepancy comes down to generation configs such as `temperature`, `top_p`, `top_k`, and `do_sample`.
Reproduction study:

| Dataset | Reported performance | Performance obtained | Settings |
|---|---|---|---|
| HumanEval | 55.5 | 47.56 | temp=0.2, top_p=0.9, top_k=50 |
| MATH | 32.6 | 29.6 | PoT, same parameters as above |
Note: I have loaded the model with 8-bit quantization.
Hi, we are releasing the eval code today, so please stay tuned.
For hyperparameters, we set `temperature=0`, i.e. greedy decoding, for all coding and math tasks. Also, we test the models with float16 weights.
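For reference, a minimal sketch of the two decoding setups discussed in this thread, expressed as HuggingFace-style `generate()` keyword arguments (the helper name is hypothetical; the values are the ones stated above — sampling with temp=0.2/top_p=0.9/top_k=50 in the reproduction attempt, versus the authors' greedy decoding):

```python
def generation_kwargs(greedy: bool) -> dict:
    """Return generate()-style kwargs for the two setups from this thread."""
    if greedy:
        # Authors' setup: temperature=0 is equivalent to greedy decoding,
        # which in HF transformers is expressed as do_sample=False.
        return {"do_sample": False}
    # Reproduction attempt's sampling setup.
    return {"do_sample": True, "temperature": 0.2, "top_p": 0.9, "top_k": 50}
```

Note that with `do_sample=False`, the `temperature`/`top_p`/`top_k` values are irrelevant, so a mismatch here could plausibly account for the gap in the table above. Precision also differs between the two runs (8-bit quantization vs float16).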
Hi @archiki,
We have released the eval code. Enjoy!
Thanks a lot! @lifan-yuan, can you clarify whether the performance reported on MBPP and HumanEval is on the regular test sets or the EvalPlus suites? TIA!
Could you add the code for reproducing the paper's main results on the various math and coding datasets, along with the prompts and data splits used?