archiki opened this issue 3 months ago
Thanks for your interest! We are working on it right now and we will release the evaluation code soon.
Thanks @cgq15! In the meantime, could you share the generation configs used for the different task types? The numbers I have been able to reproduce are not consistent with Table 3 of your paper (see below). Since I am using the same prompts as listed in the paper, I suspect the discrepancy comes down to generation configs such as `temperature`, `top_p`, `top_k`, and `do_sample`.
Reproduction study:

| Dataset | Reported performance | Performance obtained | Settings |
|---|---|---|---|
| HumanEval | 55.5 | 47.56 | temp=0.2, top_p=0.9, top_k=50 |
| MATH | 32.6 | 29.6 | PoT, same parameters as above |
Note: I have loaded the model with 8-bit quantization.
Hi, we are releasing the eval code today, so please stay tuned.
For hyperparameters, we set `temperature=0`, i.e. greedy decoding, for all coding and math tasks. Also, we test the models with float16 weights.
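For reference, a minimal sketch of the two decoding setups discussed in this thread, expressed as HuggingFace-style `generate()` keyword arguments (the helper name is hypothetical; the values are the ones stated above — sampling with temp=0.2/top_p=0.9/top_k=50 in the reproduction attempt, versus the authors' greedy decoding):

```python
def generation_kwargs(greedy: bool) -> dict:
    """Return generate()-style kwargs for the two setups from this thread."""
    if greedy:
        # Authors' setup: temperature=0 is equivalent to greedy decoding,
        # which in HF transformers is expressed as do_sample=False.
        return {"do_sample": False}
    # Reproduction attempt's sampling setup.
    return {"do_sample": True, "temperature": 0.2, "top_p": 0.9, "top_k": 50}
```

Note that with `do_sample=False`, the `temperature`/`top_p`/`top_k` values are irrelevant, so a mismatch here could plausibly account for the gap in the table above. Precision also differs between the two runs (8-bit quantization vs float16).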
Hi @archiki,
We have released the eval code. Enjoy!
Thanks a lot! @lifan-yuan, can you clarify whether the performance reported on MBPP and HumanEval is on the regular test sets or the EvalPlus suites? TIA!
Could you add the code for reproducing the paper's main results on the various math and coding datasets, along with the prompts and data splits used?