Closed — s-JoL closed this issue 4 months ago
The main reason is the inconsistency in scripts and the difference in model precision. In Appendix B, we evaluate the model during pre-training using our framework hai-llm instead of huggingface. However, after converting to a huggingface model, we use an open-source script for evaluation, which produces the results in Section 4.1. Overall, the scripts are slightly different, and the model precision also changes during the conversion process, leading to inconsistent results.
Thank you for your explanation.
Additionally, I have another question. I noticed that both the GitHub repository and the paper state that the models from 1.3B to 33B were trained on 2T tokens of data. However, in Appendix B, it seems that the 1.3B model was trained on only 1T tokens. Is the figure in Appendix B incomplete? @guoday
The 1.3B model is trained on 1T tokens of data.
Firstly, I would like to express my gratitude for your exceptional work. While reading your paper, I noticed a discrepancy between the results in Appendix B and those in Section 4.1.
In Appendix B, it is shown that for the 1.3B model, the performance on HumanEval is slightly below 30, and on MBPP, it is slightly below 40. However, in the main text, specifically in Section 4.1, the HumanEval score is reported as 34.8, and the MBPP score is 46.2. Could you please clarify how this discrepancy arises?