Closed — s-JoL closed this issue 4 months ago
The main reason is the inconsistency in scripts and the difference in model precision. In Appendix B, we evaluate the model during pre-training using our framework hai-llm instead of huggingface. However, after converting to a huggingface model, we use an open-source script for evaluation, which produces the results in Section 4.1. Overall, the scripts are slightly different, and the model precision also changes during the conversion process, leading to inconsistent results.
Thank you for your explanation.
Additionally, I have another question. I noticed that both the GitHub repository and the paper state that the models from 1.3B to 33B were trained on 2T tokens of data. However, in Appendix B, it seems that the 1.3B model was trained on only 1T tokens. Is the figure in Appendix B incomplete? @guoday
The 1.3B model is trained on 1T tokens of data.
Firstly, I would like to express my gratitude for your exceptional work. While reading your paper, I noticed a discrepancy between the results in Appendix B and those in Section 4.1.
In Appendix B, it is shown that for the 1.3B model, the performance on HumanEval is slightly below 30, and on MBPP, it is slightly below 40. However, in the main text, specifically in Section 4.1, the HumanEval score is reported as 34.8, and the MBPP score is 46.2. Could you please clarify how this discrepancy arises?