In our experiments, repo-level code slightly hurts benchmark performance. However, it enhances long-context modeling and repo-level completion.
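For readers following the thread, here is a minimal sketch of what repo-level concatenation typically looks like; the `<|file_sep|>` separator and path comments are illustrative assumptions, not DeepSeek's actual format:

```python
from pathlib import Path

def concat_repo(repo_dir: str, file_sep: str = "<|file_sep|>") -> str:
    """Join a repository's source files into one long training sample."""
    parts = []
    root = Path(repo_dir)
    for path in sorted(root.rglob("*.py")):
        code = path.read_text(encoding="utf-8", errors="ignore")
        # Prefix each file with its repo-relative path so the model can
        # learn cross-file references.
        parts.append(f"# {path.relative_to(root)}\n{code}")
    return file_sep.join(parts)
```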
@guoday Thanks for your reply. Another question: how many epochs do you train on the 2T-token dataset?
4-5 epochs
So the total number of training tokens is about 10T? Amazing
No. The total data is about 400-500B tokens. We train the model for 4-5 epochs, resulting in about 2T training tokens.
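For the arithmetic, a quick check that 400-500B unique tokens over 4-5 epochs lands at roughly 2T (midpoint values assumed):

```python
unique_tokens = 450e9  # midpoint of the stated 400-500B range
epochs = 4.5           # midpoint of the stated 4-5 epochs
total = unique_tokens * epochs
print(f"~{total / 1e12:.1f}T training tokens")  # -> ~2.0T
```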
That is about the same amount as the StarCoder dataset. Do you reuse The Stack dataset and apply a more advanced data-filtering algorithm, or do you download more code data than The Stack and implement stricter data-cleaning algorithms?
DeepSeek uses repo-level files, so it can't reuse The Stack dataset. What is the key contribution behind this amazing improvement over StarCoder: code-quality filtering, a second stage of pre-training, or something else?
The primary contributions come from many aspects, including data collection, deduplication, filtering, the total number of tokens for pre-training, and the methods of pre-training. The optimization of each process is essential for enhancing the model. We will analyze this further in our later technical reports.
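The exact filters aren't described in this thread; for context, here is a minimal sketch of the kind of heuristic quality filters commonly applied to code corpora. The thresholds and criteria are assumptions for illustration, not DeepSeek's pipeline:

```python
def passes_quality_filter(code: str,
                          max_line_len: int = 1000,
                          min_alnum_frac: float = 0.25) -> bool:
    """Heuristic code-quality filter; thresholds here are illustrative."""
    lines = code.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False  # very long lines often mean minified or generated code
    alnum = sum(c.isalnum() for c in code)
    return alnum / len(code) >= min_alnum_frac
```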
Thank you, looking forward to your technical reports
Hey, thanks for the great work! Can you share the scripts for dependency parsing and repo-level dedup?
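No official script is shared in this thread; below is a minimal sketch of one common approach, assuming Python repos: regex-parse imports to build a per-repo dependency graph, topologically order files so dependencies come before dependents, and use a content hash of the normalized concatenation as a repo-level dedup key. All names here are illustrative, not DeepSeek's implementation.

```python
import hashlib
import re
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+
from pathlib import Path

IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)

def order_files_by_imports(repo_dir: str) -> list[Path]:
    """Topologically order .py files so imported modules precede importers.

    Note: a real pipeline would also need to break import cycles.
    """
    root = Path(repo_dir)
    # Map dotted module names (e.g. pkg.utils) to their file paths.
    modules = {p.relative_to(root).with_suffix("").as_posix().replace("/", "."): p
               for p in root.rglob("*.py")}
    deps = defaultdict(set)
    for mod, path in modules.items():
        for target in IMPORT_RE.findall(path.read_text(errors="ignore")):
            # Keep only dependencies that resolve to files inside this repo.
            hit = next((m for m in modules
                        if target == m or target.startswith(m + ".")), None)
            if hit and hit != mod:
                deps[mod].add(hit)
    order = TopologicalSorter({m: deps.get(m, set()) for m in modules}).static_order()
    return [modules[m] for m in order]

def repo_fingerprint(ordered_files: list[Path]) -> str:
    """Hash of whitespace-normalized repo content, usable as a dedup key."""
    h = hashlib.sha256()
    for p in ordered_files:
        h.update(" ".join(p.read_text(errors="ignore").split()).encode())
    return h.hexdigest()
```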
Very impressive work. Does the repo-level code concatenation improve the HumanEval result?