In our experiments, repo-level code slightly hurts benchmark performance. However, it enhances long-context modeling and repo-level completion.
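For readers following the thread, here is a minimal sketch of what repo-level concatenation typically looks like; the `<|file_sep|>` separator and path comments are illustrative assumptions, not DeepSeek's actual format:

```python
from pathlib import Path

def concat_repo(repo_dir: str, file_sep: str = "<|file_sep|>") -> str:
    """Join a repository's source files into one long training sample."""
    parts = []
    root = Path(repo_dir)
    for path in sorted(root.rglob("*.py")):
        code = path.read_text(encoding="utf-8", errors="ignore")
        # Prefix each file with its repo-relative path so the model can
        # learn cross-file references.
        parts.append(f"# {path.relative_to(root)}\n{code}")
    return file_sep.join(parts)
```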
@guoday Thanks for your reply. Another question: how many epochs do you train on the 2T-token dataset?
4-5 epochs
So the total number of training tokens is about 10T? Amazing
No. The total data is about 400-500B tokens. We train the model for 4-5 epochs, resulting in about 2T training tokens.
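For the arithmetic, a quick check that 400-500B unique tokens over 4-5 epochs lands at roughly 2T (midpoint values assumed):

```python
unique_tokens = 450e9  # midpoint of the stated 400-500B range
epochs = 4.5           # midpoint of the stated 4-5 epochs
total = unique_tokens * epochs
print(f"~{total / 1e12:.1f}T training tokens")  # -> ~2.0T
```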
That is about the same amount as the StarCoder dataset. Do you reuse The Stack dataset and apply a more advanced data-filtering algorithm, or do you download more code data than The Stack and implement stricter data-cleaning algorithms?
DeepSeek uses repo-level files, so it can't reuse The Stack dataset. What is the key contribution behind this amazing improvement over StarCoder: code-quality filtering, a second stage of pre-training, or something else?
The primary contributions come from many aspects, including data collection, deduplication, filtering, the total number of tokens for pre-training, and the methods of pre-training. The optimization of each process is essential for enhancing the model. We will analyze this further in our later technical reports.
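The exact filters aren't described in this thread; for context, here is a minimal sketch of the kind of heuristic quality filters commonly applied to code corpora. The thresholds and criteria are assumptions for illustration, not DeepSeek's pipeline:

```python
def passes_quality_filter(code: str,
                          max_line_len: int = 1000,
                          min_alnum_frac: float = 0.25) -> bool:
    """Heuristic code-quality filter; thresholds here are illustrative."""
    lines = code.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False  # very long lines often mean minified or generated code
    alnum = sum(c.isalnum() for c in code)
    return alnum / len(code) >= min_alnum_frac
```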
Thank you, looking forward to your technical reports
Hey, thanks for the great work! Can you share the scripts for dependency parsing and repo-level dedup?
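No official script is shared in this thread; below is a minimal sketch of one common approach, assuming Python repos: regex-parse imports to build a per-repo dependency graph, topologically order files so dependencies come before dependents, and use a content hash of the normalized concatenation as a repo-level dedup key. All names here are illustrative, not DeepSeek's implementation.

```python
import hashlib
import re
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+
from pathlib import Path

IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)

def order_files_by_imports(repo_dir: str) -> list[Path]:
    """Topologically order .py files so imported modules precede importers.

    Note: a real pipeline would also need to break import cycles.
    """
    root = Path(repo_dir)
    # Map dotted module names (e.g. pkg.utils) to their file paths.
    modules = {p.relative_to(root).with_suffix("").as_posix().replace("/", "."): p
               for p in root.rglob("*.py")}
    deps = defaultdict(set)
    for mod, path in modules.items():
        for target in IMPORT_RE.findall(path.read_text(errors="ignore")):
            # Keep only dependencies that resolve to files inside this repo.
            hit = next((m for m in modules
                        if target == m or target.startswith(m + ".")), None)
            if hit and hit != mod:
                deps[mod].add(hit)
    order = TopologicalSorter({m: deps.get(m, set()) for m in modules}).static_order()
    return [modules[m] for m in order]

def repo_fingerprint(ordered_files: list[Path]) -> str:
    """Hash of whitespace-normalized repo content, usable as a dedup key."""
    h = hashlib.sha256()
    for p in ordered_files:
        h.update(" ".join(p.read_text(errors="ignore").split()).encode())
    return h.hexdigest()
```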
Very impressive work. Does the repo-level code concatenation improve the HumanEval result?