deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License
6.6k stars, 461 forks

Code to generate data #131

Open tbressers opened 7 months ago

tbressers commented 7 months ago

Thank you for the best code model to date!

Would it be possible to share the pre-training data generation code?

Data Creation

Step 1: Collect code data from GitHub and filter it with the same rules as StarCoder Data.
Step 2: Parse the dependencies between files in the same repository and reorder the files so that each file comes after the files it depends on.
Step 3: Concatenate dependent files into a single example and apply repo-level MinHash for deduplication.
Step 4: Filter out remaining low-quality code, such as code with syntax errors or poor readability.
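For readers who want to experiment in the meantime, Steps 2 to 4 can be sketched roughly as below. This is not the DeepSeek pipeline, which has not been released; it is a minimal illustration that assumes Python-only repositories held in memory as `{path: source}` dictionaries. Every function name, the regex-based import detection, the 5-token shingles, the 64-hash signature, and the 0.85 threshold are assumptions chosen for the example (Python 3.9+).

```python
import ast
import hashlib
import re
from graphlib import TopologicalSorter


def order_by_dependencies(files):
    """Step 2 (sketch): put each file after the local modules it imports.

    `files` maps a relative path such as "pkg/utils.py" to its source text.
    Import detection is a crude regex; circular imports raise graphlib.CycleError.
    """
    modules = {p.removesuffix(".py").replace("/", "."): p for p in files}
    graph = {}
    for path, source in files.items():
        imported = re.findall(r"^\s*(?:from|import)\s+([\w.]+)", source, flags=re.M)
        graph[path] = {modules[m] for m in imported if m in modules and modules[m] != path}
    # TopologicalSorter expects node -> predecessors, so dependencies come first.
    return list(TopologicalSorter(graph).static_order())


def build_repo_example(files):
    """Step 3a (sketch): concatenate the ordered files into one training example."""
    ordered = order_by_dependencies(files)
    return "\n".join(f"# file: {path}\n{files[path]}" for path in ordered)


def minhash_signature(text, num_perm=64, shingle=5):
    """Step 3b (sketch): MinHash over word shingles, using salted BLAKE2b hashes."""
    tokens = text.split()
    count = max(1, len(tokens) - shingle + 1)
    shingles = {" ".join(tokens[i:i + shingle]) for i in range(count)}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in shingles))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(examples, threshold=0.85):
    """Keep the first repo of each near-duplicate group (quadratic, for clarity)."""
    kept, signatures = [], []
    for name, text in examples.items():
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(name)
            signatures.append(sig)
    return kept


def has_valid_syntax(source):
    """Step 4 (sketch): reject Python files that do not parse."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


if __name__ == "__main__":
    repo = {"main.py": "import utils\nprint(utils.f())\n",
            "utils.py": "def f():\n    return 1\n"}
    clean = {p: s for p, s in repo.items() if has_valid_syntax(s)}
    print(build_repo_example(clean))
```

At real scale the pairwise comparison in `deduplicate` would normally be replaced by locality-sensitive hashing over the signatures, and the "poor readability" part of Step 4 needs heuristics that this sketch does not attempt.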

guoday commented 6 months ago

Hello, there are currently no plans to open-source the pre-training code.