Code to generate data - Githubissues

Thank you for the best code model to date!

Would it be possible to share the pre-training data generation code?

Data Creation

Step 1: Collect code data from GitHub and filter it with the same rules as StarCoder Data.
Step 2: Parse the dependencies between files within the same repository and reorder the files so that each file comes after the files it depends on.
Step 3: Concatenate dependent files into a single example and apply repo-level MinHash for deduplication.
Step 4: Filter out remaining low-quality code, such as code with syntax errors or poor readability.
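Since the actual generation code is not shown in this thread, here is a minimal Python sketch of what Steps 2 to 4 could look like for a Python-only repository. It is an illustration under stated assumptions, not DeepSeek's pipeline: the function names (`local_imports`, `build_repo_example`, `minhash_signature`, `estimated_jaccard`), the `# file:` separator, the whitespace shingling, and the 64-hash signature are all hypothetical choices. Dependencies are approximated from `import` statements, and the quality filter only covers syntax errors, not readability.

```python
import ast
import hashlib
from graphlib import TopologicalSorter


def has_valid_syntax(source):
    """Step 4 (partial): True if the file parses as Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def local_imports(source, module_names):
    """Step 2: names of repo-local modules imported by `source`."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return deps & module_names


def build_repo_example(files):
    """Order files so dependencies come first, then concatenate them.

    `files` maps a module name (hypothetical convention: file name
    without ".py") to its source text. Circular imports make
    TopologicalSorter raise graphlib.CycleError; a real pipeline
    would need a policy for that case.
    """
    files = {n: s for n, s in files.items() if has_valid_syntax(s)}
    names = set(files)
    # Map each module to the set of modules it depends on.
    graph = {n: local_imports(s, names) - {n} for n, s in files.items()}
    order = list(TopologicalSorter(graph).static_order())
    text = "\n".join(f"# file: {n}.py\n{files[n]}" for n in order)
    return order, text


def minhash_signature(text, num_perm=64, shingle_size=5):
    """Step 3: MinHash signature over whitespace-token shingles."""
    tokens = text.split()
    last_start = max(1, len(tokens) - shingle_size + 1)
    shingles = {" ".join(tokens[i:i + shingle_size]) for i in range(last_start)}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

To use it, pass one repository's files to `build_repo_example`, compute `minhash_signature` on the concatenated text, and drop a repository when its `estimated_jaccard` against an already kept repository exceeds a threshold of your choosing. Comparing every pair does not scale to a full GitHub crawl; at that size, locality-sensitive hashing over the signatures is the usual approach.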

deepseek-ai / DeepSeek-Coder

Code to generate data #131