Would it be possible to share the pre-training data generation code?
Data Creation
Step 1: Collect code data from GitHub and filter it using the same rules as StarCoder Data.
Step 2: Parse the dependencies between files within the same repository and reorder the files according to those dependencies.
Step 3: Concatenate dependent files into a single example and apply repo-level MinHash for deduplication.
Step 4: Filter out low-quality code, such as code with syntax errors or poor readability.
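To make the request concrete, here is my rough guess at what Steps 2 to 4 could look like. It is only a sketch under my own assumptions, not the authors' actual pipeline: it handles Python files only, a repository is a plain dict of path to source text, and every function name, the 0.85 similarity threshold, and the 200-character line limit are hypothetical choices of mine. Step 1 (the StarCoder Data filtering rules) is not reproduced.

```python
import ast
import hashlib
from graphlib import TopologicalSorter  # Python 3.9+


def local_imports(source, module_names):
    """Return the same-repo modules that `source` imports (empty if unparsable)."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return set()
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)  # relative "from . import x" is skipped
    return found & module_names


def order_files(repo):
    """Step 2: list file paths so that dependencies come before their importers."""
    modules = {p[:-3].replace("/", "."): p for p in repo if p.endswith(".py")}
    graph = {
        path: {modules[m] for m in local_imports(src, set(modules))} - {path}
        for path, src in repo.items()
    }
    # Raises graphlib.CycleError on circular imports; a real pipeline needs a fallback.
    return list(TopologicalSorter(graph).static_order())


def concat_repo(repo, order):
    """Step 3a: join the ordered files into one training example."""
    return "\n".join(f"# file: {path}\n{repo[path]}" for path in order)


def minhash(text, num_perm=64, k=5):
    """Step 3b: MinHash signature over k-token shingles (salted BLAKE2b hashes)."""
    tokens = text.split()
    shingles = {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest()
            for s in shingles
        ))
    return signature


def dedup(examples, threshold=0.85):
    """Step 3c: keep an example only if no kept example is a near-duplicate."""
    kept, signatures = [], []
    for text in examples:
        sig = minhash(text)
        similar = any(
            sum(a == b for a, b in zip(sig, other)) / len(sig) >= threshold
            for other in signatures
        )
        if not similar:
            kept.append(text)
            signatures.append(sig)
    return kept


def passes_quality(source, max_line_len=200):
    """Step 4: reject code that fails to parse or has very long lines."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return all(len(line) <= max_line_len for line in source.splitlines())
```

For a single repo I would call `order_files(repo)`, pass the result to `concat_repo`, run `dedup` over the concatenated examples from all repos, and finally keep only those where `passes_quality` returns True. The pairwise comparison in `dedup` is quadratic, so at GitHub scale it would presumably be replaced by LSH bucketing; I don't know what the real pipeline uses, or how it defines "poor readability".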
Thank you for the best code model to date!