Code-Pile

This repository contains the processing scripts to scrape/process the code-pile dataset.

Project Description
How to use the Code-Pile (todo)
How to Contribute
Additional Resources

Project Description

Check out the Code-Pile proposal.

The Code-Pile will be released in the same way as "The Pile": as a folder of .jsonl.zst files (zstd-compressed JSON Lines). See lm-dataformat for the format.
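
To make the release format concrete, here is a minimal sketch of the JSON Lines layer only. It assumes each line is one JSON object with a `text` field and a `meta` field, which is the record shape lm-dataformat writes. The metadata keys (`source`, `license`) and the helper names `write_jsonl` and `read_jsonl` are made up for illustration. The zstd compression step is left out because it needs a third-party package, so this is not a replacement for lm-dataformat itself.

```python
import io
import json


def write_jsonl(records, stream):
    """Write one JSON object per line, each with 'text' and 'meta' keys."""
    for text, meta in records:
        stream.write(json.dumps({"text": text, "meta": meta}) + "\n")


def read_jsonl(stream):
    """Yield (text, meta) pairs back from a JSON Lines stream."""
    for line in stream:
        if line.strip():  # skip blank lines
            obj = json.loads(line)
            yield obj["text"], obj.get("meta", {})


# Hypothetical example documents; the metadata keys are assumptions.
records = [
    ("print('hello')", {"source": "example", "license": "MIT"}),
    ("int main() { return 0; }", {"source": "example", "license": "Apache-2.0"}),
]

# An in-memory buffer stands in for an uncompressed .jsonl file.
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_trip = list(read_jsonl(buffer))
```

In the real release each such file would additionally be zstd-compressed, giving the .jsonl.zst extension.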

How to use the Code-Pile

The dataset is not finished yet. Ask on Discord for the current status.

How to Contribute

Think about which code data would be most useful for the next generation of textual code models.

The most valuable dataset properties (use your own judgment) are:

Open License
Data quality
Dataset size
Data variance/variety/nicheness
Ease of obtaining/processing

To add a new dataset, open an issue using the dataset-request template. Gather all the relevant information in it and use the issue to track progress.

Check whether code already exists or someone is already working on it (see Additional Resources):

EleutherAI's Pile V1 repos
Ask on Carper #code-pile
Ask on Eleuther
Consult the linked Spreadsheets below

Then implement it through the following steps:

Fork this repo
Use the working branch
Read the shared classes in datasets.py and codepile.py
Create an MVP/example for your dataset
Create a pull request
Keep building the data-domain specific classes and repeat
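
As a rough idea of what an MVP for one dataset could look like, here is a hedged sketch. It does not reproduce the shared classes in datasets.py or codepile.py; read those first and follow their real interface. Everything here (`SketchDataset`, `ToySnippets`, `download`, `documents`) is a hypothetical name chosen for illustration, and the "download" step just fills in two hard-coded snippets.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class SketchDataset(ABC):
    """Hypothetical base class, NOT the repository's actual interface."""

    name: str = "unnamed"

    @abstractmethod
    def download(self) -> None:
        """Fetch the raw data for this source."""

    @abstractmethod
    def documents(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        """Yield (text, meta) pairs ready to be written out."""


class ToySnippets(SketchDataset):
    """Toy data source with hard-coded content instead of a real scrape."""

    name = "toy-snippets"

    def __init__(self) -> None:
        self._raw: List[Tuple[str, str]] = []

    def download(self) -> None:
        # Stand-in for a real scrape or API call.
        self._raw = [
            ("a.py", "print('hello')"),
            ("b.c", "int main() { return 0; }"),
        ]

    def documents(self) -> Iterator[Tuple[str, Dict[str, str]]]:
        for path, code in self._raw:
            yield code, {"source": self.name, "path": path}


dataset = ToySnippets()
dataset.download()
docs = list(dataset.documents())
```

Starting with something this small makes the first pull request easy to review. The data-domain specific logic can then grow in later iterations.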

Citation Placeholder:

@misc{Code-Pile,
  author = {},
  doi = {},
  month = {},
  title = {},
  url = {https://github.com/CarperAI/Code-Pile},
  version = {},
  year = {2022}
}

Additional Resources

Preliminary spreadsheet of useful resources

Closely related projects:

Previous work:

[CodeParrot](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot)
...
