This repository contains the processing scripts to scrape/process the code-pile dataset.
Check out the Code-Pile proposal.
The Code-Pile will be released in the same way as The Pile: as a folder of .jsonl.zst files (see lm-dataformat).
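As a rough illustration of what such a file contains, the sketch below writes and reads JSON Lines records using only the Python standard library. It assumes the `text`/`meta` record shape used by lm-dataformat. The helper names `write_jsonl` and `read_jsonl`, the sample records, and the metadata keys are hypothetical and not part of this repository. Zstandard compression is left out because it needs a third-party package.

```python
import io
import json


def write_jsonl(records, stream):
    """Write one JSON object per line to a text stream."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(stream):
    """Yield one decoded record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample records; the metadata keys are illustrative only.
records = [
    {"text": "print('hello')", "meta": {"source": "example", "language": "python"}},
    {"text": "int main() { return 0; }", "meta": {"source": "example", "language": "c"}},
]

# An in-memory buffer stands in for an uncompressed .jsonl file.
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = list(read_jsonl(buffer))
```

In a real pipeline the same lines would be passed through a Zstandard compressor before being written to disk, which is what lm-dataformat is expected to handle.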
It is not finished yet; ask on Discord.
Think about the most useful code data for the next generation of textual code models.
The most valuable dataset properties (use your own judgment) are:
To add a new dataset, open an issue using the dataset-request
template. Gather all the related information in it, and use the issue to track progress.
Check whether code already exists or someone is already working on it: see Additional Resources.
Then implement it through the following steps:
working branch
datasets.py and codepile.py
Citation Placeholder:
@misc{Code-Pile,
author = {},
doi = {},
month = {},
title = {},
url = {https://github.com/CarperAI/Code-Pile},
version = {},
year = {2022}
}
Closely related projects:
Previous work: