CarperAI / Code-Pile

This repository contains all the code for collecting code from GitHub at large scale.
MIT License

GitHub Diffs #36

Open herbiebradley opened 1 year ago

herbiebradley commented 1 year ago

A scraper for GitHub diffs. It takes a JSONL file as input in which each line describes one commit: the hash, the commit message, and the repository name as a string.
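To make the input format concrete, here is a minimal sketch of reading such a JSONL file and deriving a diff URL for each commit. The field names `repo_name`, `commit`, and `message` and both helper functions are hypothetical, chosen only for illustration; the PR's actual schema and fetching logic may differ. The sketch relies on GitHub serving a plain-text diff when `.diff` is appended to a commit URL.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_commit_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty JSONL line.

    Assumed (hypothetical) keys: "repo_name", "commit", "message".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        yield json.loads(line)


def diff_url(repo_name: str, commit_hash: str) -> str:
    """Build the URL of the plain-text diff for a commit.

    repo_name is expected in "owner/repo" form.
    """
    return f"https://github.com/{repo_name}/commit/{commit_hash}.diff"


if __name__ == "__main__":
    sample = [
        '{"repo_name": "octocat/Hello-World", "commit": "abc123", "message": "Fix typo"}',
    ]
    for record in iter_commit_records(sample):
        print(diff_url(record["repo_name"], record["commit"]))
```

Keeping the parsing and the URL construction as separate pure functions makes both easy to test without network access.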

It uses PyArrow via Dask to write Parquet files, which makes the scraper easy to parallelise and keeps memory usage low.

See #31

LouisCastricato commented 1 year ago

Make sure to end your file with a newline.

LouisCastricato commented 1 year ago

LGTM, @ncoop57 can you check?

ncoop57 commented 1 year ago

Looks good @herbiebradley. The only thing still needed is a minimal test, run with pytest, that uses a dummy parquet file: https://docs.pytest.org/en/7.1.x/getting-started.html. We want to make sure we don't have bugs. Also, could you enable maintainer edits for the PR so that I can modify something quickly if I need to? https://github.blog/2016-09-07-improving-collaboration-with-forks/

ncoop57 commented 1 year ago

@herbiebradley @reshinthadithyan This is looking pretty solid. Could you add a quick test so that I can merge?

herbiebradley commented 1 year ago

@reshinthadithyan you might want to add your scripts to this branch before we merge?