CarperAI / Code-Pile

This repository contains all the code for collecting large amounts of code from GitHub.
MIT License

GitHub Diffs #31

Open herbiebradley opened 1 year ago

herbiebradley commented 1 year ago

GitHub Diffs

Description

Dataset is on BigQuery as a table of commit hashes and messages.

Procedure

From each commit hash and message, produce a dict containing the fields shown in the Example section: before_file, commit_message, and diff.

For each commit, this requires downloading the files as they are after the changes and applying the reverse patch to obtain the files as they were before the changes.
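The reverse-patch step described above can be sketched with the standard library alone. This is a minimal illustration, not the code from the linked gist: the helper name `reverse_apply` is hypothetical, and it assumes a unified diff that covers a single file, with file contents held as lists of lines without trailing newlines.

```python
import re

# Matches a unified-diff hunk header such as "@@ -3,7 +3,7 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_lines, diff_text):
    """Rebuild the pre-commit lines from the post-commit lines and a
    single-file unified diff (hypothetical helper, not the gist's code)."""
    before = []
    cursor = 0        # 0-based index of the next uncopied line in after_lines
    in_hunk = False
    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            tgt_start = int(header.group(3))
            tgt_len = 1 if header.group(4) is None else int(header.group(4))
            # A zero-length target range names the line *before* the hunk.
            hunk_start = tgt_start if tgt_len == 0 else tgt_start - 1
            before.extend(after_lines[cursor:hunk_start])  # untouched lines
            cursor = hunk_start
            in_hunk = True
        elif not in_hunk:
            continue                  # "---" / "+++" file headers
        elif line.startswith("+"):
            cursor += 1               # added by the commit: skip it
        elif line.startswith("-"):
            before.append(line[1:])   # removed by the commit: restore it
        elif line.startswith(" ") or line == "":
            before.append(line[1:])   # context line: present in both versions
            cursor += 1
        # Anything else (e.g. "\ No newline at end of file") is ignored.
    before.extend(after_lines[cursor:])
    return before


# Toy input modelled on the version-bump commit in the Example section.
after_lines = ["setup(", "  version = '0.26.3',", "  license='MIT',", ")"]
diff_text = "\n".join([
    "--- a/setup.py",
    "+++ b/setup.py",
    "@@ -1,3 +1,3 @@",
    " setup(",
    "-  version = '0.26.1',",
    "+  version = '0.26.3',",
    "   license='MIT',",
])
before_lines = reverse_apply(after_lines, diff_text)
```

Walking the hunks against the post-commit file means only the "after" version has to be downloaded; a real pipeline would likely lean on `git apply -R` or a diff-parsing library instead of this hand-rolled parser.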

We also need to decide on a suitable length threshold to filter on, since we need to include most or all of the before file in the context window, which significantly restricts the number of lines we can keep.

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a

Example

Give an example of the columns and data:

before_file: ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ]

commit_message: Change version

diff: [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}]

reshinthadithyan commented 1 year ago

What will be the filtering criteria for repositories we're going to index for scraping diffs?

>10 GitHub stars
>2 commits
Must have a liberal license
Exclude forks

cc @ncoop57, @herbiebradley
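As a rough sketch, the four criteria above could be expressed as a single predicate over repository metadata. The field names (`stars`, `commits`, `license`, `is_fork`) and the set of licenses treated as liberal are assumptions made for illustration; they are not taken from the BigQuery table.

```python
# Assumed set of SPDX identifiers treated as "liberal"; adjust as needed.
PERMISSIVE_LICENSES = {
    "mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense",
}


def keep_repo(repo):
    """Return True if a repo metadata dict passes all four criteria.

    The keys used here are hypothetical names for illustration only.
    """
    license_id = (repo.get("license") or "").lower()
    return (
        repo.get("stars", 0) > 10                # >10 GitHub stars
        and repo.get("commits", 0) > 2           # >2 commits
        and license_id in PERMISSIVE_LICENSES    # liberal license
        and not repo.get("is_fork", False)       # exclude forks
    )


repos = [
    {"name": "a", "stars": 120, "commits": 40, "license": "MIT", "is_fork": False},
    {"name": "b", "stars": 5, "commits": 40, "license": "MIT", "is_fork": False},
    {"name": "c", "stars": 50, "commits": 9, "license": "GPL-3.0", "is_fork": False},
    {"name": "d", "stars": 50, "commits": 9, "license": "Apache-2.0", "is_fork": True},
]
kept = [r["name"] for r in repos if keep_repo(r)]
```

In practice the same conditions would more likely be pushed into the BigQuery query itself, so that filtered-out repositories are never downloaded.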

herbiebradley commented 1 year ago

Yes, these seem like sensible criteria. I think that should be everything we need.

reshinthadithyan commented 1 year ago

By length criteria, do you mean the length of commit_message? If so, the table has a commit message column, so we can query it with length constraints.

herbiebradley commented 1 year ago

I meant the length of the combined data, but after checking with Louis we decided this doesn't need to be filtered, because the constraint is too variable and model-dependent.

So the criteria you mention above should be sufficient on their own.

herbiebradley commented 1 year ago

Updated to remove the Python-specific parts, to allow for scraping all languages.

ncoop57 commented 1 year ago

We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

herbiebradley commented 1 year ago

"We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words."

I discussed this with Joel, and we think that diffs which create files could be useful at some point in the future, and potentially those which delete files too. This is not necessarily for ELM replication, but for training refactoring models. Since this dataset could be used in several possible projects, I think it will help in the long term not to remove these from the scrape.
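One way to keep created and deleted files in the scrape while still letting downstream projects filter them is to tag each file diff with its change type. The sketch below assumes the per-file dict layout from the example (`src_file` and `tgt_file` keys) and relies on the git convention that a created file has `/dev/null` as its source and a deleted file has `/dev/null` as its target; whether the parser used in the scrape reports the paths exactly this way is an assumption.

```python
DEV_NULL = "/dev/null"


def change_type(file_diff):
    """Classify one per-file diff dict as 'create', 'delete' or 'modify'.

    Assumes 'src_file' / 'tgt_file' keys as in the example record.
    """
    if file_diff["src_file"] == DEV_NULL:
        return "create"
    if file_diff["tgt_file"] == DEV_NULL:
        return "delete"
    return "modify"


example = {"src_file": "a/setup.py", "tgt_file": "b/setup.py"}
label = change_type(example)
```

Storing the label as an extra column would let an ELM-style project select only `modify` rows without discarding the rest of the data.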

Filtering out unhelpful commit messages seems good, but I can think of some scenarios where short commit messages are helpful, so we need to decide carefully how to do that.
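To make the trade-off concrete, here is one possible heuristic, offered only as a starting point: drop messages whose first line is empty or matches a small blocklist of generic phrases, and otherwise require a minimum word count. The blocklist contents, the `min_words` default, and the function name are all assumptions, not an agreed rule.

```python
# Assumed blocklist of generic one-word messages; purely illustrative.
GENERIC_MESSAGES = {"update", "fix", "wip", "changes", "minor", "cleanup"}


def is_helpful_message(message, min_words=2):
    """Heuristic filter on the first line of a commit message."""
    stripped = message.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0]
    # Compare against the blocklist ignoring case and trailing punctuation.
    if first_line.lower().rstrip(".!") in GENERIC_MESSAGES:
        return False
    return len(first_line.split()) >= min_words


messages = ["Change version", "update", "", "Fix.", "Handle empty diff in parser"]
kept_messages = [m for m in messages if is_helpful_message(m)]
```

With `min_words=2`, a short but informative message such as "Change version" from the example survives, whereas a stricter threshold of three or more words would discard it.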