CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License
105 stars 29 forks

GitHub Issues #35

Open ncoop57 opened 1 year ago

ncoop57 commented 1 year ago

GitHub Issues

Dataset URL - here

Does the dataset exist in a scraped format?
URL if yes - here (only for HF datasets repository)

Description

GitHub Issues are bug reports, feature requests, and discussions related to a repository. They contain text in a special GitHub markdown format, along with comments and reactions.

Procedure

We can use the procedure discussed in this blog post, which outlines how to do it for a specific repository. We just need to apply the same procedure to multiple repositories.

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

| issue_post | comments | authors | reactions |
| --- | --- | --- | --- |
| issue_text | [comment_1, comment_2, ...] | [issue_author, comment_1_author, comment_2_author, ...] | [[reactions], [reactions], ...] |

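
As an illustration of the column layout above, here is a minimal sketch that builds one example row and checks that the list columns line up. The helper names `build_row` and `validate_row` and the sample values are hypothetical, and the choice to align `reactions` with `authors` (issue post first, then one entry per comment) is an assumption. Writing the row to dummy_dataset.parquet is left out because it depends on the Parquet library used.

```python
from typing import Any, Dict, List, Tuple


def build_row(
    issue_text: str,
    issue_author: str,
    issue_reactions: List[str],
    comments: List[Tuple[str, str, List[str]]],
) -> Dict[str, Any]:
    """Build one row; each comment is a (text, author, reactions) tuple."""
    return {
        "issue_post": issue_text,
        "comments": [text for text, _, _ in comments],
        # Issue author first, then one author per comment.
        "authors": [issue_author] + [author for _, author, _ in comments],
        # Assumption: reactions for the issue post first, then one list per comment.
        "reactions": [issue_reactions] + [reacts for _, _, reacts in comments],
    }


def validate_row(row: Dict[str, Any]) -> bool:
    """Check that authors and reactions have one entry per post (issue + comments)."""
    n_posts = len(row["comments"]) + 1
    return (
        isinstance(row["issue_post"], str)
        and len(row["authors"]) == n_posts
        and len(row["reactions"]) == n_posts
    )


example_row = build_row(
    "The parser crashes on empty input.",
    "alice",
    ["+1"],
    [("Can you share a traceback?", "bob", []), ("Fixed in main.", "alice", ["heart"])],
)
```

A unit test against the dummy dataset could then assert `validate_row(row)` for every row it loads.
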
vanga commented 1 year ago

Based on information from @PhungVanDuy and my own research so far:

The https://ghtorrent.org/ project doesn't seem to be active. Data is only available up to 2019, and even then we can only get the issue IDs.

The BigQuery public dataset github_repo doesn't have issues and comments.

So our options seem to be either extracting the necessary data from the githubarchive BigQuery dataset or getting it via the GitHub API directly.

The information in githubarchive is event data, not a snapshot of GitHub data. There will be multiple events (create, update, delete, etc.) for the same GitHub resource (issue, repo, etc.).

The BigQuery data has a top-level field called "type", which is the event type. I think the event types of interest to us are IssueCommentEvent and IssuesEvent (Ref). The documentation says that the payload.action field can be "created", "edited" or "deleted", but BigQuery seems to contain data only for the "created" action: querying for other actions returns no data. I queried several randomly chosen daily tables, and none of them has data other than "created". I am not sure whether only "created" events are being archived.

SELECT * FROM
(
  SELECT JSON_EXTRACT(payload, '$.action') AS action, type, payload
  FROM `githubarchive.day.20210912`
  WHERE type = 'IssueCommentEvent'
) tb1
WHERE tb1.action != '"created"'
LIMIT 100

Some stats for this dataset: the cumulative size of all the daily tables in BigQuery is 17.7 TB as of today. Total events: 5B+.
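
The same check (which actions actually occur for the issue-related event types) can be sketched offline in Python over already-downloaded events. This is only a rough sketch: the function names are hypothetical, and it assumes each event is a dict with a "type" field and a "payload" that is either a dict (as in raw dumps) or a JSON string (as in the BigQuery tables).

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable

ISSUE_EVENT_TYPES = ("IssuesEvent", "IssueCommentEvent")


def get_payload(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return the payload as a dict, decoding it if it is stored as a JSON string."""
    payload = event.get("payload") or {}
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload


def count_actions(events: Iterable[Dict[str, Any]]) -> Counter:
    """Count (event type, payload.action) pairs for issue-related events only."""
    counts: Counter = Counter()
    for event in events:
        event_type = event.get("type")
        if event_type in ISSUE_EVENT_TYPES:
            counts[(event_type, get_payload(event).get("action"))] += 1
    return counts


# Made-up sample events, for illustration only.
sample_events = [
    {"type": "IssueCommentEvent", "payload": {"action": "created"}},
    {"type": "IssueCommentEvent", "payload": '{"action": "created"}'},
    {"type": "IssuesEvent", "payload": {"action": "opened"}},
    {"type": "PushEvent", "payload": {}},
]
action_counts = count_actions(sample_events)
```

Any key in `action_counts` other than "created" for IssueCommentEvent would indicate that edits or deletions are archived after all.
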

PhungVanDuy commented 1 year ago

Thank you for the great summary. It matches what I observed earlier. I wonder whether, across the multiple daily tables, we can merge all events that share the same GitHub URL (the GitHub URL should be available from the query above) and then sort each group by creation time?
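
A rough sketch of that merge-and-sort idea, assuming each event is a dict whose payload carries an "issue" object with an "html_url" field and whose "created_at" is an ISO-8601 UTC timestamp string; the function name and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_events_by_issue(
    events: Iterable[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group events by the issue URL in their payload, sorted by creation time."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        issue = (event.get("payload") or {}).get("issue") or {}
        url = issue.get("html_url")
        if url:  # skip events that carry no issue URL
            groups[url].append(event)
    for url in groups:
        # ISO-8601 UTC timestamps in the same format sort chronologically as strings.
        groups[url].sort(key=lambda e: e["created_at"])
    return dict(groups)


# Made-up sample events, for illustration only.
issue_a = {"html_url": "https://github.com/example/repo/issues/1"}
issue_b = {"html_url": "https://github.com/example/repo/issues/2"}
sample_events = [
    {"type": "IssueCommentEvent", "created_at": "2021-09-12T10:00:00Z", "payload": {"issue": issue_a}},
    {"type": "IssuesEvent", "created_at": "2021-09-12T09:00:00Z", "payload": {"issue": issue_a}},
    {"type": "IssuesEvent", "created_at": "2021-09-12T11:00:00Z", "payload": {"issue": issue_b}},
]
grouped = group_events_by_issue(sample_events)
```
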

vanga commented 1 year ago

I am contemplating a few things. If the githubarchive project is only archiving "created" events, are we OK with ignoring the edits made afterwards? Maybe the percentage of comments that get edited is small? I am not sure.

I have also downloaded a couple of raw dumps from https://www.gharchive.org/ and I only see "created" for comments. For issues, the action is only one of "opened", "closed", "reopened".

So, I guess for comments, we don't need to group. For issues, we will need to group.

We need to decide whether this is OK for the training requirements or not.

If it is OK to ignore the edits, then the next question is whether we want to rely on BigQuery or not. We can also download the raw dumps and filter them ourselves. We should be able to filter the dataset pretty fast using something like pyspark.

On the other hand, we can do some of the filtering in BigQuery itself and download only the filtered events. I am not sure what kind of costs using BigQuery would involve.
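
For the raw-dump route, a minimal single-machine sketch using only the Python standard library (not pyspark) might look like this. It assumes a dump is a gzip-compressed file with one JSON event per line and a top-level "type" field; the function name `iter_issue_events` is hypothetical.

```python
import gzip
import json
from typing import Any, Dict, Iterator

WANTED_TYPES = {"IssuesEvent", "IssueCommentEvent"}


def iter_issue_events(path: str) -> Iterator[Dict[str, Any]]:
    """Stream a gzipped JSON-lines dump and yield only issue-related events."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            event = json.loads(line)
            if event.get("type") in WANTED_TYPES:
                yield event
```

Because the function is a generator, a dump is never loaded into memory as a whole; the same per-line filter could later be moved into a pyspark job if a single machine turns out to be too slow.
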

If the content edits are important, we may have to collect the IDs/URLs from githubarchive and then use the GitHub API to get the latest data.

Here are the sample events for reference. https://gist.github.com/vanga/c8c99ac032f14ae15172148df792639c
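
If edits turn out to matter, the re-fetch route could start from something like the sketch below. It assumes that the "issue" and "comment" objects in an event payload carry a "url" field pointing at the corresponding GitHub API resource; the function names and sample URLs are hypothetical. Nothing is sent over the network here: `build_request` only prepares a request object, and rate limiting and retries are out of scope.

```python
import urllib.request
from typing import Any, Dict, Iterable, List, Optional


def collect_api_urls(events: Iterable[Dict[str, Any]]) -> List[str]:
    """Collect the distinct API URLs of issues and comments seen in the events."""
    urls = set()
    for event in events:
        payload = event.get("payload") or {}
        for key in ("issue", "comment"):
            url = (payload.get(key) or {}).get("url")
            if url:
                urls.add(url)
    return sorted(urls)


def build_request(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for one API resource."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Made-up sample events, for illustration only.
issue_url = "https://api.github.com/repos/example/repo/issues/1"
comment_url = "https://api.github.com/repos/example/repo/issues/comments/10"
sample_events = [
    {"type": "IssuesEvent", "payload": {"issue": {"url": issue_url}}},
    {
        "type": "IssueCommentEvent",
        "payload": {"issue": {"url": issue_url}, "comment": {"url": comment_url}},
    },
]
urls_to_fetch = collect_api_urls(sample_events)
requests_to_send = [build_request(url) for url in urls_to_fetch]
```

Deduplicating the URLs first means each issue or comment is fetched once, regardless of how many events refer to it.
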

vanga commented 1 year ago

It is clarified here that edit events are not part of gharchive.

vanga commented 1 year ago

The Issue comments API supports managing comments on issues and pull requests. Every pull request is an issue, but not every issue is a pull request. For this reason, "shared" actions for both features, like managing assignees, labels, and milestones, are provided within Issues API. To manage pull request review comments, use the Pull request review comments API instead.

GitHub APIs treat issues and pull requests in a similar manner (Ref).

IssuesEvent contains events related to issue and pull request creation/closing. IssueCommentEvent contains events related to issue and pull request comments.

In BigQuery, we can filter for events that are not pull requests like this:

SELECT * FROM
(
  SELECT JSON_QUERY(payload, '$.issue.pull_request') AS pull_request, type, payload
  FROM `githubarchive.day.20160912`
  WHERE type = 'IssuesEvent'
) tb1
WHERE tb1.pull_request IS NULL
LIMIT 100

Similarly, comments that are not pull requests' comments can be extracted.
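
The same filter can be sketched in Python for events taken from the raw dumps. It assumes, as in the query above, that an event belongs to a pull request when its payload has an "issue" object containing a "pull_request" key; the function names and sample events are hypothetical.

```python
from typing import Any, Dict, Iterable, List

ISSUE_EVENT_TYPES = ("IssuesEvent", "IssueCommentEvent")


def is_pull_request_event(event: Dict[str, Any]) -> bool:
    """Return True if the event's issue object carries a pull_request entry."""
    issue = (event.get("payload") or {}).get("issue") or {}
    return issue.get("pull_request") is not None


def keep_plain_issue_events(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep issue and issue-comment events that do not belong to a pull request."""
    return [
        event
        for event in events
        if event.get("type") in ISSUE_EVENT_TYPES and not is_pull_request_event(event)
    ]


# Made-up sample events, for illustration only.
sample_events = [
    {"type": "IssuesEvent", "payload": {"issue": {"number": 1}}},
    {
        "type": "IssueCommentEvent",
        "payload": {"issue": {"number": 2, "pull_request": {"url": "placeholder"}}},
    },
    {"type": "IssueCommentEvent", "payload": {"issue": {"number": 1}}},
]
plain_events = keep_plain_issue_events(sample_events)
```
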