ncoop57 opened 2 years ago
Based on information from @PhungVanDuy and my own research so far:
The https://ghtorrent.org/ project doesn't seem to be active; data is only available up to 2019, and even then we can only get the issue IDs.
The BigQuery public dataset github_repos doesn't have issues and comments.
So our options seem to be either extracting the necessary data from the githubarchive BigQuery dataset or going through the GitHub API directly.
Information in githubarchive is events data, not a snapshot of GitHub data: there will be multiple events (create, update, delete, etc.) for the same GitHub resource (issue, repo, etc.). The BigQuery data has a top-level field called "type", which is the event type. The event types of interest to us are IssueCommentEvent and IssuesEvent, I think.
Ref: the documentation says that the payload.action field can be "created", "edited", or "deleted", but BigQuery seems to only contain data for the "created" action. Querying for other actions returns no data. I tried querying several randomly chosen daily tables, and none of them contain data for actions other than "created". I am not sure if only "created" events are being archived.
SELECT * FROM
(
  -- JSON_EXTRACT returns a JSON-encoded string, hence the quoted '"created"' below
  SELECT JSON_EXTRACT(payload, '$.action') AS action, type, payload
  FROM `githubarchive.day.20210912`
  WHERE type = 'IssueCommentEvent'
) tb1
WHERE tb1.action != '"created"'
LIMIT 100
Some stats of this dataset: the cumulative size of all the daily tables in BigQuery is 17.7 TB as of today, with 5B+ total events.
Thank you for the great summary. It matches my observations so far. I wonder if, across the daily tables, we can merge every event that shares the same GitHub URL (the GitHub URL should be available in the query above), and then sort each group by creation time?
I am contemplating a few things. If the githubarchive project only archives "created" events, are we OK with ignoring the edits made afterwards? Maybe the percentage of comments that get edited is small; I am not sure.
I have also downloaded a couple of raw dumps from https://www.gharchive.org/. I only see "created" for comments. For issues, the action is only one of "opened", "closed", or "reopened".
So I guess we don't need to group comments. For issues, we will need to group; a sketch of that follows below.
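A minimal sketch of that grouping, assuming the relevant IssuesEvent records have already been extracted to a JSON-lines file (the file name is a placeholder; "created_at" and payload.issue.url are standard fields in the archived events):

import json

# Keep only the most recent IssuesEvent per issue URL.
latest = {}
with open("issues_events.jsonl") as f:  # hypothetical extracted file
    for line in f:
        event = json.loads(line)
        url = event["payload"]["issue"]["url"]
        # ISO-8601 UTC timestamps compare correctly as plain strings
        if url not in latest or event["created_at"] > latest[url]["created_at"]:
            latest[url] = event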
We need to decide whether this is OK for the training requirements or not.
If it is OK to ignore the edits, then the next question is whether we want to rely on BigQuery or not. We can also download the raw dumps and filter them ourselves; we should be able to filter the dataset pretty quickly using something like PySpark.
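As a rough sketch of that route (file paths are placeholders; this assumes the hourly .json.gz dumps from gharchive have been downloaded locally, since Spark reads gzipped JSON lines directly):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gharchive-filter").getOrCreate()

# Each gharchive dump is one JSON event per line; Spark handles the gzip transparently.
events = spark.read.json("dumps/2021-09-12-*.json.gz")  # hypothetical local path
comments = events.filter(events.type == "IssueCommentEvent")
comments.write.parquet("filtered/issue_comments")  # hypothetical output path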
On the other hand, we could do some of the filtering in BigQuery itself and download only the filtered events. I am not sure what kind of costs using BigQuery would involve.
If the content edits are important, we may have to collect the IDs/URLs from githubarchive and then use the GitHub API to get the latest data.
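A sketch of that re-fetching step, using the issue's API URL as collected from the events (the URL and token are placeholders, and a real crawler would need rate-limit handling, since authenticated requests are limited to 5,000 per hour):

import os
import requests

headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}  # hypothetical token
issue_url = "https://api.github.com/repos/octocat/Hello-World/issues/1"  # example URL

# The issue response includes a comments_url for fetching its comments.
issue = requests.get(issue_url, headers=headers).json()
comments = requests.get(issue["comments_url"], headers=headers).json()
print(issue["title"], len(comments))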
Here are the sample events for reference. https://gist.github.com/vanga/c8c99ac032f14ae15172148df792639c
It is clarified here that edit events are not part of gharchive.
The Issue comments API supports managing comments on issues and pull requests. Every pull request is an issue, but not every issue is a pull request. For this reason, "shared" actions for both features, like managing assignees, labels, and milestones, are provided within Issues API. To manage pull request review comments, use the Pull request review comments API instead.
GitHub APIs treat both issues and pull requests in a similar manner. Ref:
IssuesEvent contains events related to issue and pull request creation/close actions.
IssueCommentEvent contains events related to issue and pull request comments.
In BigQuery, we can filter for events that are not pull requests like this:
SELECT * FROM
(
  -- payload.issue.pull_request is present only when the "issue" is actually a pull request
  SELECT JSON_QUERY(payload, '$.issue.pull_request') AS pull_request, type, payload
  FROM `githubarchive.day.20160912`
  WHERE type = 'IssuesEvent'
) tb1
WHERE tb1.pull_request IS NULL
LIMIT 100
Similarly, comments that are not pull request comments can be extracted.
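For example, the analogous filter for comments could look like this, run through the google-cloud-bigquery Python client (the daily table is just an example, and BigQuery bills for the bytes scanned):

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT * FROM (
  SELECT JSON_QUERY(payload, '$.issue.pull_request') AS pull_request, type, payload
  FROM `githubarchive.day.20160912`
  WHERE type = 'IssueCommentEvent'
) tb1
WHERE tb1.pull_request IS NULL
LIMIT 100
"""
# Run the query and iterate over the filtered comment events.
for row in client.query(query).result():
    print(row["type"])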
GitHub Issues
Dataset URL - here
Does the dataset exist in a scraped format?
URL if Yes - here (only for the HF datasets repository)
Description
GitHub Issues are bug reports, feature requests, and discussions related to a repository. They contain text in a special GitHub markdown format, along with comments and reactions.
Procedure
We can use the procedure discussed in this blog post, which outlines how to do it for a specific repository. We just need to apply the exact same procedure, but for multiple repositories.
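A sketch of that extension, paging through each repository's issues endpoint (the repo list and token are placeholders; state=all includes closed issues, and per_page=100 is the API maximum):

import os
import requests

headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}  # hypothetical token
repos = ["huggingface/datasets", "huggingface/transformers"]  # example repositories

all_issues = []
for repo in repos:
    page = 1
    while True:
        resp = requests.get(
            f"https://api.github.com/repos/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers=headers,
        )
        batch = resp.json()
        if not batch:  # an empty page means we have exhausted this repo
            break
        all_issues.extend(batch)
        page += 1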
Tests
Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (which will then be converted into the final format for language model consumption), along with an example row or rows that you can verify your code collects correctly. In addition to this file, include a unit test that evaluates your code against this dummy_dataset.
Give an example of the columns and data:
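As a purely illustrative sketch (the column names are hypothetical, not a fixed schema; the row reuses the sample issue from the GitHub API docs):

import pandas as pd

# Hypothetical schema: one row per issue, with comments stored as a list column.
dummy = pd.DataFrame(
    {
        "repo": ["octocat/Hello-World"],
        "issue_url": ["https://api.github.com/repos/octocat/Hello-World/issues/1"],
        "title": ["Found a bug"],
        "body": ["I'm having a problem with this."],
        "comments": [["Me too!"]],
        "created_at": ["2011-04-22T13:33:48Z"],
    }
)
dummy.to_parquet("dummy_dataset.parquet")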