I set up a project that uses Google BigQuery for the Github Archive data. The archive only provides dumps of events, and any event-specific details are stored as raw JSON in a payload field, so they aren't directly searchable. The GHTorrent archive is much more comprehensive, and I need to see whether I can get that data uploaded to Google BigQuery (or whether someone else already has).
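For context, here's a rough sketch (not project code) of what querying the event dumps looks like with the Python BigQuery client: anything event-specific has to be dug out of the JSON payload with JSON functions rather than queried as a normal column. The table name, event type, and JSON path below are assumptions based on the public githubarchive dataset layout.

```python
# Sketch only: extract a field from the JSON `payload` column of the public
# githubarchive tables. Table name, event type, and JSON path are assumptions.
from google.cloud import bigquery

client = bigquery.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

query = """
SELECT
  repo.name AS repo_name,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action
FROM `githubarchive.day.20170101`
WHERE type = 'IssuesEvent'
LIMIT 100
"""

for row in client.query(query).result():
    print(row.repo_name, row.action)
```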
I haven't had any success downloading the GHTorrent data. It's about 43 GB, the download rate is very slow, and the server frequently drops the connection. In addition, based on my experience uploading the same group's TravisCI data into Google BigQuery, I'm concerned that the data may be incomplete. I propose that we just use the events data for sample selection. Once we've identified our target population, we can pull a large sample based on activity (e.g., projects that have public events in the past 6 months) and then work with the subset of that sample we're interested in. We can pull any extra information for the Github projects ourselves via the API using the Shuffleboard script.
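To make the proposal concrete, here's a rough sketch of that sampling flow. This is not the actual Shuffleboard script; the table names, date range, activity threshold, and API fields are placeholders for illustration.

```python
# Sketch of the proposed flow: (1) use the public event tables to find repos
# active in roughly the last 6 months, (2) enrich the repo list via the
# GitHub API. Table names and thresholds are assumptions.
import requests
from google.cloud import bigquery

client = bigquery.Client()

sample_query = """
SELECT repo.name AS repo_name, COUNT(*) AS event_count
FROM `githubarchive.month.2016*`
WHERE _TABLE_SUFFIX BETWEEN '07' AND '12'  -- roughly the past 6 months
GROUP BY repo_name
HAVING event_count > 50                    -- arbitrary activity threshold
"""

active_repos = [row.repo_name for row in client.query(sample_query).result()]

# Pull extra project metadata from the GitHub API for the subset we care about.
for full_name in active_repos[:10]:
    resp = requests.get("https://api.github.com/repos/{}".format(full_name))
    if resp.ok:
        info = resp.json()
        print(full_name, info.get("stargazers_count"), info.get("language"))
```

In practice the API calls would need an auth token to avoid rate limiting, but the overall shape would look like this.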
Google BigQuery is a service available to Google Cloud users. Right now they are offering a 60-day free trial with a $300 credit. Their rates are good overall, and I seriously doubt we'd come anywhere close to maxing that out. After the free trial, query pricing is about $5 per TB processed. For now, folks need to set up their own accounts if they are interested in manipulating these data sets. I can provide very limited access to my projects for audit purposes, but that access will come with a strict quota. I'm adding a work item to document how to set up Google BigQuery for working with these large data sets.
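Until that documentation exists, here's a minimal setup sketch with the Python client, assuming a service-account JSON key downloaded from the Cloud console; the key path and project id are placeholders.

```python
# Minimal BigQuery setup sketch; the key path and project id are placeholders.
import os
from google.cloud import bigquery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
client = bigquery.Client(project="my-bigquery-project")

# Sanity check: list the datasets this account can see.
for dataset in client.list_datasets():
    print(dataset.dataset_id)
```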
The Github Archive project collects and publishes all public Github event data on Google BigQuery.