I set up a project that uses Google BigQuery for the Github Archive data. The archive only provides dumps of events, and any event-specific details are stored as raw JSON in a payload field, so they aren't directly searchable. The GHTorrent archive is much more comprehensive, and I need to see whether I can get that data uploaded to Google BigQuery (or whether someone else already has).
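For context, here's a rough sketch (not project code) of what querying the event dumps looks like with the Python BigQuery client: anything event-specific has to be dug out of the JSON payload with JSON functions rather than queried as a normal column. The table name, event type, and JSON path below are assumptions based on the public githubarchive dataset layout.

```python
# Sketch only: extract a field from the JSON `payload` column of the public
# githubarchive tables. Table name, event type, and JSON path are assumptions.
from google.cloud import bigquery

client = bigquery.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS

query = """
SELECT
  repo.name AS repo_name,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action
FROM `githubarchive.day.20170101`
WHERE type = 'IssuesEvent'
LIMIT 100
"""

for row in client.query(query).result():
    print(row.repo_name, row.action)
```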
I haven't had any success downloading the GHTorrent data. It's about 43 GB, the download rate is very slow, and the server frequently drops the connection. In addition, based on my experience uploading the same group's TravisCI data into Google BigQuery, I'm concerned that the data may be incomplete. I propose that we just use the events data for sample selection. Once we've identified our target population, we can pull a large sample based on activity (e.g., projects that have public events in the past 6 months) and then work with the subset of that sample we're interested in. We can pull any extra information for the Github projects ourselves via the API using the Shuffleboard script.
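To make the proposal concrete, here's a rough sketch of that sampling flow. This is not the actual Shuffleboard script; the table names, date range, activity threshold, and API fields are placeholders for illustration.

```python
# Sketch of the proposed flow: (1) use the public event tables to find repos
# active in roughly the last 6 months, (2) enrich the repo list via the
# GitHub API. Table names and thresholds are assumptions.
import requests
from google.cloud import bigquery

client = bigquery.Client()

sample_query = """
SELECT repo.name AS repo_name, COUNT(*) AS event_count
FROM `githubarchive.month.2016*`
WHERE _TABLE_SUFFIX BETWEEN '07' AND '12'  -- roughly the past 6 months
GROUP BY repo_name
HAVING event_count > 50                    -- arbitrary activity threshold
"""

active_repos = [row.repo_name for row in client.query(sample_query).result()]

# Pull extra project metadata from the GitHub API for the subset we care about.
for full_name in active_repos[:10]:
    resp = requests.get("https://api.github.com/repos/{}".format(full_name))
    if resp.ok:
        info = resp.json()
        print(full_name, info.get("stargazers_count"), info.get("language"))
```

In practice the API calls would need an auth token to avoid rate limiting, but the overall shape would look like this.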
Google BigQuery is a service available to Google Cloud users. Right now they are offering a 60-day free trial with a $300 credit. Their rates are good overall, and I seriously doubt we'd come anywhere close to maxing that out. After the free trial, query pricing is about $5 per TB processed. For now, folks need to set up their own accounts if they are interested in manipulating these data sets. I can provide very limited access to my projects for audit purposes, but that access will come with a strict quota. I'm adding a work item to document how to set up Google BigQuery for working with these large data sets.
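Until that documentation exists, here's a minimal setup sketch with the Python client, assuming a service-account JSON key downloaded from the Cloud console; the key path and project id are placeholders.

```python
# Minimal BigQuery setup sketch; the key path and project id are placeholders.
import os
from google.cloud import bigquery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
client = bigquery.Client(project="my-bigquery-project")

# Sanity check: list the datasets this account can see.
for dataset in client.list_datasets():
    print(dataset.dataset_id)
```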
The Github Archive project collects and publishes all public Github event data on Google BigQuery.