Build an improved ranking of companies as contributors to public GitHub (based on this blog post).
The user-to-company association in the ranking blog post that inspired us is not ideal: it uses the email address associated with a user's git config and, if the email's domain is not that of a public mail provider (Gmail, Yahoo, etc.), assumes the domain belongs to a company. That's not a great way of going about it, because few people use their company email in the git config they use with their public GitHub.com account.
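For illustration, that heuristic boils down to something like the following sketch (the provider list and function name are mine, not the blog post's):

```js
// Sketch of the email-domain heuristic: any domain that isn't a known
// public mail provider gets treated as a company. Provider list is
// illustrative only.
const PUBLIC_PROVIDERS = new Set(['gmail.com', 'yahoo.com', 'hotmail.com', 'outlook.com']);

function companyFromEmail(email) {
  const domain = (email.split('@')[1] || '').toLowerCase();
  if (!domain || PUBLIC_PROVIDERS.has(domain)) return null; // can't attribute
  return domain; // e.g. 'microsoft.com' is assumed to mean Microsoft
}

console.log(companyFromEmail('jane@microsoft.com')); // 'microsoft.com'
console.log(companyFromEmail('jane@gmail.com'));     // null
```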
To make that association better, this project cross-references GitHub.com activity, which is tracked via githubarchive.org data (freely available as a dataset in Google BigQuery), with GitHub.com user profiles. We pull the `company` field from users' profiles and store it in a periodically-updated (currently monthly) database that we then copy over into BigQuery.
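The profile lookup at the heart of that approach amounts to a single REST call per user. A minimal sketch, assuming Node 18+ for the built-in `fetch` (this is not the repo's actual code):

```js
// Ask the GitHub REST API for a user's self-reported company field.
// `token` is a GitHub personal access token.
async function companyFromProfile(username, token) {
  const res = await fetch(`https://api.github.com/users/${username}`, {
    headers: { Authorization: `token ${token}`, 'User-Agent': 'oss-stats-sketch' },
  });
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  const profile = await res.json();
  return profile.company; // free-form text, e.g. '@google' or 'Google Inc.'
}
```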
The underlying bits of this project can be complex, but hopefully I am doing a decent job of hiding all of that away and exposing the most useful analytics through this program's CLI.
You will need a recent version of node.js and BigQuery credentials in order to access the data I manage. To get BigQuery credentials, sign up for a Google Cloud account; only authenticated Google Cloud users have access to the underlying data.
Clone this repo, `cd` into it, and run `npm install`.
You are now ready to use the project. When in doubt, there is decent built-in command-line help that you can invoke by running `./bin/oss.js help`.
We have a BigQuery project with relevant supporting tables and queries. If you'd like access, contact @filmaj (via an issue in this repo or on Twitter). This project contains:

1. A BigQuery table, `users_pushes_2017`, that we used as a baseline. This table tracks all GitHub users that had at least 1 commit in 2017.
2. A process that pulls user profile information from GitHub.com and stores the `company` field from that info into the DB table described below.

Check out the `src/util/companies.js` file. How it works:
1. A `bigquery.json` file is needed in the root of the repo, which contains the credentials for access to Google Cloud BigQuery. More info on how to set this file up is available in the BigQuery docs.
2. An `oauth.token` file is needed in the root of the repo, which contains GitHub.com personal access tokens, one per line, which we will use to get data from api.github.com. In my version of this file, I have several tokens (thanks to all my nice friends who graciously granted me one), as there is a maximum of 5,000 calls per hour per token to the GitHub REST API.
3. A MySQL database is needed; it runs on `leopardprdd`, port 3323, and the database name, table name and username are all `GHUSERCO`. @filmaj has the password. The schema for this table is in the `usercompany.sql` file.
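To make the role of those two credential files concrete, here is a hedged sketch of how they could be loaded at startup (the file names match the list above; the loading code itself is illustrative, not necessarily what this repo does):

```js
const fs = require('fs');
const path = require('path');
const { BigQuery } = require('@google-cloud/bigquery');

// bigquery.json: a Google Cloud service account key, handed to the
// BigQuery client via keyFilename.
const bigquery = new BigQuery({ keyFilename: path.join(__dirname, 'bigquery.json') });

// oauth.token: one GitHub personal access token per line. Rotating
// through several tokens stretches the 5,000 requests/hour/token limit.
const tokens = fs
  .readFileSync(path.join(__dirname, 'oauth.token'), 'utf8')
  .split('\n')
  .map((line) => line.trim())
  .filter(Boolean);

let cursor = 0;
const nextToken = () => tokens[cursor++ % tokens.length]; // round-robin
```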
file.$ npm install
$ npm link
At this point you should be able to run the CLI and provide it subcommands:
This command will pull the rows from a BigQuery table containing github.com usernames, pull user profile information for each user from the GitHub.com REST API, and store the resulting `company` field (and the `ETag` response header) in a MySQL DB table.
```
$ node bin/oss.js update-db <bigquery-table-of-user-activity>
```
On the last run (Feb 2018), running this command against a BigQuery table containing ~1.5 million github.com usernames took about 6 days.
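The reason the `ETag` is stored alongside the `company` field: sending it back as `If-None-Match` on the next run lets GitHub answer `304 Not Modified`, which does not count against the rate limit. A sketch of that conditional-request pattern (hypothetical helper, not the repo's actual update loop):

```js
async function refreshUser(username, token, cachedEtag) {
  const headers = { Authorization: `token ${token}`, 'User-Agent': 'oss-stats-sketch' };
  if (cachedEtag) headers['If-None-Match'] = cachedEtag; // revalidate cheaply

  const res = await fetch(`https://api.github.com/users/${username}`, { headers });
  if (res.status === 304) return null; // unchanged: keep the cached DB row

  const profile = await res.json();
  // Caller would UPDATE the MySQL row with both values.
  return { company: profile.company, etag: res.headers.get('etag') };
}
```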
This command will push the MySQL DB up to BigQuery. Note that it will delete the table you specify before pushing up the results.
```
$ node bin/oss.js db-to-bigquery <bigquery-table-of-user-company-affiliations>
```
On last run (Feb 2018), this command took a few minutes to complete.
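In outline, that step looks something like the following (the dataset name, schema, and connection details are placeholders based on the list above, and a real run over ~1.5 million rows would use a load job or batched inserts rather than one streaming insert):

```js
const { BigQuery } = require('@google-cloud/bigquery');
const mysql = require('mysql2/promise');

async function dbToBigQuery(tableName) {
  const bigquery = new BigQuery({ keyFilename: 'bigquery.json' });
  const dataset = bigquery.dataset('github'); // placeholder dataset name

  // The destination table is deleted first, as warned above.
  const table = dataset.table(tableName);
  const [exists] = await table.exists();
  if (exists) await table.delete();
  await dataset.createTable(tableName, {
    schema: 'login:STRING, company:STRING, etag:STRING', // placeholder schema
  });

  const db = await mysql.createConnection({
    host: 'leopardprdd', port: 3323, user: 'GHUSERCO', database: 'GHUSERCO',
    password: process.env.GHUSERCO_PASSWORD,
  });
  const [rows] = await db.execute('SELECT login, company, etag FROM GHUSERCO');
  await dataset.table(tableName).insert(rows);
  await db.end();
}
```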
If you're still with me here: wow, thanks for sticking it out. How all of this fits together:
1. Create a BigQuery table containing the github.com usernames whose activity you care about (for example, by querying the githubarchive.org dataset).
2. Run the `update-db` command, specifying the BigQuery table name you created in (1), to get the latest company affiliations for the users identified in (1) stored in your MySQL DB. This usually takes days. You have been warned.
3. Run the `db-to-bigquery` command to send these affiliations up to BigQuery. Note that the table you specify to store these affiliations in, if it already exists, will be deleted. This should only take a few minutes.
4. Run your analysis queries in BigQuery, cross-referencing the activity data with the affiliations table (via a `JOIN`; see the sketch after this list). BigQuery is awesome, so this should never take more than a minute, though do keep an eye on your bill as, well, money goes fast ;)
on your bill as, well, money goes fast ;)Firstly, check out our contribution guidelines. Secondly, there are probably way better ways of doing this! For example, I've noticed that the company field info is somewhat available directly in BigQuery, so probably the whole "use a MySQL DB" thing is dumb. I'm grateful for any help 🙏.