filmaj / oss-contributors

How do tech companies rank amongst themselves when it comes to github.com activity?
Apache License 2.0
16 stars 1 forks source link

Tracking Open Source Contributors

Build a(n improved) ranking of companies-as-contributors-to-public-GitHub (based on this blog post).

Why?

The user-to-company association in the ranking blog post that inspired us is not ideal: it uses the email associated to a git config, and if the domain to the email is NOT of a public mail provider (gmail, yahoo, etc), it assumes it's a company. That's not a great way of going about it because not many people use their company's e-mail in the git config they use with their public GitHub.com account.

To make that association better, this project cross-reference GitHub.com activity, which is tracked via githubarchive.org data (and is freely available as a dataset in Google BigQuery) with GitHub.com user profiles. We pull the company field from user's profiles and store those in a periodically-updated (currently monthly) database that we then copy over into BigQuery.

Usage

The underlying bits of this project can be complex, but hopefully I am doing a decent job hiding all of that away and exposing the most useful analytics through this program's CLI.

Requirements

You will need a recent version of node.js and BigQuery credentials in order to access the data I manage. In order to get BigQuery credentials, you must sign up for a Google Cloud account. Only authenticated Google Cloud users have access to the underlying data I manage.

Clone this repo, cd into it and run npm install.

You are now ready to use the project. When in doubt, there should be decent built-in command-line help that you can invoke by running ./bin/oss.js help.

Administrating this Project

Features

Implementation

We have a BigQuery project with relevant supporting tables and queries. If you'd like access, contact @filmaj (via an issue in this repo or on twitter). This project contains:

  1. A database table tracking user-company associations (currently done in an Adobe IT managed MySQL DB). Fields include GitHub username, company field, fingerprint (ETag value as reported from GitHub, as a cache-buster). We synchronize the MySQL DB with BigQuery every now and then using a command this program provides.
  2. Another table tracks GitHub usernames active over a certain time period.
  3. For each active user identified in (2), we pound the GitHub REST API to pull user profile info, and drop the company field from that info into the DB table described in (1).

How Are Companies Tracked?

Check out the src/util/companies.js file. How it works:

  1. There is a "catch-all" regular expression (🤡) that tries to match on known tech company names.
  2. If a match is detected, then we try to map that back to a nicer label for a company name. Note that multiple expressions from the company catch-all may map to a single company (e.g. AWS, AMZN and Amazon all map back to Amazon).

Requirements

Doing The Thing

$ npm install
$ npm link

At this point you should be able to run the CLI and provide it subcommands:

Updating MySQL DB of User-Company Affiliations

This command will pull the rows from a bigquery table containing github.com usernames, pull user profile information for each user from the GitHub.com REST API and store the result of the company field (and the ETag) in a MySQL DB table.

$ node bin/oss.js update-db <bigquery-table-of-user-activity>

Running this command and pointing it to a bigquery table containing ~1.5 million github.com usernames, on last run (Feb 2018), took about 6 days.

Uploading Results Back to BigQuery

This command will push the MysQL DB up to BigQuery. This command will delete the table you specify before pushing up the results.

$ node bin/oss.js db-to-bigquery <bigquery-table-of-user-company-affiliations>

On last run (Feb 2018), this command took a few minutes to complete.

Putting It All Together

If you're still with me here: wow, thanks for sticking it out. How all of this fits together:

  1. Run the incremental user activity query on BigQuery, and store the result in a new table. I usually run this on a monthly basis, but you are free to use whatever time interval you wish.
  2. Run this program's update-db command, specifying the bigquery table name you created in (1), to get the latest company affiliations for the users identified in (1) stored in your MySQL DB. This usually takes days. You have been warned.
  3. Run this program's db-to-bigquery command to send these affiliations up to bigquery. Note that the table you specify to store these affiliations in, if it already exists, will be deleted. This should only take a few minutes.
  4. Run the contributor-count, repo-count and stars-accrued query on BigQuery, and store the result in a new table. This query will look at all github activity over the time period you specify (top of the query) and correlate it with the user-company affiliations table we created in (3). Make sure you use the correct table name for the user-company affiliations in the query (search for JOIN). BigQuery is awesome so this should never take more than a minute, though do keep an eye on your bill as, well, money goes fast ;)
  5. Bask in sweet, sweet data.

Contributing

Firstly, check out our contribution guidelines. Secondly, there are probably way better ways of doing this! For example, I've noticed that the company field info is somewhat available directly in BigQuery, so probably the whole "use a MySQL DB" thing is dumb. I'm grateful for any help 🙏.