enhancements suggested for the `users_companies` table

adobe / oss-contributors

How do tech companies rank amongst themselves when it comes to github.com activity?

Apache License 2.0

enhancements suggested for the `users_companies` table #4

Open fhoffa opened 5 years ago

fhoffa commented 5 years ago

Thanks for sharing this!

https://bigquery.cloud.google.com/table/public-github-adobe:github_archive_query_views.users_companies?pli=1&tab=details

Suggestions:

Add the field 'user_id', as people can change their nick (but not their id).
Add the field 'crawled_at'. Accounts will be associated with multiple companies over their lifetime, and this will allow you to attribute commits that happened x years ago to the right company.

With 'crawled_at' you'll have to allow multiple entries per user, and adjust queries later. For example, the easiest queries would go through a view that just gives the latest company per user.
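A minimal sketch of what these two suggestions could look like, using Python's built-in `sqlite3` as a stand-in for BigQuery. Everything here is an assumption made for illustration: the column types, the `user` and `company` column names, the view name `users_companies_latest`, the helper `company_at`, and the sample rows. The rule used for historical attribution (take the most recent crawl at or before the commit date) is also just one possible choice, since a crawl date records when a company was observed, not when the person joined it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_companies (
    user_id    INTEGER NOT NULL,  -- stable account id (assumed column)
    user       TEXT NOT NULL,     -- login at crawl time; may change
    company    TEXT,
    crawled_at TEXT NOT NULL      -- ISO-8601 date of the profile crawl
);

-- Hypothetical view: one row per user, holding the latest crawled company.
CREATE VIEW users_companies_latest AS
SELECT uc.user_id, uc.user, uc.company, uc.crawled_at
FROM users_companies AS uc
JOIN (
    SELECT user_id, MAX(crawled_at) AS crawled_at
    FROM users_companies
    GROUP BY user_id
) AS newest
  ON newest.user_id = uc.user_id
 AND newest.crawled_at = uc.crawled_at;
""")

# Made-up sample data: user 1 changed both login and company between crawls.
conn.executemany(
    "INSERT INTO users_companies VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "Acme", "2016-01-01"),
        (1, "alice-dev", "Globex", "2018-06-01"),
        (2, "bob", "Initech", "2017-03-15"),
    ],
)


def latest_companies(conn):
    """Return {user_id: company} from the latest-company view."""
    rows = conn.execute(
        "SELECT user_id, company FROM users_companies_latest ORDER BY user_id"
    )
    return dict(rows.fetchall())


def company_at(conn, user_id, when):
    """Return the company from the most recent crawl at or before `when`."""
    row = conn.execute(
        "SELECT company FROM users_companies "
        "WHERE user_id = ? AND crawled_at <= ? "
        "ORDER BY crawled_at DESC LIMIT 1",
        (user_id, when),
    ).fetchone()
    return row[0] if row else None
```

Simple queries would read from the view, while queries that attribute old commits would call something like `company_at` with the commit date. Keying on `user_id` rather than the login keeps both crawls of the renamed account together.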

filmaj commented 5 years ago

Great idea, this has come up a few times already in questions on Twitter.

This will take some modification to the Node script, as it keeps an in-memory DB of user:(current) company associations, which is written out to a huge JSON file between program runs. Sounds like the users_companies table would diverge from this somewhat. Need to consider if/how the schemas for the MySQL DB, the JSON file and the users_companies table change.
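One way the in-memory store could change shape, sketched in Python for illustration only (the actual script is Node, and its real data layout is not shown in this thread). The function names `record_crawl` and `to_rows`, the field names, and the sample data are all hypothetical. The idea is to key on `user_id` and keep a list of observations per user instead of a single current company.

```python
import json


def record_crawl(db, user_id, login, company, crawled_at):
    """Record a crawl; append a new entry only when the company changed.

    Keys are strings because JSON object keys are always strings.
    Returns True if a new history entry was added.
    """
    history = db.setdefault(str(user_id), [])
    if history and history[-1]["company"] == company:
        history[-1]["login"] = login  # keep the most recent login
        return False
    history.append(
        {"login": login, "company": company, "crawled_at": crawled_at}
    )
    return True


def to_rows(db):
    """Flatten the store into one row per (user, company observation)."""
    return [
        {"user_id": int(uid), **entry}
        for uid, entries in db.items()
        for entry in entries
    ]


# Made-up usage: the second crawl sees no change, the third sees a new company.
db = {}
record_crawl(db, 1, "alice", "Acme", "2016-01-01")
record_crawl(db, 1, "alice", "Acme", "2017-01-01")
record_crawl(db, 1, "alice-dev", "Globex", "2018-06-01")

serialized = json.dumps(db)  # what would be written out between runs
rows = to_rows(json.loads(serialized))
```

With this layout each entry's `crawled_at` is the date a company was first observed, and the flattened rows line up with a `users_companies` table that allows multiple entries per user.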

filmaj commented 5 years ago

Probably necessitates a larger discussion about the CLI. The Node commands are essentially data transfer scripts between BigQuery, MySQL and JSON, plus the github.com profile scraper. Data transfers between mediums are not complete: there's db-to-json and json-to-bigquery, but not vice versa or anything else. Would you find those commands helpful?