countering-bean-counting / bonnyci_ci-plunder

CI usage data plundering

Pick some easily verifiable statistics for a demographic model #5

missaugustina closed 7 years ago

missaugustina commented 7 years ago

E.g., language distribution, number of contributors, license distribution.

Run this against multiple random samples to check sample variability
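A minimal sketch of the variability check, assuming each project is a dict with a `language` key. The record shape, the function names, and the made-up population below are illustrative assumptions, not the project's actual data or code. It draws several random samples, computes each language's share per sample, and reports the mean and standard deviation of that share across samples.

```python
import random
import statistics
from collections import Counter


def language_shares(projects):
    """Return {language: fraction of projects} for one sample."""
    counts = Counter(p["language"] for p in projects)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def share_variability(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples; return {language: (mean share, stdev)}."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    per_sample = [
        language_shares(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    languages = sorted({lang for shares in per_sample for lang in shares})
    summary = {}
    for lang in languages:
        # A language missing from a sample counts as a share of 0.
        values = [shares.get(lang, 0.0) for shares in per_sample]
        summary[lang] = (statistics.mean(values), statistics.pstdev(values))
    return summary


# Hypothetical population, for illustration only.
population = (
    [{"language": "Python"}] * 500
    + [{"language": "JavaScript"}] * 300
    + [{"language": "Go"}] * 200
)
summary = share_variability(population, sample_size=100, n_samples=20)
for lang, (mean, sd) in summary.items():
    print(f"{lang}: mean share {mean:.3f}, stdev {sd:.3f}")
```

A small standard deviation relative to the mean share suggests the statistic is stable at that sample size. The same pattern would apply to contributor counts or license distribution by swapping the key being counted.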

missaugustina commented 7 years ago

Which languages are most represented, overall and per stratum? I'm curious whether targeting just Python and Node.js projects would be sufficient for our research given the timeline. We could iterate and add more languages to the study as time allows.
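One way to frame that question is as coverage: what fraction of projects, overall and within each stratum, would a Python plus Node.js target capture? The sketch below assumes each project record carries a `language` and a `stratum` key; both field names and the sample records are hypothetical, since the strata are not defined in this thread. It also assumes Node.js projects show up as `JavaScript` in the language data, which may undercount ones written in other languages.

```python
from collections import defaultdict


def coverage(projects, targets):
    """Fraction of projects whose language is in the target set."""
    if not projects:
        return 0.0
    hits = sum(1 for p in projects if p["language"] in targets)
    return hits / len(projects)


def coverage_by_stratum(projects, targets):
    """Return {stratum: coverage} by grouping projects on their stratum."""
    groups = defaultdict(list)
    for p in projects:
        groups[p["stratum"]].append(p)
    return {stratum: coverage(group, targets) for stratum, group in groups.items()}


# Hypothetical records, for illustration only.
projects = [
    {"language": "Python", "stratum": "small"},
    {"language": "JavaScript", "stratum": "small"},
    {"language": "Go", "stratum": "small"},
    {"language": "Java", "stratum": "small"},
    {"language": "Python", "stratum": "large"},
    {"language": "Ruby", "stratum": "large"},
]
targets = {"Python", "JavaScript"}  # assumption: Node.js is reported as JavaScript

print("overall:", coverage(projects, targets))
print("per stratum:", coverage_by_stratum(projects, targets))
```

If coverage is high overall but low in one stratum, restricting the study to those two languages would skew the sample against that stratum.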

missaugustina commented 7 years ago

What is the difference between projects that belong to an organization and ones that don't? Could we potentially just filter out based on org association and still get a reasonably representative sample?

missaugustina commented 7 years ago

Language determiner (if that data isn't otherwise available): https://github.com/github/linguist

missaugustina commented 7 years ago

I did some initial analysis based on the GHTorrent data and need to feed that into R to make a report.

missaugustina commented 7 years ago

The org vs. non-org question was shifted to event types/participation rate, because it's not clear how accurate a division org membership would be, and event type may get us what we want more easily. Further analysis is needed in this area to be certain.

missaugustina commented 7 years ago

Initial statistics pulled from the GitHub API: number of repos owned by the repo owner, age of repo, repo last updated, language (per GitHub's language detection), releases (per GitHub's releases API), readme existence, readme size, "build status" tag in readme host, build status in readme vs releases.
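A rough sketch of how a few of these could be derived from data already fetched from the API. It assumes a repository object with the `created_at`, `updated_at`, and `language` fields that GitHub's REST API returns, plus the README text fetched separately. The function names, the sample values, and the simple "build status" text match are assumptions for illustration; the detection actually used for this analysis is not described here.

```python
import re
from datetime import datetime, timezone


def parse_ts(value):
    """Parse a GitHub-style UTC timestamp such as 2016-01-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def repo_stats(repo, readme_text, now):
    """Derive simple per-repo statistics; readme_text is None if no README."""
    created = parse_ts(repo["created_at"])
    updated = parse_ts(repo["updated_at"])
    has_readme = readme_text is not None
    return {
        "age_days": (now - created).days,
        "days_since_update": (now - updated).days,
        "language": repo.get("language"),
        "has_readme": has_readme,
        "readme_size": len(readme_text.encode("utf-8")) if has_readme else 0,
        # Assumption: a case-insensitive text match stands in for badge detection.
        "has_build_status": bool(
            has_readme and re.search(r"build status", readme_text, re.IGNORECASE)
        ),
    }


# Hypothetical inputs, for illustration only.
repo = {
    "created_at": "2016-01-01T00:00:00Z",
    "updated_at": "2017-01-01T00:00:00Z",
    "language": "Python",
}
readme = "# demo\n[![Build Status](https://example.invalid/badge.svg)](https://example.invalid)\n"
now = datetime(2017, 1, 11, tzinfo=timezone.utc)

stats = repo_stats(repo, readme, now)
print(stats)
```

Passing `now` in explicitly keeps the age calculation reproducible across runs over the same snapshot of data.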