Closed missaugustina closed 7 years ago
What languages are most represented overall and per stratum? I'm curious whether targeting just Python and Node.js projects would be sufficient for our research given the timeline. We could iterate and add more languages to the study as time allows.
What is the difference between projects that belong to an organization and ones that don't? Could we potentially just filter out based on org association and still get a reasonably representative sample?
Language determiner (if that data isn't otherwise available): https://github.com/github/linguist
I did some initial analysis based on the GHTorrent data and need to feed that into R to make a report.
The org vs. non-org question was shifted to event types and participation rate: it's not clear how accurate a division org association would be, and event type may get us what we want more easily. Further analysis is needed in this area to be certain.
Initial statistics pulled from the GitHub API: number of repos owned by the repo owner, age of repo, repo last updated, language (per GitHub's language detection), releases (per GitHub's release API), readme existence, readme size, "build status" tag in readme host, build status in readme vs releases.
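As a rough illustration of a few of these statistics, here is a minimal sketch that works on a repository record already fetched from the GitHub REST API (no network calls). It assumes the record has the `created_at`, `updated_at`, `language` and `owner.login` fields in the API's ISO 8601 format. The `repo_summary` helper and the "build status" substring check on the readme text are hypothetical, not what was actually run for this issue.

```python
from datetime import datetime, timezone

GITHUB_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. "2015-01-01T00:00:00Z"


def parse_github_time(value):
    """Parse a GitHub API timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, GITHUB_TIME_FORMAT).replace(tzinfo=timezone.utc)


def repo_summary(repo, readme_text=None, now=None):
    """Summarise one repo record (hypothetical helper, for illustration only).

    `repo` is a dict shaped like the repository JSON from the GitHub API.
    `readme_text` is the readme content if one exists, else None.
    """
    now = now or datetime.now(timezone.utc)
    created = parse_github_time(repo["created_at"])
    updated = parse_github_time(repo["updated_at"])
    return {
        "owner": repo["owner"]["login"],
        "language": repo.get("language"),  # None when GitHub detects no language
        "age_days": (now - created).days,
        "days_since_update": (now - updated).days,
        "has_readme": readme_text is not None,
        "readme_size": len(readme_text) if readme_text is not None else 0,
        # Assumption: a crude case-insensitive text match stands in for
        # detecting a "build status" badge in the readme.
        "readme_mentions_build_status": (
            readme_text is not None and "build status" in readme_text.lower()
        ),
    }
```

Passing `now` explicitly keeps the age calculation reproducible when the same snapshot is re-analysed later.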
E.g., language distribution, number of contributors, license distribution
Run this against multiple random samples to check sample variability
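One way to check sample variability could look like the sketch below: draw several random samples from the full repo list, compute the language share in each, and report the lowest and highest share seen per language. The function names, the `language` key and the seeded generator are assumptions for illustration. The analysis mentioned above was planned in R, so this is only a Python outline of the same idea.

```python
import random
from collections import Counter


def language_shares(repos):
    """Return {language: fraction of repos} for a list of repo dicts."""
    counts = Counter(r["language"] for r in repos)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def share_spread(population, sample_size, n_samples, seed=0):
    """Return {language: (min_share, max_share)} across repeated samples.

    Each sample is drawn without replacement from `population`.
    A wide gap between min and max suggests the sample size is too
    small for the language distribution to be stable.
    """
    rng = random.Random(seed)  # seeded so the check is repeatable
    languages = {r["language"] for r in population}
    observed = {lang: [] for lang in languages}
    for _ in range(n_samples):
        shares = language_shares(rng.sample(population, sample_size))
        for lang in languages:
            # A language absent from a sample counts as a share of 0.
            observed[lang].append(shares.get(lang, 0.0))
    return {lang: (min(v), max(v)) for lang, v in observed.items()}
```

The same loop could be repeated for number of contributors or license distribution by swapping the key that `language_shares` counts.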