Closed sgoggins closed 7 years ago
Hey @bkeepers and @wingr: a few thoughts on our approach here would be most welcome. :)
How and to what extent are health and sustainability indicators understood by community owners and other stakeholders?
Just a note: The discussion on developing an understanding of the indicators occurs in the HealthIndicators repository.
Hey @sgoggins, sorry for the delay here. I just wanted to give you a heads up that I'm at FOSDEM right now and it'll be a few more days before I get a chance to reply.
@sgoggins likewise sorry for the delayed response.
Looking over the technical section, your approach looks sound to me. I also believe that GHTorrent and GitHub Archive are going to be your best sources of information, although @bkeepers knows a little more about the public data sources than I do. I believe that you should be able to get what you need from the GitHub Archive without needing privileged API access. Since it is updated hourly, it should also provide you with enough up-to-date information.
There are also a number of scripts and wrapper code that people have created to help pull data from these sources that you can find by Googling.
A team I work closely with is in the process of trying to get better documentation around using these public data sets, so ping me with questions or challenges that you encounter and I will pass them along and try to help where I can.
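As a rough illustration of the kind of wrapper script mentioned above, here is a minimal Python sketch of reading one hourly GitHub Archive file and counting event types for a single repository. It assumes the archive is served as gzipped, newline-delimited JSON at `https://data.gharchive.org/YYYY-MM-DD-H.json.gz` (the host name may differ from the one in use when this thread was written), and that each event carries `type` and `repo.name` fields, which is the layout for events from 2015 onward. The function names `archive_url`, `count_events`, and `fetch_hour` are made up for this example.

```python
import gzip
import json
import urllib.request
from collections import Counter
from datetime import datetime


def archive_url(hour: datetime) -> str:
    """Build the URL of one hourly dump (assumed naming: hour is not zero-padded)."""
    return f"https://data.gharchive.org/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def count_events(gz_bytes: bytes, repo_name: str) -> Counter:
    """Count events by type for one repository in a gzipped NDJSON dump."""
    counts = Counter()
    for line in gzip.decompress(gz_bytes).splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        # Assumes the post-2015 event layout with a nested "repo" object.
        if event.get("repo", {}).get("name") == repo_name:
            counts[event["type"]] += 1
    return counts


def fetch_hour(hour: datetime) -> bytes:
    """Download one hourly dump; no authentication is needed for public data."""
    with urllib.request.urlopen(archive_url(hour)) as response:
        return response.read()
```

Calling `count_events(fetch_hour(some_hour), "owner/repo")` for each hour of interest would give a simple activity series per repository. Hourly files can be large, so a real pipeline would probably stream them rather than load each one into memory.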
Hey @sgoggins, I agree with @wingr that GHTorrent and GitHub Archive are going to be your best sources of information.
As for keeping the information up to date, I don't have any great advice at the moment, but if this is still a challenge, I could connect you with some folks I know who keep an internal copy of the GHTorrent data set to see how they do it.
How's everything going with regard to access to the data?
We got this worked out!
Hi everyone:
@bkeepers @howderek @wingr especially:
I am writing to give you an overview of the technical progress we are making, and to foreshadow requests we may make in the future for accelerated or privileged access to the GitHub API. There are headings below so you can scan.
BACKGROUND (hopefully we all share this now, for the most part):
Looking at changes over time in GitHub repositories will be essential to the aims of our project: understanding their health and sustainability. We hypothesize the following (and, based on preliminary work, we think there is some likelihood that we are right):
I think these flow as lower level operating tests from our research questions:
TECHNICAL APPROACH:
Here, to some extent, we are looking to Brandon and Rowan to validate that we are not missing any key concepts or attributes of the available resources from GitHub. In particular, if there are limitations in the data archives and torrents we are referencing, those would be good to be aware of.
We are doing our indicator development against GHTorrent and the GitHub Archive.
Once indicators are mature enough to evaluate (estimated 4-6 weeks), we will need more current information to validate with project stakeholders, who will likely recall last week's events better than those from a month or two ago. We think less archival indicators are also going to be more compelling for GitHub users generally. To that end,
Perhaps this is too much for an email and a call is warranted? But I thought I would start here!
Thanks!