chaoss / augur

Python library and web service for Open Source Software Health and Sustainability metrics & data collection. You can find our documentation and new contributor information easily here: https://oss-augur.readthedocs.io/en/main/ and learn more about Augur at our website https://augurlabs.io
MIT License

Technical Progress General Inquiry #4

Closed sgoggins closed 7 years ago

sgoggins commented 7 years ago

Hi everyone:

@bkeepers @howderek @wingr especially:

I am writing to give you an overview of the technical progress we are making, and to foreshadow requests we may make in the future for accelerated or privileged access to the GitHub API. There are headings below so you can scan.

BACKGROUND (we all hopefully share this by now, for the most part):

Looking at changes over time in GitHub repositories will be essential to the aims of our project: understanding their health and sustainability. We hypothesize the following (and, based on preliminary work, we think we are likely to be right):

  1. H1: There is a relationship between derivable indicators of repository activity on GitHub and the type of organization governing the project
  2. H2: There is a relationship between derivable indicators of repository activity on GitHub and performance, as perceived from the perspective of various stakeholders.
  3. H3: Different stakeholders (owners, contributors, users, regulators, etc.) will be influenced by different combinations of indicators.

I think these flow as lower level operating tests from our research questions:

  1. How and to what extent are community health and sustainability indicators identifiable from GitHub open source community data?
  2. What are dominant genres of community based on health and sustainability indicators, and how and to what extent are health and sustainability indicators different between these communities?
  3. How and to what extent are health and sustainability indicators understood by community owners and other stakeholders?
  4. How and to what extent do health and sustainability indicators change over time as communities evolve to include increased membership, new governance structures, and support from foundations?

TECHNICAL APPROACH:

Here, to some extent, we are looking to Brandon and Rowan to validate that we are not missing any key concepts or attributes of the available resources from GitHub. In particular, if there are limitations in the data archives and torrents we are referencing, those would be good to be aware of.

  1. We are doing our indicator development against GHTorrent and the GitHub Archive.

    1. Since data about deleted repositories and users may play a role in our research, it's necessary to use archives of GitHub data rather than rely on the timestamp information returned by GitHub API requests.
    2. From our initial exploration, it appears that two projects will meet our needs: GHTorrent and GitHub Archive.
    3. GHTorrent provides a SQL database of metadata created from the events stream, and GitHub Archive archives those events themselves.
    4. There is a lot of overlap between the datasets, but both are needed. A fast interface to the data is needed, such as the SQL database that is populated by GHTorrent.
  2. Once indicators are mature enough to evaluate (estimated 4-6 weeks), we will need more current information to validate with project stakeholders, who will likely recall last week's events better than those from a month or two ago. We think more current (less archival) indicators will also be more compelling for GitHub users generally. To that end,

    1. The data we use will need to become quite “up to date”. What is the best strategy?
      1. Daily dumps provided by the GitHub Archive to fill in the gaps between the SQL backups provided by GHTorrent and the realtime data provided by the GitHub API?
      2. Privileged API Access?
      3. Both?
      4. Other?
    2. Ideally, we would like to demonstrate indicators and provide an indicator exploration site with the hope of prototyping a system that could be used to gain wider evaluation of the indicators (from GitHub’s ecosystem).
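As a rough illustration of the first option above (filling the gap between the last SQL backup and the present with archive dumps), the sketch below lists which hourly archive files would need to be fetched. It is only a sketch under stated assumptions: the URL pattern is the one GH Archive publishes (for example `https://data.gharchive.org/2015-01-01-15.json.gz`, with an unpadded hour), and `hourly_archive_urls`, `snapshot_end`, and `now` are hypothetical names, not part of GHTorrent or GitHub Archive. The current, still incomplete hour is left out on the assumption that the GitHub API would cover it.

```python
from datetime import datetime, timedelta, timezone

# Assumption: hourly dumps follow the pattern published by GH Archive,
# e.g. https://data.gharchive.org/2015-01-01-15.json.gz (hour is not zero-padded).
ARCHIVE_URL = "https://data.gharchive.org/{day}-{hour}.json.gz"


def hourly_archive_urls(snapshot_end, now):
    """List archive URLs covering the gap between a SQL snapshot and now.

    Starts at the hour containing snapshot_end and stops before the hour
    containing now, since that hour's dump is not complete yet.
    """
    current = snapshot_end.replace(minute=0, second=0, microsecond=0)
    last = now.replace(minute=0, second=0, microsecond=0)
    urls = []
    while current < last:
        urls.append(
            ARCHIVE_URL.format(day=current.strftime("%Y-%m-%d"), hour=current.hour)
        )
        current += timedelta(hours=1)
    return urls


if __name__ == "__main__":
    # Hypothetical timestamps, chosen only to show a gap that crosses midnight.
    snapshot_end = datetime(2017, 1, 31, 22, 30, tzinfo=timezone.utc)
    now = datetime(2017, 2, 1, 1, 15, tzinfo=timezone.utc)
    for url in hourly_archive_urls(snapshot_end, now):
        print(url)
```

Keeping the list-building separate from any downloading makes it easy to check the gap logic before touching the network.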

Perhaps this is too much for an email and a call is warranted? But I thought I would start here!

Thanks!

sgoggins commented 7 years ago

Hey @bkeepers and @wingr : A few thoughts on our approach here would be most welcome. :)

GeorgLink commented 7 years ago

How and to what extent are health and sustainability indicators understood by community owners and other stakeholders?

Just a note: The discussion on developing and understanding the indicators takes place in the HealthIndicators repository.

bkeepers commented 7 years ago

Hey @sgoggins, sorry for the delay here. I just wanted to give you a heads up that I'm at FOSDEM right now and it'll be a few more days before I get a chance to reply.

wingr commented 7 years ago

@sgoggins likewise sorry for the delayed response.

Looking over the technical section, your approach looks sound to me. I also believe that GHTorrent and GitHub Archive are going to be your best sources of information, although @bkeepers knows a little more about the public data sources than I do. I believe that you should be able to get what you need from the GitHub Archive without needing privileged API access, and that it should provide you with enough up-to-date information since it is updated hourly.

There are also a number of scripts and wrapper code that people have created to help pull data from these sources that you can find by Googling.
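For a sense of what such a script might look like, here is a minimal sketch that counts event types for one repository in a single archive dump. It assumes the dump is gzip-compressed, newline-delimited JSON and that each event carries `type` and `repo.name` fields, as in the GitHub Events API format; older dumps may differ. The function name `count_events` and the sample events are hypothetical, made up for illustration only.

```python
import gzip
import io
import json
from collections import Counter


def count_events(gz_bytes, repo_name):
    """Count event types for one repository in a gzipped NDJSON dump.

    Assumption: one JSON event per line, each with a "type" field and a
    "repo" object holding the "owner/name" string under "name".
    """
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            if event.get("repo", {}).get("name") == repo_name:
                counts[event.get("type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Made-up events standing in for a downloaded hourly dump.
    sample_events = [
        {"type": "PushEvent", "repo": {"name": "chaoss/augur"}},
        {"type": "IssuesEvent", "repo": {"name": "chaoss/augur"}},
        {"type": "PushEvent", "repo": {"name": "other/project"}},
        {"type": "PushEvent", "repo": {"name": "chaoss/augur"}},
    ]
    payload = "\n".join(json.dumps(e) for e in sample_events).encode("utf-8")
    print(count_events(gzip.compress(payload), "chaoss/augur"))
```

Because the function takes bytes rather than a URL, it can be exercised on in-memory samples like the one above before being pointed at real downloads.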

A team I work closely with is in the process of trying to get better documentation around using these public data sets, so ping me with questions or challenges that you encounter and I will pass them along and try to help where I can.

bkeepers commented 7 years ago

Hey @sgoggins, I agree with @wingr that GHTorrent and GitHub Archive are going to be your best sources of information.

As for keeping the information up to date, I don't have any great advice at the moment, but if this is still a challenge, I could connect you with some folks I know who keep an internal copy of the GHTorrent data set to see how they do it.

How's everything going with regard to access to the data?

howderek commented 7 years ago

We got this worked out!