bssw-psip / reposcanner

A compact repository data mining toolkit

Complete initial pilot data collection: Phase 1 #6

Open bhsims opened 4 years ago

bhsims commented 4 years ago

Execute RepoScanner with the latest repo list and review the results.

elaineraybourn commented 4 years ago

Data collection will be considered in phases. The first pilot phase explores "contributor" activity in ECP repos. Data are scraped from GitHub (and, when available, Bitbucket and GitLab). The definitions guiding the Phase 1 (Tier 1) data collection are posted in Issue #7 and are included in this comment, below the proposed pilot data collection steps:

  1. Determine (number of) unique contributors to ECP repos (and by definition ECP project, see below)
  2. Determine (number of) unique contributors who have contributed to 2 or more ECP projects
  3. Determine the rank order (greatest to smallest) of contributor network rankings
  4. Identify the repos with the highest cross-project contributor network rankings
  5. Identify the repos with the lowest cross-project contributor network rankings
  6. ...
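
Steps 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not RepoScanner's actual implementation: the `(contributor_id, project, repo)` record layout, the sample data, and both function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical commit records: (contributor_id, project, repo).
# The layout and the values are illustrative, not real ECP data.
commits = [
    ("alice", "ProjectA", "repo-a1"),
    ("alice", "ProjectB", "repo-b1"),
    ("bob", "ProjectA", "repo-a1"),
    ("bob", "ProjectA", "repo-a2"),
    ("carol", "ProjectB", "repo-b1"),
]


def projects_by_contributor(records):
    """Map each contributor ID to the set of projects they committed to."""
    projects = defaultdict(set)
    for contributor, project, _repo in records:
        projects[contributor].add(project)
    return dict(projects)


def cross_project_contributors(records, minimum=2):
    """Return contributors with commits in at least `minimum` projects."""
    return {
        contributor
        for contributor, projects in projects_by_contributor(records).items()
        if len(projects) >= minimum
    }


# Step 1: number of unique contributors.
unique_count = len(projects_by_contributor(commits))

# Step 2: contributors who have contributed to 2 or more projects.
cross_project = cross_project_contributors(commits)
```

Keeping a set of projects per contributor means a single commit to any repo in a project is enough to count as a contribution to that project, which matches the definition proposed in this comment.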

Definitions

For "author" we use "contributor" in our research study document. In developing our contributor or "author" classification scheme for data analysis, we need to be clear about the scope of the analysis and how we are defining terms. I propose the following definitions to guide data analysis:

Phase 1 (Tier 1 -- lowest level of analysis)

Repo is defined as a GitHub repository. For the purposes of this phase, we are interested primarily in repos that are associated with ECP projects.

Commit is defined as a save (of the current state, or snapshot) of the repository.

Contributor is defined as a unique user ID with 1 or more commits to 1 or more repos attributed to an ECP project.

Contributor ranking is defined as the number of commits. The greater the number of commits, the greater the ranking of the contributor.

Contribution is defined as a commit generated by a human, and potentially, a contribution by a bot created by a human.

Cross-repo contribution is defined as one or more commits by a unique contributor to two or more different repos.

Cross-project contribution is defined as one or more commits to one or more ECP project repos. A project is defined as a formal collection of repos (e.g. ADTM, ALPINE). Only one commit to any ECP project repo is necessary to be considered a contribution to the ECP project, if multiple repos exist in the project.

Contributor network ranking is defined as the number of commits in repos that are attributed to a number of different ECP project repos. The greater the number of commits and the greater the number of ECP project repos, the higher the ranking of the individual contributor.
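
The two ranking definitions could be sketched as below. This is only an illustration under stated assumptions: the record layout, sample data, and function names are hypothetical, and because the definition does not say how commit count and repo count should be combined, the sketch assumes repo count is compared first and commit count breaks ties.

```python
from collections import Counter, defaultdict

# Hypothetical commit records: (contributor_id, project, repo).
commits = [
    ("alice", "ProjectA", "repo-a1"),
    ("alice", "ProjectA", "repo-a1"),
    ("alice", "ProjectB", "repo-b1"),
    ("bob", "ProjectA", "repo-a1"),
    ("bob", "ProjectA", "repo-a1"),
    ("bob", "ProjectA", "repo-a1"),
    ("bob", "ProjectA", "repo-a1"),
    ("carol", "ProjectB", "repo-b1"),
]


def contributor_ranking(records):
    """Rank contributors by commit count, highest first (name breaks ties)."""
    counts = Counter(contributor for contributor, _project, _repo in records)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def contributor_network_ranking(records):
    """Rank contributors by distinct repos, then by commit count.

    ASSUMPTION: the ordering rule (repos first, commits second) is a
    choice made for this sketch; the definition leaves it open.
    """
    commit_counts = Counter()
    repos = defaultdict(set)
    for contributor, project, repo in records:
        commit_counts[contributor] += 1
        repos[contributor].add((project, repo))
    return sorted(
        commit_counts,
        key=lambda c: (-len(repos[c]), -commit_counts[c], c),
    )


by_commits = contributor_ranking(commits)
by_network = contributor_network_ranking(commits)
```

In this sample the two rankings disagree: `bob` leads on commit count alone, while `alice` leads the network ranking because her commits span two repos. That difference is the reason the ordering rule would need to be pinned down before analysis.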