WeberLab-UW / ProjectTracking


EAGER Software Sustainability Work Refresher #141

Open evamaxfield opened 1 month ago

evamaxfield commented 1 month ago

EAGER Software Sustainability

Original Goals

IIRC, there were a few primary goals for this work:

  1. To understand timelines of development surrounding project funding, as they relate to software
  2. To understand researchers' reasons for (or against) maintaining software after funding has concluded
  3. To gather a corpus of funding-linked software projects to use in future work on understanding software funding and sustainability / utilization at a larger scale

I am sure we can extend this list, but these are the ones that are clearest to me.

Things We Have Done

  1. Created and sent out a survey about code creation to NSF PIs with awards in the last 10 years

    • The survey mostly focused on "code" vs. "no code" creation at a basic level. Additional questions were in place to collect information about the "availability" of the code (public or private), "direct links" to the code (e.g. to GitHub), "maintenance and continued development" of the code, and "reasons for maintaining or not maintaining" the code after funding had concluded.
    • We received back ~5,200 NO-CODE email responses and ~1,142 CODE app responses over the three waves of send-outs.
    • In the final wave of email send-outs, we received a ~4.49% response rate.
  2. We conducted both quantitative and qualitative analyses of the survey responses:

    • Working towards an understanding of "timelines of development", we broke out the projects that had filled out the web app survey and provided valid GitHub repository links into six categories of development timelines (a sketch of the bucketing logic follows the list):
      • "development occurred strictly before funding had started"
      • "development occurred strictly after funding had ended"
      • "development occurred before and during the funding period"
      • "development occurred during and after the funding period"
      • "development occurred only during the funding period"
      • "development occurred before, during, and after the funding period"

*(attached image)*

Major Takeaways So Far

1. To understand timelines of development surrounding project funding, as they relate to software.

I think we have the basics of this answered. We have broken projects out into the six possible timelines of development. In short: software development during the funding period is most common, and "during and after" and "before, during, and after" seem to be the largest categories outside of just "during". So projects are typically started during a grant funding period and frequently continue afterwards.

There was always a big caveat here that this didn't feel like a full enough answer.

2. To understand researchers' reasons for (or against) maintaining software after funding has concluded.

We only partially looked into this, not going beyond counting the number of responses for different reasons at a surface level. I still believe we should try to frame our work around the "four quadrants of software sustainability", but we might also want to look at the reasons that are most frequently associated with each quadrant.

3. To gather a corpus of funding-linked software projects to use in future work on understanding software funding and sustainability / utilization at a larger scale.

We have the basics of this started, as well as a clear workflow for mining an ever larger dataset. One of the main direct contributions the current mining workflow is supposed to give us for a publication / report is, at the very least, a statement of the form "we found that X% of grants have Y publications which link out to software on average", so we can compare "what could be out there" against "what our survey proportions are".
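
As a rough illustration, that headline statistic could be computed like the following, assuming a hypothetical mined table with one row per (grant, publication) pair (the column names here are made up):

```python
import pandas as pd

# Hypothetical shape of the mined dataset.
links = pd.DataFrame({
    "grant_id": ["G1", "G1", "G2", "G3"],
    "publication_id": ["P1", "P2", "P3", "P4"],
    "links_to_software": [True, False, False, True],
})

# Software-linking publications per grant.
per_grant = links.groupby("grant_id")["links_to_software"].sum()

pct_grants = (per_grant > 0).mean() * 100  # "X% of grants..."
avg_pubs = per_grant.mean()                # "...Y publications on average"

print(f"{pct_grants:.1f}% of grants have {avg_pubs:.2f} "
      "software-linking publications on average")
```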

Potential Next Steps

1. To understand timelines of development surrounding project funding, as they relate to software.

  1. More depth: look not only at the development timelines of these repos but also, in cases where development happened before or after the funding period in question, look in the README for references to other funding. Basically: can we detect the factors that correlate with development happening outside of the funding window?
  2. A clearer definition of "before" and "after". Currently all of the counts are done with "was the repo created any time (even an hour) before the grant funding started" and "was the most recent commit any time (even an hour) after the grant funding ended". During my qualitative review I think I settled somewhere around "were there substantive commits six months to a year after the funding ended?" So we need a metric for "substantive commits" (this could be absolute lines changed from weekly git commit logs; GitHub has an API for this) and a firm stance on what counts as "before" and "after"; I think six months is viable. A rough sketch of such a metric follows this list.
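
To make the "substantive commits" idea concrete, here is a sketch using GitHub's weekly code-frequency stats endpoint (the 180-day window and 100-line floor in `substantive_after` are placeholder thresholds, not settled choices):

```python
import datetime as dt
import requests

def weekly_churn(owner: str, repo: str, token: str):
    """Weekly absolute lines changed (additions + |deletions|) for a repo.

    GitHub's stats endpoint returns [week_unix_timestamp, additions,
    deletions] triples; it responds 202 while stats are still being
    computed, in which case callers should retry after a delay.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/stats/code_frequency"
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    if resp.status_code == 202:  # stats not ready yet
        return []
    return [(dt.date.fromtimestamp(week), adds + abs(dels))
            for week, adds, dels in resp.json()]

def substantive_after(churn, funding_end: dt.date,
                      window_days: int = 180, min_lines: int = 100) -> bool:
    """Did at least `min_lines` lines change within `window_days` after the grant ended?"""
    cutoff = funding_end + dt.timedelta(days=window_days)
    total = sum(lines for week, lines in churn
                if funding_end < week <= cutoff)
    return total >= min_lines
```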

2. To understand researchers' reasons for (or against) maintaining software after funding has concluded.

As I said before, I think we stick to the four-quadrants framing and enrich that idea by looking at which reasons are associated with each quadrant, to try to explain the behavior. We could also try to model this via the planned and supported models if we find proxy variables for the "reasons", but this may be better served by qualitative analysis.

3. Publication outline.

While we aren't ready to fully start drafting a publication, I think we do have enough to figure out how we would want to write about this work. We have research questions and the collected data; what analyses can we run with the data we already have that give nice, clear answers to those research questions?

### Tasks
- [x] Clone Repos
- [ ] https://github.com/PugetSoundClinic-PIT/ProjectTracking/issues/142
- [ ] Grant duration binned development timelines relative to grant start
- [ ] Start draft that includes the RQs / Contributions ...
nniiicc commented 1 month ago

https://docs.google.com/document/d/163CXnYWYLasMebIpKdL54YR62UDvlfnZwxMFAPn9qjs/edit

evamaxfield commented 1 month ago

One big thing that we said we would do but haven't done is run CHAOSS metrics on the identified repositories. I believe this would also give us daily snapshots of basic metrics, which would be incredibly useful for understanding timelines.

nniiicc commented 1 month ago

To do