evamaxfield opened 1 month ago
One big thing that we said we would do but haven't done is apply CHAOSS metrics to the identified repositories. I believe this would also give us daily snapshots of basic metrics, which would be incredibly useful for understanding timelines.
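As a rough illustration of the kind of daily snapshot this would give us (real CHAOSS tooling such as Augur or GrimoireLab would do this properly; the commit dates below are hypothetical), a minimal sketch:

```python
from collections import Counter
from datetime import date

def daily_commit_counts(commit_dates):
    """Group commits by calendar day -- a stand-in for one basic
    daily activity snapshot that CHAOSS-style tooling would provide."""
    return Counter(commit_dates)

# Hypothetical commit history for one repository
commits = [date(2023, 5, 1), date(2023, 5, 1), date(2023, 5, 3)]
snapshots = daily_commit_counts(commits)
```

A per-day series like this is the minimum we'd need to place development activity on a timeline relative to a funding period.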
To do
@nniiicc - think about the first contribution, which will be about the survey participants and what they report; RQ2 will analyze different patterns of development in PI repos; RQ3 will place projects into quadrants and describe how patterns of development relate to planned vs. sustained development (or something like that) - maybe something related to intentionality: if you planned but didn't sustain, we may see other features (like documentation, attention metrics, etc.) that supplement our commit-based measurement of sustainability
@evamaxfield (see above)
EAGER Software Sustainability
Original Goals
Iirc, there were a few primary goals for this work:
I am sure we can extend this list but these are the ones that are clearest to me.
Things We Have Done
Create and send out a survey about code creation to NSF PIs with awards in the last 10 years
We conducted both quantitative and qualitative analysis of the survey responses:
We also looked at how well our soft-search model performed against the responses, i.e. if someone said they generally produced code, what were the precision and recall compared to our model's prediction. Iirc our model didn't perform too well (~60% F1), but I don't remember if I subset the "app responses" to only mark responses as true if they provided us with a GitHub repo (one of our largest points of criticism of that model).
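The comparison above can be made concrete with a small helper that treats the survey self-reports as ground truth and scores the model predictions against them (the 0/1 vectors below are toy data, not our actual responses):

```python
def precision_recall_f1(y_true, y_pred):
    """Score binary model predictions against survey self-reports
    (treated as ground truth); inputs are parallel lists of 0/1 flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Toy example: 4 respondents, the model gets 2 of them right
p, r, f = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Re-running this with and without the GitHub-repo subsetting would settle the question of which F1 we actually reported.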
I briefly tried to fit logistic regression models with the response data (e.g.
`planned_to_support_binary = a + b + c`
), but with only ~400 responses answering the "planned to support" question, confidence in the model is generally low and none of the features were significant. The same held for a `did_support_binary` model
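The kind of model described here can be sketched in pure Python. This is only an illustration: the features `a + b + c` are placeholders in the note above, the one-feature toy data below is invented, and a real analysis would use statsmodels to get the coefficient p-values that "significant" refers to.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimal logistic regression fit via stochastic gradient descent.
    A sketch only -- no standard errors or significance tests."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi  # gradient of the log-loss w.r.t. z
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, xi):
    """Classify at the 0.5 probability threshold (z > 0)."""
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 if z > 0 else 0

# Toy separable data standing in for survey features
w, b = fit_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

With ~400 usable responses, wide confidence intervals on these coefficients are exactly what we'd expect.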
but more on that later. Finally, we tried to form a narrative around the four possible quadrants into which a project that produces code could fall.
Major Takeaways So Far
1. To understand timelines of development surrounding project funding, as they relate to software.
I think we have the basics of this answered. We have broken projects out into the six possible timelines of development. Software development during the funding period is the most common, and outside of just "during", the "during and after" and "before, during, and after" categories seem to be the largest. So projects are started during a grant funding period and frequently continue afterwards.
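The timeline categories above come from which funding-relative phases saw commits. A minimal sketch of that classification (the grant dates and commits are hypothetical, and the joined labels here are illustrative shorthand, not necessarily the category names from our analysis):

```python
from datetime import date

def development_timeline(commit_dates, grant_start, grant_end):
    """Flag which phases relative to the funding period saw commits,
    and join the flags into a timeline label like 'during-after'."""
    phases = []
    if any(d < grant_start for d in commit_dates):
        phases.append("before")
    if any(grant_start <= d <= grant_end for d in commit_dates):
        phases.append("during")
    if any(d > grant_end for d in commit_dates):
        phases.append("after")
    return "-".join(phases)

# Hypothetical repo: development starts mid-grant and continues afterwards
label = development_timeline(
    [date(2021, 6, 1), date(2023, 1, 15)],
    grant_start=date(2021, 1, 1),
    grant_end=date(2022, 12, 31),
)
```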
There was always a big note here that this didn't feel like a full enough response.
2. To understand researchers' reasons for (or against) maintaining software after funding has concluded.
We only partially looked into this, beyond counting the number of responses for each reason at a surface level. I still believe we should frame our work around the "four quadrants of software sustainability", but we might want to look at the reasons most frequently associated with each quadrant.
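The quadrant assignment itself is just a 2x2 on two survey booleans. A minimal sketch (the quadrant labels are illustrative shorthand, not settled terminology):

```python
def sustainability_quadrant(planned_to_support, did_support):
    """Place a project into one of the four quadrants based on whether
    support was planned and whether it actually happened."""
    if planned_to_support and did_support:
        return "planned, sustained"
    if planned_to_support:
        return "planned, not sustained"
    if did_support:
        return "not planned, sustained"
    return "not planned, not sustained"
```

Grouping the free-text "reasons" responses by this label would give us the per-quadrant reason counts proposed above.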
3. To gather a corpus of funding-linked software projects to use for future projects on understanding larger software funding and sustainability / utilization.
We have the basics of this started, as well as a clear workflow for mining an ever larger dataset. One of the main direct contributions the current mining workflow is supposed to give us for a publication / report is, at the very least, a statement like "we found that X% of grants have Y publications on average which link out to software", to understand "what could be out there" vs. "what our survey proportions are".
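Computing that headline statement from the mined dataset is a simple aggregation. A sketch, assuming a hypothetical shape where each grant ID maps to its count of software-linked publications (the grant IDs and counts below are invented):

```python
def summarize_grant_software_links(pubs_with_software_per_grant):
    """Return (percentage of grants with at least one software-linked
    publication, average software-linked publications per grant)."""
    n = len(pubs_with_software_per_grant)
    with_software = sum(
        1 for count in pubs_with_software_per_grant.values() if count > 0
    )
    pct = 100.0 * with_software / n
    avg = sum(pubs_with_software_per_grant.values()) / n
    return pct, avg

# Hypothetical mined data: grant ID -> software-linked publication count
pct, avg = summarize_grant_software_links({"NSF-001": 2, "NSF-002": 0, "NSF-003": 1})
```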
Potential Next Steps
1. To understand timelines of development surrounding project funding, as they relate to software.
2. To understand researchers' reasons for (or against) maintaining software after funding has concluded.
As I said before, I think we stick to the four-quadrants framing and enrich it by looking at which reasons are associated with each quadrant to try and explain the behavior. We could also try to model this via the "planned" and "supported" models if we find proxy variables for the reasons, but this may be better served as qualitative analysis.
3. Publication outline.
While we aren't ready to fully start drafting a publication, I think we do have enough to figure out how we would even want to write about this work. We have some research questions and we have the collected data; what analyses can we run with the data we already have that give nice, clear answers to those research questions?