evamaxfield opened 1 month ago
One big thing that we said we would do but haven't done is apply CHAOSS metrics to the identified repositories. I believe this would also give us daily snapshots of basic metrics, which would be incredibly useful for understanding timelines.
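As a rough illustration of the kind of daily snapshot this would give us (real CHAOSS tooling such as Augur or GrimoireLab would do this properly; the commit dates below are hypothetical), a minimal sketch:

```python
from collections import Counter
from datetime import date

def daily_commit_counts(commit_dates):
    """Group commits by calendar day -- a stand-in for one basic
    daily activity snapshot that CHAOSS-style tooling would provide."""
    return Counter(commit_dates)

# Hypothetical commit history for one repository
commits = [date(2023, 5, 1), date(2023, 5, 1), date(2023, 5, 3)]
snapshots = daily_commit_counts(commits)
```

A per-day series like this is the minimum we'd need to place development activity on a timeline relative to a funding period.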
To do
@nniiicc - think about the first contribution, which will be about the survey participants and what they report; RQ2 will analyze different patterns of development in PI repos; RQ3 will place projects into quadrants and describe how patterns of development relate to planned vs. sustained development (or something like that) - maybe something related to intentionality: if you planned but didn't sustain, we may see other features (like documentation, attention metrics, etc.) that supplement our commit-based measurement of sustainability
@evamaxfield (see above)
EAGER Software Sustainability
Original Goals
Iirc, there were a few primary goals for this work:
I am sure we can extend this list but these are the ones that are clearest to me.
Things We Have Done
Create and send out a survey about code creation to NSF PIs with awards in the last 10 years
We conducted both quantitative and qualitative analysis of the survey responses:
We also looked at how well our soft-search model performed against the responses, i.e. if someone said they generally produced code, what were the precision and recall compared to our model's prediction. Iirc our model didn't perform too well (~60% F1), but I don't remember if I subset the "app responses" to only mark responses as true if they provided us with a GitHub repo (one of our largest points of criticism of that model).
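The comparison above can be made concrete with a small helper that treats the survey self-reports as ground truth and scores the model predictions against them (the 0/1 vectors below are toy data, not our actual responses):

```python
def precision_recall_f1(y_true, y_pred):
    """Score binary model predictions against survey self-reports
    (treated as ground truth); inputs are parallel lists of 0/1 flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Toy example: 4 respondents, the model gets 2 of them right
p, r, f = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Re-running this with and without the GitHub-repo subsetting would settle the question of which F1 we actually reported.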
I briefly tried to fit logistic regression models with the response data (e.g.
`planned_to_support_binary = a + b + c`
), but with only ~400 responses answering the "planned to support" question, confidence in the model is generally low and none of the features were significant. The same held for a `did_support_binary` model
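The kind of model described here can be sketched in pure Python. This is only an illustration: the features `a + b + c` are placeholders in the note above, the one-feature toy data below is invented, and a real analysis would use statsmodels to get the coefficient p-values that "significant" refers to.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Minimal logistic regression fit via stochastic gradient descent.
    A sketch only -- no standard errors or significance tests."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi  # gradient of the log-loss w.r.t. z
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, xi):
    """Classify at the 0.5 probability threshold (z > 0)."""
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 if z > 0 else 0

# Toy separable data standing in for survey features
w, b = fit_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

With ~400 usable responses, wide confidence intervals on these coefficients are exactly what we'd expect.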
but more on that later. Finally, we tried to form a narrative around the four possible quadrants into which a project that produces code could fall.
Major Takeaways So Far
1. To understand timelines of development surrounding project funding, as they relate to software.
I think we have the basics of this answered. We have broken projects out into the six possible timelines of development. Software development during the funding period is the most common, and outside of just "during", the "during and after" and "before, during, and after" categories seem to be the largest. So projects are started during a grant funding period and frequently continue afterwards.
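The timeline categories above come from which funding-relative phases saw commits. A minimal sketch of that classification (the grant dates and commits are hypothetical, and the joined labels here are illustrative shorthand, not necessarily the category names from our analysis):

```python
from datetime import date

def development_timeline(commit_dates, grant_start, grant_end):
    """Flag which phases relative to the funding period saw commits,
    and join the flags into a timeline label like 'during-after'."""
    phases = []
    if any(d < grant_start for d in commit_dates):
        phases.append("before")
    if any(grant_start <= d <= grant_end for d in commit_dates):
        phases.append("during")
    if any(d > grant_end for d in commit_dates):
        phases.append("after")
    return "-".join(phases)

# Hypothetical repo: development starts mid-grant and continues afterwards
label = development_timeline(
    [date(2021, 6, 1), date(2023, 1, 15)],
    grant_start=date(2021, 1, 1),
    grant_end=date(2022, 12, 31),
)
```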
There was always a big note here that this didn't feel like a full enough response.
2. To understand researchers' reasons for (or against) maintaining software after funding has concluded.
We only partially looked into this, beyond counting the number of responses for each reason at a surface level. I still believe we should frame our work around the "four quadrants of software sustainability", but we might want to look at the reasons most frequently associated with each quadrant.
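The quadrant assignment itself is just a 2x2 on two survey booleans. A minimal sketch (the quadrant labels are illustrative shorthand, not settled terminology):

```python
def sustainability_quadrant(planned_to_support, did_support):
    """Place a project into one of the four quadrants based on whether
    support was planned and whether it actually happened."""
    if planned_to_support and did_support:
        return "planned, sustained"
    if planned_to_support:
        return "planned, not sustained"
    if did_support:
        return "not planned, sustained"
    return "not planned, not sustained"
```

Grouping the free-text "reasons" responses by this label would give us the per-quadrant reason counts proposed above.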
3. To gather a corpus of funding-linked software projects to use for future projects on understanding larger software funding and sustainability / utilization.
We have the basics of this started, as well as a clear workflow for mining an ever larger dataset. One of the main direct contributions the current mining workflow is supposed to give us for a publication / report is, at the very least, a statement like "we found that X% of grants have Y publications on average which link out to software", to understand "what could be out there" vs. "what our survey proportions are".
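Computing that headline statement from the mined dataset is a simple aggregation. A sketch, assuming a hypothetical shape where each grant ID maps to its count of software-linked publications (the grant IDs and counts below are invented):

```python
def summarize_grant_software_links(pubs_with_software_per_grant):
    """Return (percentage of grants with at least one software-linked
    publication, average software-linked publications per grant)."""
    n = len(pubs_with_software_per_grant)
    with_software = sum(
        1 for count in pubs_with_software_per_grant.values() if count > 0
    )
    pct = 100.0 * with_software / n
    avg = sum(pubs_with_software_per_grant.values()) / n
    return pct, avg

# Hypothetical mined data: grant ID -> software-linked publication count
pct, avg = summarize_grant_software_links({"NSF-001": 2, "NSF-002": 0, "NSF-003": 1})
```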
Potential Next Steps
1. To understand timelines of development surrounding project funding, as they relate to software.
2. To understand researchers' reasons for (or against) maintaining software after funding has concluded.
As I said before, I think we stick to the four-quadrants framing and enrich it by looking at which reasons are associated with each quadrant to try and explain the behavior. We could also try to model this via the "planned" and "supported" models if we find proxy variables for the reasons, but this may be better served as qualitative analysis.
3. Publication outline.
While we aren't ready to fully start drafting a publication, I think we do have enough to figure out how we would even want to write about this work. We have some research questions and we have the collected data; what analyses can we run with the data we already have that give nice, clear answers to those research questions?