jdossgollin opened this issue 11 months ago
Explicitly cite, in the main manuscript text, all datasets and software that are associated with a digital object identifier
I think we need to be either a bit more vague or a bit more specific here about what software we cite. Let's say I use Julia 1.9.4 and I do a Bayesian analysis in Turing. I can cite it. Do I also cite all the other packages I used that weren't essential? Do I cite the packages they depend on? Do I cite my compiler? The papers proving that the algorithms used for addition "work"? (Yes I am being deliberately obnoxious.)
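To make the dependency explosion concrete, here is a rough Julia sketch (nothing from the commitments text, just Pkg's own API) contrasting the packages an analysis imports directly with everything the environment actually resolves:

```julia
# Rough sketch: direct dependencies vs. the full resolved environment.
# Run inside an activated project (i.e. next to its Project.toml).
using Pkg

# Direct dependencies: what the project's Project.toml lists
# (e.g. Turing, plus whatever else the scripts `using`).
direct = sort(collect(keys(Pkg.project().dependencies)))

# The full resolved environment: direct plus transitive dependencies.
resolved = sort([info.name for info in values(Pkg.dependencies())])

println(length(direct), " direct packages vs. ", length(resolved), " resolved packages")
```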
I think the right approach is similar to what we have later:
Of course, “major” is subjective, which is why we aim to foster a culture of openness and reciprocity.
Thanks. Great points and thought-provoking.
The intention is for the text to address situations where posting all of the data you used would be redundant; I think reanalysis data falls into this category. I could clarify this by adding another sentence after "already available in a permanent archive, provide code for downloading raw data from the URL or API endpoint and share all output data for checking code correctness." It would be helpful to point out that some servers are actively maintained by data providers and host widely used datasets. Still, if the data does not have a persistent and unique endpoint, the workflow could stop working at some point, so when the data is retired it should be deposited in a single persistent and unique repository (again, to avoid redundancy). In reviewing articles, you are usually pointed to a home page (with no other identifying information for the particular dataset) and there is no code for downloading the data. So maybe I can add a few more details about providing data-download scripts, along the lines of the sketch below, and about being clear in data availability statements why the raw data is not hosted in a persistent and unique repository.
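For what "code for downloading raw data" could look like in practice, here is a minimal Julia sketch; the URL, file name, and directory layout are placeholders for illustration, not from any real data provider:

```julia
# Hypothetical example: fetch a raw dataset from a (made-up) endpoint and
# record a checksum so others can verify they retrieved the same file.
using Downloads   # standard library since Julia 1.6
using SHA         # standard library, provides sha256

# Placeholder URL and paths, not a real data provider.
url  = "https://example.org/api/v1/reanalysis/tasmax_2020.nc"
dest = joinpath("data", "raw", "tasmax_2020.nc")

mkpath(dirname(dest))
Downloads.download(url, dest)

# Store the checksum alongside the data so the raw inputs remain checkable
# even though they are not deposited in a repository.
checksum = open(dest) do io
    bytes2hex(sha256(io))
end
write(dest * ".sha256", checksum * "\n")
println("Downloaded $dest (sha256: $checksum)")
```

Recording a checksum next to the downloaded file gives readers a way to confirm they retrieved the same raw data, which supports the "checking code correctness" part of the commitment.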
I think your use-case falls into a combo of the 3b and 3c commitments. Commitment 3 is supposed to be about raw data. I realize now that the data commitments are silent on issues regarding output data from restricted/non-public raw data. That should be amended.
You're right that it's too vague. The moral principle here is something like "cite others as you would like to be cited." As far as I know, there is not really a tradition of research papers citing every dependency, compiler, and so on, but there is a tradition of citing the "core" packages an analysis rests on. "Core" is subjective, and that is where the moral principle comes into play. If I'm using a software package that was released with an accompanying manuscript (in Environmental Modelling & Software or the Journal of Open Source Software, for example), those authors almost certainly expect to be cited by users. I can add a bit more to the commitments about this.
All great points, and your instinct to specify principles rather than trying to spell out every use case is 100% on point.
Here is my attempt to lean on principles rather than anticipate every use case, plus your points about citing code in a manuscript: https://github.com/abpoll/climsci_commit/commit/cf5db3b620b3889d05886102550dae6da6092cf2