General comments - GitHub Issues

jdossgollin commented 11 months ago

Commitment 1: I'm not entirely convinced of the value of sharing several hundred gigabytes of reanalysis data when it can be readily downloaded from something like the ECMWF servers. Committing to share all input data under a terabyte seems like a way to make everyone's repository really big for no reason. I would love to soften the language there a little bit. When collecting the raw input data is really challenging or buggy, sharing it makes sense, but the vast majority of my projects just use some scripts to pull data from, say, ERA5, and I don't think they would be easier to work with if I gave someone a 10 GB Zenodo DOI to download. If there is a really compelling rationale that I'm not aware of, I'm happy to change my mind.
Commitment 3: this is a specific issue, perhaps not generally relevant, but one place where I'm thinking about this is our project on IDF curves for the state of Texas. We have been given a series of annual maxima, which includes data from some stations that are not publicly available. We will, of course, share all code to reproduce the analysis on the set of public stations. However, I think it would be useful to also share, in a paper, the final estimates that the state may disseminate.

jdossgollin commented 11 months ago

Explicitly cite, in the main manuscript text, all datasets and software that is associated with a digital object identifier

I think we need to be either a bit more vague or a bit more specific here about what software we cite. Let's say I use Julia 1.9.4 and I do a Bayesian analysis in Turing. I can cite it. Do I also cite all the other packages I used that weren't essential? Do I cite the packages they depend on? Do I cite my compiler? The papers proving that the algorithms used for addition "work"? (Yes, I am being deliberately obnoxious.)

I think the right approach is similar to what we have later:

Of course, “major” is subjective, which is why we aim to foster a culture of openness and reciprocity.

abpoll commented 11 months ago

Thanks. Great points, and thought-provoking.

The intention is for the text to address situations where it is redundant to post all of the data you used. I think reanalysis data falls into this category. I could try to clarify this by adding another sentence after "already available in a permanent archive, provide code for downloading raw data from the URL or API endpoint and share all output data for checking code correctness." It could be helpful to point out that some servers are actively maintained by data providers and that the datasets are widely used. Still, if the data does not have a persistent and unique endpoint, the workflow could stop working at some point. So, when the data is retired, it should be deposited in a single persistent and unique repository (again, to avoid redundancy). When reviewing articles, you are usually pointed to a home page (without other identifying information for a particular dataset), and there is no code for downloading the data. So, maybe I can add a few more details about providing data download scripts and about explaining, in data availability statements, why raw data is not hosted at a persistent and unique repository.
I think your use-case falls into a combo of the 3b and 3c commitments. Commitment 3 is supposed to be about raw data. I realize now that the data commitments are silent on issues regarding output data from restricted/non-public raw data. That should be amended.
You're right that it's too vague. The moral principle here is something like "cite others as you would like to be cited." As far as I know, there is not really a tradition of research papers citing the dependencies, compilers, etc. behind the software they use. But there is a tradition of citing the "core" packages that your analysis rests on. Which packages count as core can be subjective, and this is where the moral principle comes into play. If I'm using a software package that was released with an accompanying manuscript (in a venue such as Environmental Modelling & Software or the Journal of Open Source Software), those authors almost certainly expect to be cited by users. I can add a bit more to the commitments about this.
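As an illustration of the "code for downloading raw data from the URL or API endpoint" idea discussed above, here is a minimal Python sketch. It is not part of the commitments text. The function names (`fetch_raw_data`, `sha256_of`) and the manifest layout are assumptions made up for this example, and a real ERA5 workflow would likely go through the data provider's own client rather than a plain URL.

```python
import hashlib
import json
import shutil
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_raw_data(url: str, dest: Path, expected_sha256: Optional[str] = None) -> dict:
    """Download `url` to `dest` (skipped if the file already exists),
    optionally check its checksum, and write a small provenance manifest.

    Hypothetical helper: the name, arguments, and manifest fields are
    illustrative assumptions, not an established convention.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        with urllib.request.urlopen(url) as response, dest.open("wb") as out:
            shutil.copyfileobj(response, out)

    actual = sha256_of(dest)
    if expected_sha256 is not None and actual != expected_sha256:
        raise ValueError(
            f"Checksum mismatch for {dest.name}: the upstream file may have changed"
        )

    # Record where the file came from and when, so a data availability
    # statement can point to something more specific than a home page.
    record = {
        "url": url,
        "file": dest.name,
        "sha256": actual,
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
    }
    manifest = dest.with_name(dest.name + ".manifest.json")
    manifest.write_text(json.dumps(record, indent=2))
    return record
```

Committing the script and the small manifest, rather than the data itself, keeps the repository light while still letting a reader detect when the upstream file has changed or disappeared.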

jdossgollin commented 11 months ago

All great points, and your instinct to specify principles rather than try to spell out every use case is 100% on point.

abpoll commented 11 months ago

Here is my attempt to go more for principles than anticipating every use case (plus your points about citing code in a manuscript): https://github.com/abpoll/climsci_commit/commit/cf5db3b620b3889d05886102550dae6da6092cf2

abpoll / climsci_commit

General comments #1