CICE-Consortium / CICE

Development repository for the CICE sea-ice model

Namelist citation and recording #403

Open proteanplanet opened 4 years ago

proteanplanet commented 4 years ago

CICE needs a way of recording namelist settings used by our users so that groups can easily cite their model configuration in publications and presentations, and scientific transparency is maintained.

It is common to read in manuscripts that CICE was used, without further explanation of what that means. CICE is a modeling framework, not one particular model: it is possible to run several completely different computer models within the CICE framework while still calling the model "CICE". For example, one configuration might use EAP dynamics, mushy-layer thermodynamics, and Delta-Eddington radiation with five thickness categories, while another uses EVP dynamics, zero-layer thermodynamics, and two thickness categories. As a consequence, it is essential that users publish precisely which configuration was employed, and that can be expressed completely in terms of the code version, the namelist configuration, and any subsequent code changes.

To assist in this, I propose that we work on a way to write the namelist and code version to a special set of GitHub branches, thus assigning a unique hash to each namelist. An added 'nice' feature could be to also generate a QR code that users could include in talks and conference posters, pointing to the git hash on GitHub and thus providing instant recall of a model's configuration. Writing the namelist hash, which would in fact be part of the log and would include the precise code version, could be triggered by a namelist argument such as 'runtype' being set to 'production', as distinct from, for example, 'testrun'.
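As a rough, untested illustration of the mechanics (the hashing scheme, archive URL, and file names here are all hypothetical):

```python
# Sketch of a "production run" recording step: hash the namelist plus
# code version, and emit a QR code pointing at a hypothetical archive
# location for that hash.
import hashlib
import qrcode  # pip install "qrcode[pil]"

def record_configuration(namelist_path, code_version):
    with open(namelist_path, "rb") as f:
        digest = hashlib.sha256(f.read() + code_version.encode()).hexdigest()
    # Hypothetical archive URL scheme; the branch/tree layout would
    # need to be designed.
    url = f"https://github.com/CICE-Consortium/configurations/tree/{digest}"
    qrcode.make(url).save(f"cice_config_{digest[:12]}.png")
    return digest
```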

Please provide feedback on the possible advantages and problems of this idea, and on other ways you may have thought of to improve documentation of CICE use and thereby reproducibility.

apcraig commented 4 years ago

I'm not convinced this fits within the Consortium's role. I believe it's up to the users to clearly document their setups. We are able to provide some information and history on code versions, but there are an infinite number of ways users could then change the code, input data, and/or namelist to set up a specific test case. Journals now require that code and other input be documented and available.

The only way I could see the Consortium helping is to provide a place where users could "drop" their code, input data, and namelist for permanent storage. But I think we would play little role except to provide that space, and whether the community uses it is up to them. My preference is that the Consortium not take on this role. Zenodo and other sites already exist, and many authors publish journal data on their own websites or via their own ftp sites. I'm open to discussing further, but I'm just not sure how this would work in practice.

proteanplanet commented 4 years ago

The reason why this is directly relevant to us is exemplified in this paper:

https://search.proquest.com/docview/2366511405?pq-origsite=gscholar

The question is whether we make it our responsibility to record configurations, or instead make every effort, short of a specific technical solution, to promote all CICE-related software as a modeling framework rather than a single model.

duvivier commented 4 years ago

@proteanplanet The url doesn't work. Can you provide a doi?

proteanplanet commented 4 years ago

Try this: https://doi.org/10.1080/16742834.2020.1712186

apcraig commented 4 years ago

I fully understand why this is important, but I still don't have a good sense of how the Consortium can do anything more than the tools and requirements already out there that encourage users to share source code, input data, namelists, and output. If users are not already sharing their data, I don't see how a space on the Consortium site changes that. We have very little control over how the community uses the code and what it does with it, and I understand the problems that introduces.

I agree that providing a place for users to drop their data might improve the situation. We could also change the release policy to say that any journal article using CICE MUST deposit its data on the Consortium site, though it's hard to believe we could get away with that or enforce it. I suppose we could also try to chase down authors and ask them to drop their data onto our site after the fact.

Again, I'm open to having a space like this on the Consortium site, but I'm not convinced it's going to change behavior, and putting something in place and supporting it would require some effort. It sounds like the plan might be to host the code, input data, namelist, and output? I don't think that's possible through GitHub; it's probably too big. We could certainly provide a space for users to link to their journal article as well as their code, input data, namelist, and output, and we could even host some of it (like code and namelist) if we wanted to make the effort.

This is all certainly possible, and maybe it would be good to try to aggregate journal articles as well as the codes and data used or created. We'd just have to figure out how to do it. I don't think we can support GitHub branches; I think we need something that looks more like how we serve input data to the community, basically a table with links to the journal article and related data.

proteanplanet commented 4 years ago

What I am proposing has already been done in other ESM codes. It is a way of aiding benchmarking. Given the response, I am closing the discussion and moving on.

apcraig commented 4 years ago

Let's continue to discuss; I am just trying to understand how this would work. Can you point me to examples in other ESM codes?

aidanheerdegen commented 4 years ago

In the COSIMA group we use a locally developed model run tool, payu. The model run directory containing all the configuration data, including namelists, is a git repository that also contains manifests of all executables and input data files, and it is committed to git with every model run.
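In spirit, the per-run recording amounts to something like the following simplified sketch (not payu's actual code; it assumes git is on the PATH and the run directory is already a repository):

```python
# Sketch: stage and commit everything in the run directory after a
# model run, so each run's exact configuration gets its own git hash.
import subprocess

def commit_run_config(run_dir, run_id):
    subprocess.run(["git", "-C", run_dir, "add", "--all"], check=True)
    subprocess.run(
        ["git", "-C", run_dir, "commit", "-m", f"run {run_id} configuration"],
        check=True,
    )
```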

The connection between model code and any one individual run configuration is less strong. The unique hash of the executable is known, and the git hash of the code commit is stored in the executable.

For publication purposes the model codes and configuration are published as a single monolithic git repository with code and configurations as submodules. The individual model configurations are still available individually.

Hope that helps.

aekiss commented 4 years ago

Further to @aidanheerdegen's post (and somewhat off-topic), note that in COSIMA the only things we directly store in git repositories are text files (source code, configuration files, and run manifests). We don't store anything binary (e.g. NetCDF forcing files) in git, as standard git is not well-suited to this task because it doesn't do delta-compression on binaries, so the repository quickly becomes infeasibly large.

There are a couple of ways I know of to handle large binary files with git:

  1. git-lfs (Large File Storage) - supported by GitHub, but with impractically small bandwidth limits on the free plan and some bad press
  2. git-annex - but this still requires storage for the actual files somewhere else.

Instead we store our binary files (executables, and input and restart NetCDFs) on our local HPC system (NCI) and use git-tracked manifests to uniquely identify what was used. This works well for our purposes, but it means that the published git repository is directly usable only by users who can get an account on that HPC system.
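A stripped-down illustration of the manifest idea (not payu's actual implementation) could be as simple as hashing each binary input into a text file that gets committed to git:

```python
# Sketch: record path and SHA256 for every NetCDF file so the exact
# inputs can be identified later, even though the binaries themselves
# live outside git.
import hashlib
from pathlib import Path

def write_manifest(input_dir, manifest_path="manifest.sha256"):
    lines = []
    for path in sorted(Path(input_dir).rglob("*.nc")):
        h = hashlib.sha256()
        with path.open("rb") as f:
            # Hash in chunks so multi-GB files don't fill memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        lines.append(f"{h.hexdigest()}  {path}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")
```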

dabail10 commented 4 years ago

Here is the ES-DOC link:

https://explore.es-doc.org/cmip6/models/ncar/cesm2

eclare108213 commented 4 years ago

@dabail10 I'd like to reorganize the diagnostic output based on the ES-DOC request. The link above shows the answers for CESM2, but it looks like many of them are multiple-choice. Can you please send me the sea ice spreadsheet (filled out or not), so I can see all of the options? Thx

dabail10 commented 4 years ago

I'm not sure if the original exists. I'll try to find it.

aekiss commented 4 years ago

fyi (apologies if irrelevant) - if you're comparing namelists you might find this tool helpful: https://github.com/aekiss/nmltab
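For a rough sense of what such a comparison involves, here is a minimal two-way diff using the f90nml Python package (an alternative illustration, not how nmltab works internally):

```python
# Sketch: report every variable whose value differs between two
# Fortran namelist files. f90nml parses a namelist into a nested
# dict-like structure of groups and variables.
import f90nml

def nml_diff(path_a, path_b):
    a, b = f90nml.read(path_a), f90nml.read(path_b)
    for group in sorted(set(a) | set(b)):
        ga, gb = a.get(group, {}), b.get(group, {})
        for var in sorted(set(ga) | set(gb)):
            va, vb = ga.get(var), gb.get(var)
            if va != vb:
                print(f"&{group} {var}: {va!r} vs {vb!r}")
```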

eclare108213 commented 4 years ago

I am looking at the diagnostic output prior to the time loop and finding that it's a mess, with numerous duplications and the namelist parameters not grouped logically. I'll attempt to reorganize it while making it more verbose and hopefully easier for users to pull out a model configuration description.

Question: the scripts capture the code version number with what looks like an option to send the info to make, but I don't see that the version number is actually used in the code itself. Is that true, or am I missing something?

apcraig commented 4 years ago

The code version number is currently not included anywhere in the code or namelist, just in documentation and in a file in the source directory. We can certainly add that feature and it sounds like a good idea. If you need any help, let me know.
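For illustration, one hypothetical way to add it would be a small build-time step that generates a Fortran module from the version file (all file and module names below are made up):

```python
# Sketch: read the version string kept in the source tree and generate
# a tiny Fortran module that the model could print at startup and
# write into its log alongside the namelist.
from pathlib import Path

def write_version_module(version_file="cicecore/version.txt",
                         out_file="cicecore/ice_version.F90"):
    version = Path(version_file).read_text().strip()
    Path(out_file).write_text(
        "! generated at build time - do not edit\n"
        "module ice_version\n"
        f"  character(len=*), parameter :: cice_version = '{version}'\n"
        "end module ice_version\n"
    )
```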

Also, any namelist output cleanup could be tied to #456.

duvivier commented 4 years ago

Is the biggest remaining need, other than the #468 mods, to include the version? Otherwise, it looks to me like @eclare108213 has done nearly all of this.

apcraig commented 4 years ago

#468, #456, #441, and #459 are mostly about cleaning up the namelist output and documentation, including trying to more closely match the kind of output/information requested during MIP organization.

This issue has drifted off topic, but I think it is about providing a space and a process for different groups to document their namelist settings, experiment setups, and other things. I don't think we've made any progress or final decisions about whether or how to do that.