Set netcdf global attributes to record origin of all published .nc files

aekiss commented 6 years ago

At present many output files have the same names and their meaning is distinguishable by path alone - eg there is no way to determine the experiment config used to produce a given ocean.nc file if it is moved from its directory. So the file paths are functioning as file metadata and should probably be recorded within the netcdf files themselves, eg as comments in global attributes. This will become increasingly important as we start publishing data on ua8 - eg if users download a bunch of files and forget where they came from.

So I suggest we have a common set of metadata we put in all output netcdf files, including e.g.

some boilerplate - eg "produced by the ACCESS-OM2-01 model as part of the COSIMA project, www.cosima.org"
the config hash and a link/doi to the config directory
the executable hashes
the doi for the dataset
the thredds command / url for getting this particular file
the directory the file belongs to (below output/)

.... anything else you can think of? (@paolap - any suggestions?)

I presume there's a way to do this with nco?

aekiss commented 6 years ago

also

citation info - journal paper doi, request for users to cite
licensing, eg creative commons CC BY or whatever we use
contact info of person responsible

russfiedler commented 6 years ago

In order to facilitate this I'd strongly recommend judicious the use of the h_minfree argument in calls to nf90_enddef/nc_enddef when creating large datasets. This will allow the later modification/addition of attributes without the need to make intermediate copies. nco can add this padding with -header_pad if it doesn't exist.

aekiss commented 6 years ago

Thanks @russfiedler, that's a good idea.

aekiss commented 6 years ago

Further thoughts:

storing payu runlog hashes allows the exact run that generated a file to be pinpointed
git hashes should also include a link to the repo they belong to, eg a github url for that repo and hash
storing the file's path (including filename) as file metadata allows files to be programmatically put back in the same directory structure they came from (or rebuild the directory structure from scratch), even if they are moved or renamed.

aekiss commented 6 years ago

...and passing on a comment Erik made a while ago: I know that e.g. Geoscientific Model Development does not allow github repositories, as they can’t be guaranteed to be online indefinitely. The consensus now seems to use zenodo; which is funded by CERN and the EU. There’s nice github integration via a webhook initiated when you create a release on github. Then, zenodo generates a DOI that is ‘guaranteed’ to be available for perpetuity ... so it may be best to create and use only dois (not urls) in metadata

nichannah commented 6 years ago

really good ideas here. I think the same goes for the inputs to our models. We often have conversations that include a query about where things originated.

aidanheerdegen commented 6 years ago

Definitely worth doing.

Here is a manifesto I wrote (3 YEARS AGO ... and progress has been way too slow) to try and crystallise thoughts around this. I realised I haven't put up github issues on payu to address these points, but I will attempt to that, so others know what steps are planned and can collaborate to achieve them.

Forensic Experiment Tracking, a Manifesto

Goal: uniquely identify every experiment. Retain history of changes to input parameters and files, and executable.

An experiment is defined as a directory, containing all the necessary components to run the model. This does not necessarily mean all inputs must be physically located in the experiment directory, but the means to find those inputs must be defined by an input file located in the directory.

All components that are essential to running the experiment (inputs) must themselves be uniquely identified. This will probably take the form of a hash function.

The executable used in the experiment must also have a unique identifier. Tracking changes to the executable is a separate task, but using the same principles as are outlined in this document, and considering the compilation of the executable as an experiment, means the state of the executable can also be similarly forensically tracked, and it's state identified by reference to it's unique ID.

The proposal is to use existing distributed revision control (DRC) software for the task of tracking changes in the experimental directory. These changes represent a comprehensive history of the experiment, as they capture changes to all the inputs, the executable and the status of each invocation, i.e. if the executable ran without error.

Modern DRC systems generates a unique identifier for the state of the text files they monitor. To be compliant with the principals of forensic experiment tracking all files necessary for the experiment to run must be monitored by the DRC system. Any file that is too large to be directly tracked must have it's ID stored in a supplementary file that can be added to the DRC system.

At it's most basic, starting an experiment can consist of creating a directory and copying into it the various input and configuration files that are required. It may be tools can be used to collect the necessary input files, and create the configuration files, but it is not a necessary requirement. What is required is that the software (usually a script of some sort) that runs the experiment performs the following steps:

Check input data have correct IDs. If not the input data with the correct ID must be found (and used), or the execution must stop and flag an error, or the ID in the configuration file must be changed to match the input data. Ideally this would be done before the main executable is run. If the input files are large and calculating the ID is prohibitively time consuming this step can be run in parallel with the main executable but working in this mode allows for no pre-checks and sourcing correct input data files in the case of an ID mismatch. Thus this mode only allows for the third option, which is changing the experimental configuration.
Once the experiment has finished the state, that is the success or otherwise of the experimental run must be logged in a file that is under the control of the DRC system.
The DRC system is invoked to save the state of the experiment, that is save the state of all the configuration files and history files that are "watched".
Highly recommended (but optional) another program is invoked to save the state of this experiment in a database that contains information about all experiments, allowing searches for common experiments, options, inputs. It makes it easier to identify which experiments might have erroneous inputs, or used versions of source code with bugs. The database is not necessary to run experiments, the experimental state is stored in git, but it ties experiments together. Potential collaborators could search each others database to find what models they are running. They could even fork a successful model and start using it themselves. A database can also be used to curate data. On entry a use-by date can be automatically generated. This does not guarantee deletion at that date, but allows for sequential deletion of data when storage becomes limited.

Finally, a unique identifier (DOI?) could be generated at point-of-publish for the datasets used in the publication. We could also generate unique IDs for the output data as this time consuming step would only need to be done once. The use-by date could be automatically incremented for important datasets like this. This could satisfy the ARC requirement to keep data and make it available. If so it could also satisfy Andy Pittman's request for a one push publish solution.

This has the potential to be VERY cool.

aekiss commented 6 years ago

Thanks Aidan, excellent ideas there.

The database set up by the Cosima cookbook might be extended to cover 4?

I think it would also be good to have script-generated human-readable summaries of experiment configs and the differences between experiments (and differences within experiments if parameters change within a run). This is one of the use cases for nmltab (namelists only - e.g. namelist-check.pdf here) but similar things could be done using config.yaml etc.

I'm picturing a juptyer notebook that displays a readably brief report of the differences between selected runs configs, with hyperlinks to more detailed information. A notebook could show not just differences in inputs/config but also in the outputs. We are already doing much of this in the cookbook.

marshallward commented 6 years ago

I have a description tag in config.yaml for GitHub projects, but it could be made more general

On 25 Nov 2017 11:17 am, "Andrew Kiss" notifications@github.com wrote:

Thanks Aidan, excellent ideas there.

The database set up by the Cosima cookbook might be extended to cover 4?

I think it would also be good to have script-generated human-readable summaries of experiment configs and the differences between experiments (and differences within experiments if parameters change within a run). This is one of the use cases for nmltab https://github.com/aekiss/nmltab (namelists only - e.g. namelist-check.pdf here https://github.com/aekiss/namelist-check) but similar things could be done using config.yaml etc.

I'm picturing a juptyer notebook that displays a readably brief report of the differences between selected runs configs, with hyperlinks to more detailed information. A notebook could show not just differences in inputs/config but also in the outputs. We are already doing much of this in the cookbook.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OceansAus/access-om2/issues/57#issuecomment-346908521, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcN63Mf49kOq6aoz7r-jn_MPe410Qecks5s51yMgaJpZM4QpS-z .

aidanheerdegen commented 6 years ago

As referenced above, I have started an issue for the file tracking. I've already started fiddling with some code for this and it had definitely bubbled to the top of my to-do list so was planning on working on this in the next few weeks

https://github.com/marshallward/payu/issues/90

aidanheerdegen commented 6 years ago

I think specifics of how to do this should be handled in payu issues, as there are a number of details of implementation, but fine to discuss requirements/goals for COSIMA here.

marshallward commented 6 years ago

Btw I was going to wait to get more info before mentioning this, bit I had a chat with Jon smilie at NCI about how to make the netCDF metadata more friendly for cmip analysis and archival, might be relevant to this discussion?

On 25 Nov 2017 11:55 am, "Aidan Heerdegen" notifications@github.com wrote:

As referenced above, I have started an issue for the file tracking. I've already started fiddling with some code for this and it had definitely bubbled to the top of my to-do list so was planning on working on this in the next few weeks

marshallward/payu#90 https://github.com/marshallward/payu/issues/90

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/OceansAus/access-om2/issues/57#issuecomment-346910183, or mute the thread https://github.com/notifications/unsubscribe-auth/AAcN69OXksyZxaRaSVpAq9rgtrHDp8kUks5s52WHgaJpZM4QpS-z .

paolap commented 6 years ago

Hi,

I don't know all the details of this discussion, I'll make sure to get more info from Aidan. So just quickly

1) this a convention NCI uses on top of CF for higher level metadata as DOI, data source (i.e. model config in your case), etc https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery

Normally we add this level of metadata last.

2) about the names being the same between different configurations. If we're talking about output files I would avoid that, as I said I'm not sure exactly what the intended use of these would be. If you have a more descriptive name as well as distinguish immediately you can more easily do things like aggregate them via thredds. Dependance from a DRS isn't ideal, so at least including the information you can get from the directory in the file is a good step. We had lots of issues in CMIP5 with versions being only available from the DRS.

3) I've planned for a long time to start a zenodo repository for the ARCCSS, as someone pointed out that's currently your best option to publish any kind of code. I'm also part of discussion group which is looking exactly at how to published software, so I can point you to existing conventions.

4) About DOIs, while there's a lot of discussion on how to add a time stamp to a DOI for growing collections, probably the best solution currently is to have a DOI for the entire collection and then DOIs for each simulation. Most of the work would be done for the first parent DOI, I have already done the same for a different collection of simulations using XSL to generate the metadata records from the the collection record in xml format, using another xml file to provide the input for specific simulation info.

Hope this isn't too confusing and actually answer some of your questions

aekiss commented 6 years ago

Moving @PaulSpence's comment here to close https://github.com/OceansAus/access-om2/issues/68 and consolidate discussion here.

I think it is import to include more header metadata in some of the input files:

For example,

ncdump -h access-om2/input/mom_025deg/ocean_temp_salt.res.nc

should include the source of the temp and salt data (e.g. WOA13). This is important legacy info.

aekiss commented 5 years ago

I'm using https://github.com/aekiss/run_summary to harvest a lot of relevant metadata that could be attached to the .nc output files - see /g/data3/hh5/tmp/cosima/access-om2-run-summaries/.

aidanheerdegen commented 4 years ago

spltvar is populating some fields automatically by examining the data

// global attributes:
                :simname = "access-om2" ;
                :time_coverage_start = "195801" ;
                :time_coverage_end = "195901" ;
                :geospatial_lat_min = -77.8766233766234 ;
                :geospatial_lat_max = 89.7744761298823 ;
                :geospatial_lon_min = -279.5 ;
                :geospatial_lon_max = 79.5 ;

aidanheerdegen commented 4 years ago

I've made a repo where I'll put the metadata in the form of yaml files

https://github.com/COSIMA/metadata

and use addmeta to plug it into the data files

https://github.com/coecms/addmeta

An example of the format of the yaml file:


global:
   organisation : "ARC Centre of Excellence for Climate System Science / Climate Change Research Centre"
   Conventions : "CF-1.6, ACDD-1.3"
   license : "http://creativecommons.org/licenses/by-nc-nd/4.0/"```

aekiss commented 4 years ago

For the data associated with the GMD paper I think we should include the "highly recommended" stuff from https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery

title
summary
keywords

as well as these recommended items

creator_url: cosima.org.au
project: COSIMA (Consortium for Ocean-Sea Ice Modelling in Australia)
acknowledgments: ARC linkage grant, and NCI
license (perhaps CC BY 3.0 AU? require attribution to COSIMA and/or citation of GMDD paper?)

In addition it could be good to include

paper DOI https://doi.org/10.5194/gmd-2019-106 (I'm not sure if this will change for the final published version?)
model DOI https://doi.org/10.5281/zenodo.2653246
DOI for the output data (i.e. where this file was from)
filename at this DOI, or thredds query to get it

There's also more detailed data in /g/data3/hh5/tmp/cosima/access-om2-run-summaries/ that could be included (e.g. git hashes of all model components and config) but these can change from run to run (and there can be several runs consolidated into each file) so it's not clear to me how that could be done (or whether it would be worth the effort).

aidanheerdegen commented 4 years ago

@paolap had the following feedback

I’m checking one random file from

/g/data/ua8/cosima-tmp/publish/access-om2/ocean/runoff/runoff_access-om2_210701_210712.nc

Standard_name: all standard name for coordinates are missing. It’s ok if a standard_name doesn’t exist to omit it but definitely these are the more important because they identify dimensions.

time  ( time )
yt_ocean ( latitude )
xt_ocean  ( longitude )

Wrong units:

nv:units = "none"

Whenever a units is dimensionless, meaning that has no physical dimension (i.e. number, ratio, percentage) put ‘1’ There are some cases where there’s also another option which is accepted (like “percent”) it’s kind of easier to be explicit in the long_name about what it is rather than trying to work it out.

These variable have no information at all, needs at least long_name and standard_name (if exists) and units:

       float geolon_t(yt_ocean, xt_ocean) ;
              geolon_t:_FillValue = NaNf ;

Same for geolat

To be aware of The CF checker might complain about variables like this:

average_T2:units = "days since 1958-01-01"

, which is not a time axis but has similar units, if that’s the case and in similar warning just points out that it’s a period not a time axis

average_DT:units = "days"

or whenever a standard_name is missing usually it’s fine to either list them beforehand if you know for sure (which would be tricky in your case) or letting them know based of the issues list that they don’t exists. Either way it’s ok, the checker is not perfect

paolap commented 4 years ago

Hi, I've created a google doc to collect the info above, plus my comments on one of the ice output files, the ACDD convention and license. https://docs.google.com/document/d/1vumYYVjZonxpPKu6U3h7DKui1dG3RWGMh-cLv-8WKC0/edit

aidanheerdegen commented 4 years ago

I've pushed an initial addmeta file for the ocean fixes here:

https://github.com/COSIMA/metadata

Specifically

https://github.com/COSIMA/metadata/blob/master/ocean.yaml

aidanheerdegen commented 4 years ago

I have made a bunch of yaml files to be fed into addmeta in that metadata repo

https://github.com/COSIMA/metadata

Next step is for someone (@aekiss ?) to fill them in and push back to the repo, or do a PR. Whatevs. Let me know if it doesn't make sense.

ping @AndyHoggANU

aidanheerdegen commented 4 years ago

I've started an individual issue on the metadata repo, might be the best way forward for other issues as they arise/debating what to put in various fields etc

https://github.com/COSIMA/metadata/issues/2

COSIMA / access-om2

Set netcdf global attributes to record origin of all published .nc files #57

Forensic Experiment Tracking, a Manifesto