Creating a dataset guidelines outline

paolap commented 3 years ago

We should work out an outline for these guidelines. We could start by listing the main steps of creating a dataset.

paolap commented 2 years ago

Just noticed this was closed with this pull request, but I meant the structure of the "Create a dataset" guidelines rather than of the overall book :-) Sorry for the confusion. I'm opening this again, as we haven't got an outline for that section yet.

hot007 commented 2 years ago

Once the outline is created, @AviRamchurn to review/contribute from a BoM perspective.

hot007 commented 2 years ago

(Claire's example, please add your own)

Steps to create data:

Pick your project: ensure files are created under the correct project; if they need to be moved after creation, ensure that they inherit the correct group and permissions in the destination (see also the ACLs page).
Use an output format that others can read, e.g. netCDF.
Define all relevant metadata fields, both for standards like CF and ACDD and any other metadata that may assist understanding of data creation, reuse and attribution; implement these metadata fields in your post-processing workflow.
Structure your output data into a navigable directory structure: no more than ~1000 files at any directory level, and a meaningfully named directory tree.
Create a README at the top level with information about the dataset: when it was created, who to contact, how to use it, etc., as relevant.
Keep only the data to be shared in the common space; tar and archive ancillary data (model input/config files etc.).
Make a backup of your dataset (particularly at NCI: note the /g/data system IS NOT backed up).
If your data is an analysis product, keep the code near the data or document where it can be accessed (snapshot the version of the code used to create this dataset).
Make a data management plan, and consider how your data is to be shared and/or published.
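
The "~1000 files at any directory level" point in the list above is easy to check automatically. This is only a rough sketch: the function name `crowded_directories` and the `max_files` parameter are made up for illustration, and the default of 1000 is just the approximate figure quoted in the list, not a hard limit.

```python
import os


def crowded_directories(root, max_files=1000):
    """Return (path, file_count) for each directory under root
    that holds more than max_files files at its own level."""
    crowded = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only files directly inside dirpath are counted here;
        # files in subdirectories are counted when os.walk reaches them.
        if len(filenames) > max_files:
            crowded.append((dirpath, len(filenames)))
    return crowded
```

Calling `crowded_directories("/path/to/dataset")` (a placeholder path) returns an empty list when no directory level exceeds the threshold.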

paolap commented 2 years ago

This is my go at this:

1) Planning: a DMP including basic info such as backup, input files, tools used, and the license of what is used and of potential output. Ideally this should be part of project planning, but it might still be worth mentioning here.

2) Structuring file

2 a) Use case 1: completely new file

However rare, we could cover starting from a template, such as a CDL file (i.e. a file in ncdump output style).
Data saved from analysis: start by saving data with a reasonable default format (free, common etc.), plus chunking, compression etc. if netCDF.
Use the least complicated dimensions possible, taking data use into account rather than dumping everything as it is.
Introduce descriptive names for variables early, and conventions where applicable.
Include units if applicable.
At the initial level, add some global attributes or an associated metadata file to keep track of the workflow and describe what's in the file.
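
To make the "start from a template" idea concrete, here is a minimal sketch that writes a small CDL template (the text format that `ncdump` prints and `ncgen` reads). The helper `render_cdl`, the variable `tas` and all attribute values are hypothetical placeholders, and the sketch only handles string-valued attributes.

```python
def render_cdl(name, dimensions, variables, global_attrs):
    """Build a minimal CDL template as a string.

    dimensions:   {dim_name: size}, with None meaning UNLIMITED
    variables:    {var_name: {"type": ..., "dims": [...], "attrs": {...}}}
    global_attrs: {attr_name: value}; all attribute values are written as strings
    """
    def quoted(value):
        # Escape double quotes so the attribute stays a valid CDL string.
        return '"' + str(value).replace('"', '\\"') + '"'

    lines = [f"netcdf {name} {{", "dimensions:"]
    for dim, size in dimensions.items():
        size_text = "UNLIMITED" if size is None else str(size)
        lines.append(f"\t{dim} = {size_text} ;")
    lines.append("variables:")
    for var, spec in variables.items():
        dims = ", ".join(spec.get("dims", []))
        # Scalar variables are declared without parentheses in CDL.
        shape = f"({dims})" if dims else ""
        lines.append(f"\t{spec['type']} {var}{shape} ;")
        for attr, value in spec.get("attrs", {}).items():
            lines.append(f"\t\t{var}:{attr} = {quoted(value)} ;")
    lines.append("")
    lines.append("// global attributes:")
    for attr, value in global_attrs.items():
        lines.append(f"\t\t:{attr} = {quoted(value)} ;")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical example: one variable with descriptive name, units and a title.
cdl_text = render_cdl(
    "example",
    {"time": None, "lat": 180, "lon": 360},
    {
        "tas": {
            "type": "float",
            "dims": ["time", "lat", "lon"],
            "attrs": {
                "standard_name": "air_temperature",
                "long_name": "Near-surface air temperature",
                "units": "K",
            },
        }
    },
    {"title": "Example dataset template"},
)
print(cdl_text)
```

The resulting text can be saved to a `.cdl` file and turned into an empty netCDF file with `ncgen -o example.nc example.cdl`, so names, units and global attributes are settled before any data is written.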

2 b) Modified existing file

Make sure the original attributes/documentation are still relevant.
Be particularly careful with units, cell_methods and coordinates, which might have changed.

3) Directory structure: depending on how many files you are going to produce, you also want to make sure you have some directory structure in place before the number of files becomes hard to track. It is always best to separate different experiments/analyses. Also, including provenance details in the file name itself reduces the risk of confusing different outputs.
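
As an illustration of putting provenance details in the file name, here is a small sketch. The field order (variable, experiment, frequency, date range, version) and the function `build_filename` are assumptions made up for this example, not a convention agreed in this thread.

```python
from datetime import date


def build_filename(variable, experiment, frequency, start, end, version, suffix="nc"):
    """Join provenance fields into one underscore-separated file name."""
    parts = [variable, experiment, frequency, f"{start:%Y%m%d}-{end:%Y%m%d}", version]
    for part in parts:
        # Underscores separate the fields, so they cannot appear inside one.
        if "_" in part or "/" in part:
            raise ValueError(f"field {part!r} must not contain '_' or '/'")
    return "_".join(parts) + "." + suffix


# Hypothetical example values.
print(build_filename("tas", "exp01", "day", date(2000, 1, 1), date(2000, 12, 31), "v1"))
# → tas_exp01_day_20000101-20001231_v1.nc
```

Keeping the fields in a fixed order means the name can be split back into its parts later, which helps when outputs from several experiments end up side by side.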

4) Backup: set up a backup strategy (could be part of the DMP planning phase). Keep code under version control.

5) Documentation/provenance

Keep track of changes, workflow etc. from the start, even if just in a simple notes text file. Make sure it is regularly updated and that details are added according to the phase of the project, in particular when starting to share data.

I find it hard to separate what is, strictly speaking, creating new files from what is managing them, i.e. directories, backup, planning etc. We should probably mention both aspects while making sure we don't repeat too much of what might be in other sections.

chloemackallah commented 1 year ago

Need to add some examples (e.g. ncdump under metadata) to create

paolap commented 1 year ago

I'm closing this as this is done

ACDguide / Governance

Creating a dataset guidelines outline #7