Align NetCDF data product attributes with ACDD Metadata Standards

ladsmund commented 4 months ago

We need to update our current processing pipeline to align with the Attribute Convention for Data Discovery (ACDD) 1-3 guidelines. This will improve the consistency, discoverability, and interoperability of our datasets.

The convention marks a subset of attributes as Highly Recommended; we should prioritize those.

In addition, I suggest we maintain a source attribute, and maybe a product_version attribute, for reproducibility and to determine whether reprocessing is needed.

https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3#Index_by_Attribute_Name

Attribute	Current status	Target	Example
id	Hash string based on stid. This conflicts with the convention because it is not unique across datasets.	Unique string for the dataset: stid+level+sample. It could be an explicit string or a UUID.	`dk.geus.promice.station.daily.QAS_Lv3` `dk.geus.promice.site.daily.QAS_L`
title	-	A short phrase or sentence describing the dataset.	Automatic weather station measurements from QAS_Lv3. Daily average
summary	-	A paragraph describing the dataset	....
date_created	The datetime when the script was executed	The datetime when the script was executed	2024-06-19T05:12:55.594009
source	-	Information necessary for reproducing the dataset, including the versions of pypromice, data sources and configurations.	{'pypromice': '1.3.6', 'aws-l0': '2e1aa426246', 'aws-metadata': '132201a1'}
product_version	-	Version identifier that can be used to determine if reprocessing is necessary. It might be redundant with source.
institution	GEUS	GEUS	Geological Survey of Denmark and Greenland (GEUS)
date_issued	Same as date_created
date_modified	Same as date_created
processing_level	A textual representation of the processing level	Maybe fine.	Level 2
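
To make the target column concrete, here is a minimal sketch of how these attributes could be assembled before they are written to the NetCDF file. The helper name `build_acdd_attributes` and its parameters are hypothetical and not part of pypromice. Serializing `source` as a JSON string is an assumption, made because the example value is a mapping. `summary` is left out since the table does not give an example for it.

```python
import json
from datetime import datetime, timezone


def build_acdd_attributes(stid, sample_rate, source, dataset_type="station",
                          processing_level="Level 3", now=None):
    """Hypothetical helper: collect the global attributes from the table in a dict."""
    # date_created, date_issued and date_modified all share the execution time
    now = now or datetime.now(timezone.utc)
    timestamp = now.isoformat()
    return {
        # Explicit, human-readable id that is unique across datasets
        "id": f"dk.geus.promice.{dataset_type}.{sample_rate}.{stid}",
        "title": (
            f"Automatic weather station measurements from {stid}. "
            f"{sample_rate.capitalize()} average"
        ),
        "date_created": timestamp,
        "date_issued": timestamp,
        "date_modified": timestamp,
        "institution": "Geological Survey of Denmark and Greenland (GEUS)",
        "processing_level": processing_level,
        # Assumption: store the source mapping as a JSON string with sorted keys
        "source": json.dumps(source, sort_keys=True),
    }


attrs = build_acdd_attributes(
    stid="QAS_Lv3",
    sample_rate="daily",
    source={"pypromice": "1.3.6", "aws-l0": "2e1aa426246", "aws-metadata": "132201a1"},
)
```

The resulting dict could then be merged into the global attributes of the dataset, for example via `Dataset.attrs.update(attrs)` if the pipeline uses xarray.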

ladsmund commented 4 months ago

We should also consider:

1. IDs for source level datasets like tx and raw.
2. IDs for different levels. Station datasets can be stored at both level 2 and level 3. Maybe level 3 could be implicit since it is the official output level.
3. Making IDs unique for each iteration of a dataset. This makes it possible to precisely refer to the actual data used for analysis and processing. We should use dataset IDs extensively in our pipeline to determine whether an output has already been processed. We can use information about the input datasets, pypromice version, etc., to make the iteration ID deterministic.

uuid3 generates a 128-bit UUID by hashing a namespace and an input string with MD5. The output depends solely on the namespace and the input string, so the same input always returns the same value. A benefit of using a hash function for the IDs is that it controls and limits the format of the ID string. This might be especially relevant for point (3).
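
A minimal sketch of how a deterministic iteration ID could be derived with `uuid.uuid3` from the Python standard library. The namespace, the function names `iteration_id` and `needs_reprocessing`, and the way the dataset ID and source information are joined into one string are all assumptions for illustration, not existing pypromice code.

```python
import json
import uuid

# Assumption: a project-wide namespace derived from a DNS-style name
PROMICE_NAMESPACE = uuid.uuid3(uuid.NAMESPACE_DNS, "geus.dk")


def iteration_id(dataset_id, source):
    """Return a deterministic UUID string for one iteration of a dataset."""
    # Sorting the keys makes the ID independent of dict ordering
    name = dataset_id + "|" + json.dumps(source, sort_keys=True)
    return str(uuid.uuid3(PROMICE_NAMESPACE, name))


def needs_reprocessing(existing_id, dataset_id, source):
    """True if the stored ID does not match the ID implied by the current inputs."""
    return existing_id != iteration_id(dataset_id, source)


source = {"pypromice": "1.3.6", "aws-l0": "2e1aa426246", "aws-metadata": "132201a1"}
current = iteration_id("dk.geus.promice.station.daily.QAS_Lv3", source)

# A new pypromice version changes the ID, so the output would be reprocessed
updated = dict(source, pypromice="1.3.7")
print(needs_reprocessing(current, "dk.geus.promice.station.daily.QAS_Lv3", updated))
```

With this approach the ID always has the same fixed-length format, regardless of how much source information goes into it. The trade-off is that the inputs cannot be recovered from the ID, so the readable source attribute is still needed.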

https://github.com/GEUS-Glaciology-and-Climate/pypromice/pull/252#discussion_r1652118161

BaptisteVandecrux commented 3 months ago

Just found this publication that describes a procedure to define and manage attributes in NetCDF files from observation programs:

Uttal, T., Hartten, L. M., Khalsa, S. J., Casati, B., Svensson, G., Day, J., Holt, J., Akish, E., Morris, S., O'Connor, E., Pirazzini, R., Huang, L. X., Crawford, R., Mariani, Z., Godøy, Ø., Tjernström, J. A. K., Prakash, G., Hickmon, N., Maturilli, M., and Cox, C. J.: Merged Observatory Data Files (MODFs): an integrated observational data product supporting process-oriented investigations and diagnostics, Geosci. Model Dev., 17, 5225–5247, https://doi.org/10.5194/gmd-17-5225-2024, 2024.

GEUS-Glaciology-and-Climate / pypromice

Align NetCDF data product attributes with ACDD Metadata Standards #259