GEUS-Glaciology-and-Climate / pypromice

Process AWS data from L0 (raw logger) through Lx (end user)
https://pypromice.readthedocs.io
GNU General Public License v2.0

Align NetCDF data product attributes with ACDD Metadata Standards #259

Open ladsmund opened 3 months ago

ladsmund commented 3 months ago

We need to update our current processing pipeline to align with the Attribute Convention for Data Discovery (ACDD) 1.3 guidelines. This will improve the consistency, discoverability, and interoperability of our datasets.

The convention marks a subset of attributes as Highly Recommended, and we should prioritize those.

In addition, I suggest we maintain a source attribute, and maybe a product_version attribute, for reproducibility and to determine whether reprocessing is needed.

https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3#Index_by_Attribute_Name
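
As a rough sketch of the source attribute idea (not pypromice's actual implementation): NetCDF global attributes hold text or numbers rather than nested mappings, so the provenance record could be serialized to a JSON string. The helper names `build_source_attribute` and `missing_highly_recommended` are hypothetical, the keys and version values are the examples used in this issue, and the four attribute names are the ones ACDD 1.3 lists as Highly Recommended.

```python
import json

# Attributes that ACDD 1.3 lists as "Highly Recommended".
HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")


def build_source_attribute(versions):
    """Serialize provenance info to a stable JSON string (hypothetical helper)."""
    # sort_keys gives the same string regardless of dict insertion order.
    return json.dumps(versions, sort_keys=True)


def missing_highly_recommended(attrs):
    """Return the Highly Recommended attribute names that are absent or empty."""
    return [name for name in HIGHLY_RECOMMENDED if not attrs.get(name)]


# Example values taken from this issue; the key names are assumptions.
source = build_source_attribute(
    {"pypromice": "1.3.6", "aws-l0": "2e1aa426246", "aws-metadata": "132201a1"}
)
attrs = {
    "title": "Automatic weather station measurements from QAS_Lv3. Daily average",
    "institution": "Geological Survey of Denmark and Greenland (GEUS)",
    "source": source,
}
print(missing_highly_recommended(attrs))
```

A check like this could run at the end of processing to flag datasets that still lack the prioritized attributes.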

| Attribute | Current status | Target | Example |
| --- | --- | --- | --- |
| id | Hash string based on stid. This conflicts with the convention, since it is not unique across datasets. | Unique string for the dataset: stid+level+sample. It could be an explicit string or a UUID. | dk.geus.promice.station.daily.QAS_Lv3, dk.geus.promice.site.daily.QAS_L |
| title | - | A short phrase or sentence describing the dataset. | Automatic weather station measurements from QAS_Lv3. Daily average |
| summary | - | A paragraph describing the dataset | .... |
| date_created | The datetime when the script was executed | The datetime when the script was executed | 2024-06-19T05:12:55.594009 |
| source | - | Necessary information for reproducing the dataset, including versions of pypromice, data sources and configurations. | {'pypromice': 1.3.6, 'aws-l0':2e1aa426246, 'aws-metadata': 132201a1} |
| product_version | - | Version identifier that can be used to determine if reprocessing is necessary. It might be redundant with source. | |
| institution | GEUS | GEUS | Geological Survey of Denmark and Greenland (GEUS) |
| date_issued | Same as date_created | | |
| date_modified | Same as date_created | | |
| processing_level | A textual representation of the processing level | Maybe fine. | Level 2 |

ladsmund commented 3 months ago

We should also consider:

  1. IDs for source level datasets like tx and raw.
  2. IDs for different levels. Station datasets can be stored at both level 2 and level 3. Maybe level 3 could be implicit since it is the official output level.
  3. Making IDs unique for each iteration of a dataset. This makes it possible to precisely refer to the actual data used for analysis and processing. We should use dataset IDs extensively in our pipeline to determine whether an output has already been processed. We can use information about the input datasets, pypromice version, etc., to make the iteration ID deterministic.

uuid3 generates a 128-bit identifier by hashing a namespace together with an input string (using MD5). The output depends solely on those two inputs, so the same input always returns the same value. As with any hash, the IDs are unique in practice rather than guaranteed unique. A benefit of using a hash function for the IDs is that it fixes the format and length of the ID string, whatever the inputs are. This might be especially relevant for point (3).
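
A minimal sketch of point (3) using Python's standard `uuid` module. `uuid.uuid3(namespace, name)` hashes a namespace UUID together with a name string, so identical inputs always give the same ID. The namespace derived from the repository URL, the helper names `iteration_id` and `needs_reprocessing`, and the way the inputs are joined are assumptions for illustration, not an agreed format.

```python
import json
import uuid

# Project namespace derived from the repository URL (an arbitrary choice for this sketch).
PROMICE_NAMESPACE = uuid.uuid3(
    uuid.NAMESPACE_URL, "https://github.com/GEUS-Glaciology-and-Climate/pypromice"
)


def iteration_id(dataset_name, source):
    """Derive a deterministic ID from the dataset name and its inputs (hypothetical helper)."""
    # Canonical JSON keeps the hashed string independent of dict ordering.
    name = dataset_name + "|" + json.dumps(source, sort_keys=True)
    return str(uuid.uuid3(PROMICE_NAMESPACE, name))


def needs_reprocessing(existing_id, dataset_name, source):
    """True when the stored ID does not match the ID expected for the current inputs."""
    return existing_id != iteration_id(dataset_name, source)


dataset = "dk.geus.promice.station.daily.QAS_Lv3"
source = {"pypromice": "1.3.6", "aws-l0": "2e1aa426246", "aws-metadata": "132201a1"}
current = iteration_id(dataset, source)

# Same inputs: the stored ID still matches, so the output can be reused.
print(needs_reprocessing(current, dataset, source))

# A changed input (hypothetical newer pypromice version) yields a different ID.
newer = dict(source, pypromice="1.3.7")
print(needs_reprocessing(current, dataset, newer))
```

If MD5 is a concern, `uuid.uuid5` takes the same arguments and uses SHA-1 instead.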

https://github.com/GEUS-Glaciology-and-Climate/pypromice/pull/252#discussion_r1652118161

BaptisteVandecrux commented 2 months ago

Just found this publication, which describes a procedure for defining and managing attributes in NetCDF files from observation programs:

Uttal, T., Hartten, L. M., Khalsa, S. J., Casati, B., Svensson, G., Day, J., Holt, J., Akish, E., Morris, S., O'Connor, E., Pirazzini, R., Huang, L. X., Crawford, R., Mariani, Z., Godøy, Ø., Tjernström, J. A. K., Prakash, G., Hickmon, N., Maturilli, M., and Cox, C. J.: Merged Observatory Data Files (MODFs): an integrated observational data product supporting process-oriented investigations and diagnostics, Geosci. Model Dev., 17, 5225–5247, https://doi.org/10.5194/gmd-17-5225-2024, 2024.