bcdev / nc2zarr

A Python tool that converts NetCDF files to Zarr format
MIT License
9 stars 3 forks source link

Update output metadata on finalize #42

Closed forman closed 3 years ago

forman commented 3 years ago

Closes #20 Closes #34

Whether to adjust output metadata (global attributes) after the last write/append can now be forced by the new output.adjust_metadata setting whose default is false. If set to true, this will adjust the following metadata attributes

In addition, extra metadata (global attributes) can now be added after the last write/append using the new setting output.metadata whose value is a mapping from attribute names to values.

The above functionality is also reflected in two new CLI options --adjust-metadata and --finalize-only. In combination, they can be used to later adjust metadata in already existing Zarr datasets.

pont-us commented 3 years ago

This wouldn't work for the CMEMS CHL data: its metadata looks like this:

start_date: 2020-04-04
stop_date: 2020-04-04
start_time: 08:00:00 UTC
stop_time: 14:00:00 UTC

I haven't tested the PR with these datasets yet, but as far as I can see from the source it would ignore the _date keys and overwrite the _time keys with %Y-%m-%d %H:%M:%S-formatted strings.

Also, what about spatial metadata? Mentioned in #20 but not addressed in this PR as far as I can tell.

forman commented 3 years ago

This wouldn't work for the CMEMS CHL data: its metadata looks like this:

start_date: 2020-04-04
stop_date: 2020-04-04
start_time: 08:00:00 UTC
stop_time: 14:00:00 UTC

I haven't tested the PR with these datasets yet, but as far as I can see from the source it would ignore the _date keys and overwrite the _time keys with %Y-%m-%d %H:%M:%S-formatted strings.

Also, what about spatial metadata? Mentioned in #20 but not addressed in this PR as far as I can tell.

Ok, this is very special, I hate that :) If seen start_time, stop_time pairs often (but then containing datetime value) and your case only once (CMEMS CHL). I then suggest to not touch start_time, stop_time at all and update those with a postprocessor, or maybe a new custom finalizer.

pont-us commented 3 years ago

If seen start_time, stop_time pairs often (but then containing datetime value) and your case only once (CMEMS CHL).

This is the case for all 10 CHL datasets on Creodias -- possible for all 60 CMEMS datasets, but I haven't checked. Then again, that in itself isn't a strong argument against using a custom processor, since the processor only needs to be written once. So I have nothing against externalizing the CMEMS fix-up into a processor, especially since there are probably other unusual cases out there which will have to be handled specially too.

We'll need to implement a new processor hook (separate issue/PR of course). The processor function needs simultaneous access to both the existing dataset and the to-be-appended dataset, which none of the current ones get as far as I know.

forman commented 3 years ago

We'll need to implement a new processor hook (separate issue/PR of course).

Yes.

The processor function needs simultaneous access to both the existing dataset and the to-be-appended dataset,

Why do you need both?

pont-us commented 3 years ago

Why do you need both?

Hmm, actually perhaps not necessary if we're taking the times from the data itself, but I (and the CloudFerros) would also like to be able to calculate from the existing metadata, e.g. updated_start_date = min(old_dataset.attrs.start_date, new_dataset.attrs.start_date).