Update output metadata on finalize

bcdev / nc2zarr

A Python tool that converts NetCDF files to Zarr format

MIT License

9 stars 3 forks source link

Update output metadata on finalize #42

Closed forman closed 3 years ago

forman commented 3 years ago

Closes #20 Closes #34

Whether to adjust output metadata (global attributes) after the last write/append can now be forced by the new output.adjust_metadata setting whose default is false. If set to true, this will adjust the following metadata attributes

"history"
"source"
"time_coverage_start"
"time_coverage_end"

In addition, extra metadata (global attributes) can now be added after the last write/append using the new setting output.metadata whose value is a mapping from attribute names to values.

The above functionality is also reflected in two new CLI options --adjust-metadata and --finalize-only. In combination, they can be used to later adjust metadata in already existing Zarr datasets.

pont-us commented 3 years ago

This wouldn't work for the CMEMS CHL data: its metadata looks like this:

start_date: 2020-04-04
stop_date: 2020-04-04
start_time: 08:00:00 UTC
stop_time: 14:00:00 UTC

I haven't tested the PR with these datasets yet, but as far as I can see from the source it would ignore the _date keys and overwrite the _time keys with %Y-%m-%d %H:%M:%S-formatted strings.

Also, what about spatial metadata? Mentioned in #20 but not addressed in this PR as far as I can tell.

forman commented 3 years ago

This wouldn't work for the CMEMS CHL data: its metadata looks like this:
start_date: 2020-04-04
stop_date: 2020-04-04
start_time: 08:00:00 UTC
stop_time: 14:00:00 UTC
I haven't tested the PR with these datasets yet, but as far as I can see from the source it would ignore the _date keys and overwrite the _time keys with %Y-%m-%d %H:%M:%S-formatted strings.

Also, what about spatial metadata? Mentioned in #20 but not addressed in this PR as far as I can tell.

Ok, this is very special, I hate that :) If seen start_time, stop_time pairs often (but then containing datetime value) and your case only once (CMEMS CHL). I then suggest to not touch start_time, stop_time at all and update those with a postprocessor, or maybe a new custom finalizer.

pont-us commented 3 years ago

If seen start_time, stop_time pairs often (but then containing datetime value) and your case only once (CMEMS CHL).

This is the case for all 10 CHL datasets on Creodias -- possible for all 60 CMEMS datasets, but I haven't checked. Then again, that in itself isn't a strong argument against using a custom processor, since the processor only needs to be written once. So I have nothing against externalizing the CMEMS fix-up into a processor, especially since there are probably other unusual cases out there which will have to be handled specially too.

We'll need to implement a new processor hook (separate issue/PR of course). The processor function needs simultaneous access to both the existing dataset and the to-be-appended dataset, which none of the current ones get as far as I know.

forman commented 3 years ago

We'll need to implement a new processor hook (separate issue/PR of course).

Yes.

The processor function needs simultaneous access to both the existing dataset and the to-be-appended dataset,

Why do you need both?

pont-us commented 3 years ago

Why do you need both?

Hmm, actually perhaps not necessary if we're taking the times from the data itself, but I (and the CloudFerros) would also like to be able to calculate from the existing metadata, e.g. updated_start_date = min(old_dataset.attrs.start_date, new_dataset.attrs.start_date).