List input files in history metadata attribute

bcdev / nc2zarr

A Python tool that converts NetCDF files to Zarr format

MIT License


List input files in history metadata attribute #34

Closed pont-us closed 3 years ago

pont-us commented 3 years ago

At present, Zarrs produced by nc2zarr don't contain any indication of the source files from which they were generated. nc2zarr should optionally include a list of source files in the value of the Zarr's history attribute on first generation, and update this value with the additional input files when appending to an existing Zarr.

forman commented 3 years ago

Several metadata attributes may be updated in a CF-compliant way; see the section "Description of File Contents" in the CF conventions.

forman commented 3 years ago

Should be resolved together with #20.

pont-us commented 3 years ago

Implementing this will also help with implementation of a related feature request from CloudFerro: the ability to ignore an input file when appending if it has already been ingested into the target Zarr. nc2zarr could check the input pathname or filename against the list in the history attribute before appending.

forman commented 3 years ago

@pont-us Please note:

[x] Append a note to the history attribute stating when and what has been done: "\n${date}: converted to Zarr using nc2zarr ${version}".
[x] Use the sources attribute to list the sources.
[x] Resolve #20
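
The first two items above could be sketched roughly as follows. This is only an illustration, not nc2zarr's actual code: the function name `update_provenance`, the ISO-style timestamp, and the comma-separated layout of the `sources` value are all assumptions made for the example.

```python
from datetime import datetime, timezone


def update_provenance(attrs, input_paths, version, now=None):
    """Return a copy of `attrs` with `history` and `sources` updated.

    Hypothetical helper: the name, the timestamp format and the
    comma-separated `sources` layout are assumptions for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    updated = dict(attrs)

    # Append one line per conversion run, keeping earlier entries.
    entry = f"{now:%Y-%m-%dT%H:%M:%SZ}: converted to Zarr using nc2zarr {version}"
    old_history = updated.get("history", "")
    updated["history"] = f"{old_history}\n{entry}" if old_history else entry

    # Extend the list of sources, preserving order and skipping duplicates.
    sources = [s for s in updated.get("sources", "").split(", ") if s]
    for path in input_paths:
        if path not in sources:
            sources.append(path)
    updated["sources"] = ", ".join(sources)
    return updated
```

On first generation the function would be called with the empty (or inherited) attributes of the new dataset; when appending, it would be called with the attributes already stored in the target Zarr. Note that a path containing ", " would break the simple separator used here.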

forman commented 3 years ago

> Implementing this will also help with implementation of a related feature request from CloudFerro: the ability to ignore an input file when appending if it has already been ingested into the target Zarr. nc2zarr could check the input pathname or filename against the list in the history attribute before appending.

We should not use metadata to make decisions about the data in the dataset. Whether a timeslice has already been processed should be detected by looking into the data itself, namely the time coordinates. Once a duplicate is detected, there are two options: ignore the new data or replace the existing data. Replacing an existing timeslice with a more up-to-date one is a valid use case that we have in other scenarios. (Example: the same Sentinel-3 Level-2 data is processed in a fast lane and in a second lane that takes much longer but yields higher data quality. When the second product arrives, it replaces the first.)
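
A minimal sketch of that check, assuming the time coordinates of the target Zarr and of the incoming data are available as arrays of timestamps. The function name `plan_append` and the `mode` values are hypothetical and chosen only for this example; they are not part of nc2zarr.

```python
import numpy as np


def plan_append(existing_times, new_times, mode="ignore"):
    """Compare time coordinates and decide what to do with each new step.

    Returns (append_idx, replace_pairs):
      append_idx:    indices into new_times that are not yet in the target
      replace_pairs: (target_idx, new_idx) pairs to overwrite; always empty
                     unless mode == "replace"
    """
    if mode not in ("ignore", "replace"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Normalise both coordinates to integer nanoseconds so they compare exactly.
    existing = np.asarray(existing_times, dtype="datetime64[ns]").astype("int64").tolist()
    new = np.asarray(new_times, dtype="datetime64[ns]").astype("int64").tolist()

    position = {t: i for i, t in enumerate(existing)}
    append_idx, replace_pairs = [], []
    for j, t in enumerate(new):
        if t not in position:
            append_idx.append(j)  # unseen timeslice: append it
        elif mode == "replace":
            replace_pairs.append((position[t], j))  # overwrite the older slice
        # mode == "ignore": a timeslice that already exists is skipped
    return append_idx, replace_pairs
```

With `existing_times = ["2021-01-01", "2021-01-02"]` and `new_times = ["2021-01-02", "2021-01-03"]`, the default mode marks only the second new step for appending, while `mode="replace"` additionally pairs the first new step with target index 1 for overwriting.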

pont-us commented 3 years ago

> We should not use metadata to make decisions about the data in the dataset. Whether a timeslice has already been processed or not should be detected by looking into the data: time coordinates.

Agreed -- I've opened Issue #41 to discuss implementation of this functionality.