aodn / aodn_cloud_optimised

Cloud optimised data formats
GNU General Public License v3.0

Filling gaps in Zarr datasets by writing new NetCDF files to the appropriate region #107

Open LeoLee-Xiaohu opened 1 week ago

LeoLee-Xiaohu commented 1 week ago

We are testing the Prefect flow that uses the cloud-optimised library to update Zarr datasets. Creating a Zarr dataset from a single NetCDF file and then appending another single NetCDF file to it have both succeeded.

We are now testing a scenario in which a late-arriving file has to fill a gap in an existing Zarr dataset.
For example, the source NetCDF file for 2024-01-02 did not arrive in our bucket on 2024-01-02 but was uploaded later, after 2024-01-03. As a result, the Zarr dataset has a gap at 2024-01-02, because the corresponding NetCDF file was not available when the dataset was updated.

To simulate a gap in the Zarr dataset, I generated a test Zarr dataset by:

  1. Creating it using a single NetCDF file for 2024-01-01.
  2. Appending it with a single NetCDF file for 2024-01-03.

This setup simulates a gap for 2024-01-02. I then attempted to update the Zarr dataset by running the following command:

python cloud_optimised_update_flow.py --path "IMOS/SRS/SST/ghrsst/L3S-1d/dn/2024/20240102092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc" --dataset-config "satellite_ghrsst_l3s_1day_daynighttime_single_sensor_australia.json"

The expected result was for the NetCDF file of 2024-01-02 to be written into the correct time region, filling the gap in the Zarr dataset. The time order should have been:
2024-01-01, 2024-01-02, 2024-01-03.

However, the NetCDF file for 2024-01-02 was instead appended after 2024-01-03, resulting in the following time order:
2024-01-01, 2024-01-03, 2024-01-02.

Could you confirm whether the cloud-optimised library supports filling gaps in Zarr datasets by writing new NetCDF files to the appropriate region? This functionality would ensure that the Zarr dataset maintains a correct chronological order.
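
To make the three possible cases explicit, here is a minimal sketch in plain Python. It does not use the cloud-optimised library's API; `classify_update` is a hypothetical helper, and the time axis is assumed to be already sorted. It separates a plain append, an overwrite of an existing time step (which a region write could handle), and a true insert into a gap. The insert is the case in this issue: a Zarr array has no slot reserved for the missing step, so the later steps would have to be shifted or the dataset rewritten.

```python
from bisect import bisect_left
from datetime import date


def classify_update(existing_times, new_time):
    """Return (action, index) for placing new_time on a sorted time axis.

    Hypothetical helper for illustration only:
      "overwrite" -> the time step already exists at `index`
      "append"    -> new_time is later than every existing time step
      "insert"    -> new_time falls in a gap before `index`
    """
    idx = bisect_left(existing_times, new_time)
    if idx < len(existing_times) and existing_times[idx] == new_time:
        return "overwrite", idx
    if idx == len(existing_times):
        return "append", idx
    return "insert", idx


# Time axis of the test dataset: 2024-01-01 created, 2024-01-03 appended.
existing = [date(2024, 1, 1), date(2024, 1, 3)]

action, index = classify_update(existing, date(2024, 1, 2))
print(action, index)  # insert 1
```

For the test dataset above, the late 2024-01-02 file is classified as an insert at index 1, which is the position we expected it to occupy.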

lbesnard commented 5 days ago

FYI, https://discourse.pangeo.io/t/how-to-efficiently-overwrite-existing-zarr-archive-with-reordered-time-axis-updated-question/2714
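
For reference, a minimal sketch of reordering along time with xarray, using small in-memory datasets rather than the real store. This is an illustration under assumptions, not what the library currently does: `make_day`, the variable name `sea_surface_temperature`, and the grid are hypothetical stand-ins for the GHRSST files.

```python
import numpy as np
import xarray as xr


def make_day(day, value):
    """Build a tiny stand-in for one daily file (hypothetical variable and grid)."""
    return xr.Dataset(
        {"sea_surface_temperature": (("time", "lat"), np.full((1, 2), value))},
        coords={"time": [np.datetime64(day, "ns")], "lat": [-10.0, -20.0]},
    )


# Reproduce the reported order: 01, 03, then the late 02 appended at the end.
appended = xr.concat(
    [
        make_day("2024-01-01", 1.0),
        make_day("2024-01-03", 3.0),
        make_day("2024-01-02", 2.0),
    ],
    dim="time",
)
print(appended.indexes["time"].is_monotonic_increasing)  # False

# Sorting along time restores chronological order; data moves with its timestamp.
ordered = appended.sortby("time")
print(ordered.indexes["time"].is_monotonic_increasing)  # True
```

Persisting the sorted result would mean rewriting the data, for example with `ordered.to_zarr(<new store>, mode="w")` into a separate store, which can be expensive for a large dataset.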