fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License
310 stars 80 forks source link

Unexpected behavior with MultiZarrToZarr with partial chunks. #400

Open sharkinsspatial opened 11 months ago

sharkinsspatial commented 11 months ago

While experimenting with kerchunking some Icesat2 ATL08 data I noticed an issue where using MultiZarrToZarr with non-dimension coordinates that had partial chunks resulted in empty values for those variables in the ouput kerchunk index.

A minimal example

import xarray as xr
import numpy as np
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Set up fake netCDF files with latitude as a non-dimension coordinate.
time_1_3 = xr.Dataset(
    {
        "delta_time": [12, 13, 14]
    },
    coords={
        'latitude': ('delta_time', [1, 2, 3]),
 })
time_15_17 = xr.Dataset(
    {
        "delta_time": [15, 16, 17]
    },
    coords={
        'latitude': ('delta_time', [4, 5, 6]),
})
time_18_22 = xr.Dataset(
    {
        "delta_time": [18, 19, 20, 21, 22]
    },
    coords={
        'latitude': ('delta_time', [7, 8, 9, 10, 11]),
 })
time_1_3

Screen Shot 2023-11-28 at 4 21 41 PM

# Create netCDFs with chunksize aligned to data size
chunksize = 3
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_1_3.to_netcdf('time_1_3.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

def create_reference(files: list[str]):
    single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() for filepath in files]
    mzz = MultiZarrToZarr(
        single_jsons,
        concat_dims=["delta_time"],
    )
    combined_test_json = mzz.translate()

    combined_test = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={
            "storage_options": {
                "fo": combined_test_json,
                },
            "consolidated": False,
        }
    )
    return combined_test

# This works as expected
combined_test_3 = create_reference(["time_1_3.nc", "time_15_17.nc"])
combined_test_3.latitude.data

Screen Shot 2023-11-28 at 4 22 41 PM

# Create netCDFs with a larger chunksize resulting in partially filled chunks not aligned to data size
chunksize = 10
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_18_22.to_netcdf('time_18_22.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

# This results in empty values for the resulting non-dimension coordinate variable.
combined_test_10 = create_reference(["time_15_17.nc", "time_18_22.nc"])
combined_test_10.latitude.data

Screen Shot 2023-11-28 at 4 25 03 PM

This seems potentially related to some of the discussion in https://github.com/fsspec/kerchunk/issues/305 (as it is also describing the case of data not aligned with chunk size).

If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).

I may be misunderstanding the MultiZarrToZarr logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?

martindurant commented 11 months ago

If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).

I think in this case, the coordinates get inlined as a single chunk in the output references, so the chunking of the original doesn't matter. The same could be done with any variable. This is not chunk inlining but whole array inlining to a single chunk, e.g., kerchunk.utils.inline_array . However, MultiZarrToZarr doesn't even bother accumulating values of non-coordinates, since they are not used for figuring out the concatenation chunk placement.

More generally, non-equal chunks are just not possible because of the limitations of zarr at least until ZEP003 is accepted and implemented ( https://github.com/zarr-developers/zarr-python/pull/1483 ). If this were done, no magic inlining would be needed. Feel free to ping on the issue or the ZEP repo saying you need this.