fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Concatenating datasets with staggered grids #362


TomNicholas commented 10 months ago

Here are two toy datasets designed to represent sections of a dataset that has variables living on a staggered grid. This type of dataset is common in fluid modelling (handling staggered grids is why xGCM exists).

import xarray as xr

# 'a' lives on 3 cell centers, 'b' on the 4 cell edges that surround them
ds1 = xr.Dataset(
    data_vars={
        'a': ('x_center', [1, 2, 3]),
        'b': ('x_outer', [0.5, 1.5, 2.5, 3.5]),
    },
)

# the final section has one fewer edge point, so 'b' has only 3 values here
ds2 = xr.Dataset(
    data_vars={
        'a': ('x_center', [4, 5, 6]),
        'b': ('x_outer', [4.5, 5.5, 6.5]),
    },
)

I have netcdf output files from an ocean model (UCLA-ROMS) that have this basic structure.

Combining these types of datasets seems like a bit of a pain to do with kerchunk at the moment. To concatenate along the x direction, I actually need to concatenate a along x_center, and b along x_outer. So presumably I have to call MultiZarrToZarr once for each variable (or group of variables) that needs to be concatenated along a common dimension. My real dataset is split along multiple dimensions and has multiple staggered grid locations for each dimension, meaning I have to call MultiZarrToZarr something like 6 times.

This problem is analogous to what happens inside xarray.combine_by_coords, which automatically splits the datasets up into sets consisting of the same variable from each dataset, concatenates each set separately (along multiple dimensions in general), then merges the results.

Is that approach (call MultiZarrToZarr multiple times then call kerchunk.combine.merge_vars) the recommended way to handle this case currently? Could we imagine some improvement to the kerchunk.combine API that might make this easier?
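
For concreteness, here is a rough sketch of the multi-call workflow I mean, assuming per-file reference JSONs 0.json and 1.json already exist (they are generated in my next comment), and setting aside the problems discussed later in the thread:

from kerchunk.combine import MultiZarrToZarr, drop, merge_vars

jsons = ['staggered_kerchunk_test/0.json', 'staggered_kerchunk_test/1.json']

# 'a' lives on x_center: drop 'b', concatenate the rest along x_center
refs_a = MultiZarrToZarr(
    jsons, concat_dims=['x_center'], preprocess=drop('b')
).translate()

# 'b' lives on x_outer: drop 'a', concatenate the rest along x_outer
refs_b = MultiZarrToZarr(
    jsons, concat_dims=['x_outer'], preprocess=drop('a')
).translate()

# merge the two single-variable reference sets back into one dataset
combined = merge_vars([refs_a, refs_b])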

TomNicholas commented 10 months ago

When I actually tried this on my toy example locally I got an error:

import fsspec
import ujson
from pathlib import Path

import kerchunk.combine
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# save example datasets as netCDF (creating the output directory first)
Path('staggered_kerchunk_test').mkdir(exist_ok=True)
ds1.to_netcdf('./staggered_kerchunk_test/0.nc')
ds2.to_netcdf('./staggered_kerchunk_test/1.nc')

fs2 = fsspec.filesystem('')  # local file system to save final jsons to

def gen_json(in_filepath: str):
    in_filename = Path(in_filepath).stem

    h5chunks = SingleHdf5ToZarr(in_filepath, inline_threshold=300)

    out_filepath = f'staggered_kerchunk_test/{in_filename}.json'
    with fs2.open(out_filepath, 'wb') as f:
        f.write(ujson.dumps(h5chunks.translate()).encode())

for file in ['staggered_kerchunk_test/0.nc', 'staggered_kerchunk_test/1.nc']:
    gen_json(file)

# concatenate only variable 'a' along dimension 'x_center'
mzz = MultiZarrToZarr(
    ['staggered_kerchunk_test/0.json', 'staggered_kerchunk_test/1.json'],
    concat_dims=['x_center'],
    preprocess=kerchunk.combine.drop('b'),
)

d = mzz.translate('staggered_kerchunk_test/combined.json')
--------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[99], line 1
----> 1 d = mzz.translate('staggered_kerchunk_test/combined.json')

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:496, in MultiZarrToZarr.translate(self, filename, storage_options)
    490 """Perform all stages and return the resultant references dict
    491 
    492 If filename and storage options are given, the output is written to this
    493 file using ujson and fsspec instead of being returned.
    494 """
    495 if 1 not in self.done:
--> 496     self.first_pass()
    497 if 2 not in self.done:
    498     self.store_coords()

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:258, in MultiZarrToZarr.first_pass(self)
    256 z = zarr.open_group(fs.get_mapper(""))
    257 for var in self.concat_dims:
--> 258     value = self._get_value(i, z, var, fn=self._paths[i])
    259     if isinstance(value, np.ndarray):
    260         value = value.ravel()

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/kerchunk/combine.py:226, in MultiZarrToZarr._get_value(self, index, z, var, fn)
    224     o = z[var].attrs[item]
    225 elif selector.startswith("data:"):
--> 226     o = z[selector.split(":", 1)[1]][...]
    227 elif selector.startswith("cf:"):
    228     import cftime

File /glade/work/tnicholas/conda-envs/kerchunk3.9/lib/python3.9/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
    498         raise KeyError(item)
    499 else:
--> 500     raise KeyError(item)

KeyError: 'x_center'

This doesn't make sense to me - if I only drop variable 'b', then only 'a' should be left, so why can't it find 'x_center'?

martindurant commented 10 months ago

MultiZarrToZarr is very much oriented towards a netCDF-style merger of sub-datasets, wherein the variables share coordinates with one another, as opposed to the linear concat going on here. Other styles of merging are completely reasonable, and we should develop ways to run them and help users choose between them.

Before going further, I believe the exception you are seeing is because x_center has no defined values (they become [0, 1, 2] in each input and aren't represented in the zarr version of each input at all), so kerchunk has no way to know where each of the constituent datasets is supposed to go. If you define them as coo_map={"x_center": [[1, 2, 3], [4, 5, 6]]}, the combine works.
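
i.e., for your example above, something like this (reusing the imports and reference files from your snippet):

mzz = MultiZarrToZarr(
    ['staggered_kerchunk_test/0.json', 'staggered_kerchunk_test/1.json'],
    concat_dims=['x_center'],
    # explicit per-input values for the otherwise-undefined x_center coordinate
    coo_map={'x_center': [[1, 2, 3], [4, 5, 6]]},
    preprocess=kerchunk.combine.drop('b'),
)
d = mzz.translate('staggered_kerchunk_test/combined.json')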

This particular dataset has variable b with unequal chunks, so it's not possible to represent this via zarr at all until ZEP003 is resolved.

TomNicholas commented 10 months ago

MultiZarrToZarr is very much oriented towards a netCDF-style merger of sub-datasets, wherein the variables share coordinates with one another, as opposed to the linear concat going on here.

Is there some other tool that I should be using instead? kerchunk.combine.concatenate_arrays? This is the first time I've tried to use kerchunk myself so I appreciate any pointers!

the combine works.

Using coo_map does work, thank you!

so kerchunk has no way to know where each of the constituent datasets is supposed to go

Why is coo_map required though? I gave kerchunk an ordered list, deliberately without a coordinate variable for x_center. Why do we have to invent a set of integer values to use as the data for that coordinate in order to combine the datasets?

This particular dataset has variable b with unequal chunks, so it's not possible to represent this via zarr at all until ZEP003 is resolved.

In this case presumably it should still work because my chunks are of the form {X, X, X, ... Y}, where Y<=X. I understand this general restriction though.

martindurant commented 10 months ago

Why is coo_map required though? I gave kerchunk an ordered list, deliberately without a coordinate variable for x_center. Why do we have to invent a set of integer values to use as the data for that coordinate in order to combine the datasets?

Simply that we haven't made a case for it. We either take the values as they appear ([0, 1, 2] for both sets) or take the index of the filename (which would be [0, 0, 0], [1, 1, 1]!), but not a combination. I agree that we should be able to spot the case that the coord is missing from the input completely and cope with that - but this can only work when concatenating on exactly one axis.

In this case presumably it should still work because my chunks are of the form {X, X, X, ... Y}, where Y<=X

Unfortunately, no. Zarr does allow the last chunk to be partial, but it still saves a buffer equivalent to a whole chunk and then slices off the needed parts.

TomNicholas commented 10 months ago

So if I promoted my dimensions to be coordinate variables then I could do a multi-dimensional concatenation like I suggested above?

I agree that we should be able to spot the case that the coord is missing from the input completely and cope with that

That would be nice. It is what xarray.concat/xarray.combine_nested can do.

but this can only work when concatenating on exactly one axis.

The way we dealt with that ambiguity in xarray was to allow combine_nested to accept lists-of-lists, and repeatedly combine along the innermost lists.
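
For example, a toy illustration of that convention (made-up data, just to show the nesting):

import numpy as np
import xarray as xr

# four tiles of a 2D field, arranged as a 2x2 nested list of datasets
tiles = [
    [xr.Dataset({'t': (('y', 'x'), np.full((2, 2), v))}) for v in row]
    for row in [[0, 1], [2, 3]]
]

# inner lists are concatenated along 'x', then the results along 'y'
combined = xr.combine_nested(tiles, concat_dim=['y', 'x'])
print(combined['t'].shape)  # (4, 4)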

Unfortunately, no. Zarr does allow the last chunk to be partial, but it still saves a buffer equivalent to a whole chunk and then slices off the needed parts.

Sorry, you're saying that whilst Zarr can represent arrays where the final chunk is partial, you currently cannot build a kerchunk reference from netCDF files where the final chunk is smaller, because of the way Zarr actually handles that partial chunk?

martindurant commented 10 months ago

because of the way Zarr actually handles that partial chunk?

An example: note how keys "0" and "1" have the same number of bytes. They are expected to be the size of the full chunk on read.

In [1]: store = {}

In [2]: import zarr

In [3]: z = zarr.open_array(store, mode="w", dtype="i4", compression=None, shape=(6,), chunks=(4,))

In [4]: z[:] = 1

In [5]: store
Out[5]:
{'.zarray': b'{\n    "chunks": [\n        4\n    ],\n    "compressor": null,\n    "dimension_separator": ".",\n    "dtype": "<i4",\n    "fill_value": 0,\n    "filters": null,\n    "order": "C",\n    "shape": [\n        6\n    ],\n    "zarr_format": 2\n}',
 '0': b'\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00',
 '1': b'\x01\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'}
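
Each key is 16 bytes (four int32 values): key "1" only holds two valid elements of the length-6 array, but a whole chunk's worth of bytes is written anyway, padded with the fill value.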