fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Shifted/missing chunks after running MultiZarrToZarr #430


JoelJaeschke commented 9 months ago

Hey πŸ‘‹, first of all, thanks for this awesome project! It really makes working with large collections of data so much easier and I greatly appreciate the effort!

Unfortunately, I am currently facing an issue when using kerchunk to create an aggregated view over ERA5 reanalysis data. I have attached a script and two files to serve as a minimal reproducer. I strongly suspect I am using kerchunk incorrectly, but I have had a hard time figuring out what exactly I am doing wrong.

The minimal example contains two netCDF files of maximum daily temperature (tasmax, internally chunked/stored as (31, 2, 2) along the (time, lat, lon) dimensions, respectively), and I am following the example from the Quickstart: I create single-file references using SingleHdf5ToZarr and then merge them using MultiZarrToZarr. However, when I open the merged file, the values for 2003 are correct, but the values for 2004 appear shifted or even missing (values after 2004-12-08 are all NaN, even though they are present in the original file).

There is one place where I deviate from the example code, namely using coo_map={"time": "cf:time"} instead of concat_dims=["time"]. With the latter, my time dimension only ranges from 2003-01-01 to 2004-01-01. I suspect this happens because both source files have time units of days since <year>-01-01, meaning the raw time values look like range(0, 365) and range(0, 366) for 2003 and 2004 (a leap year), respectively. Therefore, if the axis is not parsed with cftime right away, only the underlying integer values (days since the datum) are combined, and the merged time dimension ends up with only 366 entries. When I instead convert the time axes to a common start datum, i.e. both years use units of days since 2003-01-01 00:00:00, I can use concat_dims=["time"] directly; however, the underlying issue persists, so I suppose this is not the problem.
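
Here is a minimal sketch of that collision (illustrative only; it just assumes the raw time values are stored exactly as described above):

import numpy as np

# Raw integer offsets as stored in the two files, per the units above
time_2003 = np.arange(365)  # days since 2003-01-01
time_2004 = np.arange(366)  # days since 2004-01-01

# Combining the undecoded integers collapses the overlapping values
merged = np.union1d(time_2003, time_2004)
print(merged.size)  # 366 rather than the expected 731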

I would greatly appreciate it if you could point me in a direction for solving this problem.

reproducer.zip

Code from reproducer:

import xarray as xr

import ujson
import fsspec

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

if __name__ == "__main__":
    fs = fsspec.filesystem("file")
    flist = fs.glob("era5_tasmax_*.nc")

    # Generate a single-file reference set for each input file
    for fname in flist:
        with fs.open(fname) as infile:
            chunks = SingleHdf5ToZarr(infile, fname)
            outf = f"{fname.split('/')[-1]}.json"
            with open(outf, "wb") as f:
                f.write(ujson.dumps(chunks.translate()).encode())

    # Glob only the per-file references so a rerun does not pick up
    # the combined reference file written below
    json_list = fs.glob("era5_tasmax_*.nc.json")

    # Merge the single-file references into one aggregated view
    mzz = MultiZarrToZarr(
        json_list,
        coo_map={"time": "cf:time"},
        identical_dims=["lat", "lon"],
        remote_protocol="file",
    )

    with open("era5_tasmax_2003-2004.json", "wb") as f:
        f.write(ujson.dumps(mzz.translate()).encode())

    ds_merged = xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": "era5_tasmax_2003-2004.json"},
        },
    )

    # Compare one day against the original file; the merged values
    # come back shifted or NaN instead of matching
    ds2004 = xr.open_dataset("era5_tasmax_2004.nc")
    original_sel = ds2004.sel(time="2004-12-08").tasmax.compute()
    merged_sel = ds_merged.sel(time="2004-12-08").tasmax.compute()

    print(original_sel)
    print(merged_sel)

I am using the following libraries:

# packages in environment at <....>\test_kerchunk:
#
# Name                    Version                   Build  Channel
asciitree                 0.3.3                      py_2    conda-forge
blosc                     1.21.5               hdccc3a2_0    conda-forge
bzip2                     1.0.8                hcfcfb64_5    conda-forge
ca-certificates           2024.2.2             h56e8100_0    conda-forge
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
certifi                   2024.2.2           pyhd8ed1ab_0    conda-forge
cftime                    1.6.3           py311h59ca53f_0    conda-forge
entrypoints               0.4                pyhd8ed1ab_0    conda-forge
fasteners                 0.17.3             pyhd8ed1ab_0    conda-forge
fsspec                    2024.2.0           pyhca7485f_0    conda-forge
h5py                      3.10.0          nompi_py311h7195302_101    conda-forge
hdf4                      4.2.15               h5557f11_7    conda-forge
hdf5                      1.14.3          nompi_h73e8ff5_100    conda-forge
intel-openmp              2024.0.0         h57928b3_49841    conda-forge
kerchunk                  0.2.3              pyhd8ed1ab_0    conda-forge
krb5                      1.21.2               heb0366b_0    conda-forge
libaec                    1.1.2                h63175ca_1    conda-forge
libblas                   3.9.0              21_win64_mkl    conda-forge
libcblas                  3.9.0              21_win64_mkl    conda-forge
libcurl                   8.5.0                hd5e4a3a_0    conda-forge
libexpat                  2.5.0                h63175ca_1    conda-forge
libffi                    3.4.2                h8ffe710_5    conda-forge
libhwloc                  2.9.3           default_haede6df_1009    conda-forge
libiconv                  1.17                 hcfcfb64_2    conda-forge
libjpeg-turbo             3.0.0                hcfcfb64_1    conda-forge
liblapack                 3.9.0              21_win64_mkl    conda-forge
libnetcdf                 4.9.2           nompi_h07c049d_113    conda-forge
libsqlite                 3.45.1               hcfcfb64_0    conda-forge
libssh2                   1.11.0               h7dfc565_0    conda-forge
libxml2                   2.12.5               hc3477c8_0    conda-forge
libzip                    1.10.1               h1d365fa_3    conda-forge
libzlib                   1.2.13               hcfcfb64_5    conda-forge
lz4-c                     1.9.4                hcfcfb64_0    conda-forge
mkl                       2024.0.0         h66d3029_49657    conda-forge
msgpack-python            1.0.7           py311h005e61a_0    conda-forge
netcdf4                   1.6.5           nompi_py311he019f65_100    conda-forge
numcodecs                 0.11.0          py311h12c1d0e_1    conda-forge
numpy                     1.26.4          py311h0b4df5a_0    conda-forge
openssl                   3.2.1                hcfcfb64_0    conda-forge
packaging                 23.2               pyhd8ed1ab_0    conda-forge
pandas                    2.2.1           py311hf63dbb6_0    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
pthreads-win32            2.9.1                hfa6e2cd_3    conda-forge
python                    3.11.8          h2628c8c_0_cpython    conda-forge
python-dateutil           2.9.0              pyhd8ed1ab_0    conda-forge
python-tzdata             2024.1             pyhd8ed1ab_0    conda-forge
python_abi                3.11                    4_cp311    conda-forge
pytz                      2024.1             pyhd8ed1ab_0    conda-forge
setuptools                69.1.1             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
snappy                    1.1.10               hfb803bf_0    conda-forge
tbb                       2021.11.0            h91493d7_1    conda-forge
tk                        8.6.13               h5226925_1    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
ucrt                      10.0.22621.0         h57928b3_0    conda-forge
ujson                     5.9.0           py311h12c1d0e_0    conda-forge
vc                        14.3                hcf57466_18    conda-forge
vc14_runtime              14.38.33130         h82b7239_18    conda-forge
vs2015_runtime            14.38.33130         hcb4865c_18    conda-forge
wheel                     0.42.0             pyhd8ed1ab_0    conda-forge
xarray                    2024.2.0           pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h8d14728_0    conda-forge
zarr                      2.17.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               hcfcfb64_5    conda-forge
zstd                      1.5.5                h12be248_0    conda-forge
JoelJaeschke commented 9 months ago

Answering my own question in part: by changing the chunk size of the time dimension to 1, everything suddenly works as expected. My guess is that since the chunk size of 31 does not evenly divide the 365/366 days of a year, something goes wrong in the created zarr view such that chunk offsets are calculated incorrectly. I have no idea about the internals of zarr, kerchunk, and whatever else is involved, but I will leave this open for now, as I am curious what the culprit may be.

Interestingly, a different dataset that has spatial chunking and a time chunk size of 1 works just fine (in that case, the spatial chunks do divide the dimensions evenly), so I suspect it comes down to the chunk size not evenly dividing the dimension size?
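
For reference, a sketch of the workaround (hypothetical output file names; it rewrites the sources with a time chunk size of 1 before generating references, assuming the 2x2 lat/lon extent from the reproducer):

import xarray as xr

# Rewrite each source file with a time chunk size of 1 so that no
# incomplete chunks can occur along the concatenation axis
for year in (2003, 2004):
    ds = xr.open_dataset(f"era5_tasmax_{year}.nc")
    ds.to_netcdf(
        f"era5_tasmax_{year}_rechunked.nc",
        encoding={"tasmax": {"chunksizes": (1, 2, 2)}},
    )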

martindurant commented 9 months ago

If I understand correctly, you are hitting a limitation of zarr: all chunks must have the same shape (except the last along any given axis). This means that, if each constituent dataset has an incomplete last chunk, you cannot combine them. ZEP003 is my effort to address this, and I have a POC implementation, but there is no movement yet on getting it into zarr-python.
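
To make the offset arithmetic concrete for the files above (a sketch of the effect, assuming a time chunk size of 31 and 365 time steps in the first file):

# Each file is chunked as (31, 2, 2) along (time, lat, lon)
chunk = 31
len_2003 = 365

# 2003 needs ceil(365 / 31) = 12 chunk slots, but its last chunk holds
# only 365 - 11 * 31 = 24 time steps
chunks_2003 = -(-len_2003 // chunk)  # 12

# A zarr chunk grid assumes every chunk is full, so the first chunk of
# 2004 would land at logical offset 12 * 31 = 372 instead of 365: the
# 2004 data appears shifted, and references that would fall past the end
# of the 731-step axis go missing, consistent with the NaNs observed above
print(chunks_2003 * chunk - len_2003)  # 7 (days of shift)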

JoelJaeschke commented 9 months ago

Hey @martindurant, thanks a lot for your input! Now that I think about it, the limitation does make sense to some extent, as this is likely not what zarr was designed for initially.

Is this documented somewhere in the kerchunk docs? If not, I would be happy to add it: I consider it something of a footgun that is worth knowing about if you are in control of creating the source datasets yourself.