JoelJaeschke opened 9 months ago
Answering my own question in part: by changing the chunk size of the time dimension to 1, everything suddenly works as expected. My guess is that since the chunk size of 31 does not evenly divide the 365/366 days of a year, something goes wrong in the created zarr file such that chunk offsets are calculated incorrectly. I have absolutely no idea about the internals of zarr, kerchunk, and whatever else is involved, but I will leave this open for now, as I am curious what the culprit may be.
Interestingly, a different dataset that has spatial chunking and a time chunk size of 1 works just fine (even though in that case the spatial chunks do divide the dimensions evenly), so I suspect it comes down to the chunk size not evenly dividing the dimension sizes.
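To make the suspicion above concrete, here is a small plain-Python sketch (no zarr involved) of the chunk layout a 31-day chunk size produces along each year's time axis. The helper `chunk_sizes` is hypothetical, just for illustration:

```python
# Chunk layout arithmetic for a time axis chunked at 31 entries:
# 31 does not divide 365 or 366, so each year ends in a partial chunk.
def chunk_sizes(length, chunk):
    """Sizes of the successive chunks along one axis of the given length."""
    return [min(chunk, length - start) for start in range(0, length, chunk)]

print(chunk_sizes(365, 31))  # 2003: eleven full chunks of 31, then a partial chunk of 24
print(chunk_sizes(366, 31))  # 2004 (leap year): partial last chunk of 25
```

Both years end in an "incomplete" last chunk, and the two partial chunks even have different sizes, which is exactly the situation a uniform chunk grid cannot represent.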
If I understand correctly, you are hitting the limitation with zarr: all chunks must have the same dimensions (except the last in any given axis). This means that, if each constituent dataset has an incomplete last chunk, you cannot combine them. ZEP003 is my effort to address this, and I have a POC implementation, but no movement yet on getting it into zarr-python.
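As a sketch of why the incomplete last chunk blocks combination: zarr addresses chunk `k` along an axis as covering the index range `[k*chunk, (k+1)*chunk)`. The arithmetic below (illustration only, not zarr API code) shows that with a chunk size of 31, the start of 2004 in a combined array does not land on that uniform grid:

```python
# Zarr assumes chunk k along an axis covers indices [k*chunk, (k+1)*chunk).
chunk = 31
days_2003 = 365

# 2003's last chunk is incomplete (24 entries instead of 31)...
assert days_2003 % chunk != 0

# ...so the first index of 2004 in a combined array is not chunk-aligned:
misalignment = days_2003 % chunk
print(misalignment)  # 24 -> every 2004 chunk would straddle two grid cells
```

Reads past the 2003/2004 boundary then resolve to chunk offsets that do not match the byte ranges in the second file, which is consistent with the shifted and NaN values reported above.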
Hey @martindurant, thanks a lot for your input! Now that I think about it, the limitation does make sense to some extent, as that was likely not what zarr was initially designed for.
Is this documented somewhere in the kerchunk docs? If not, I would be happy to add it, as I consider it something of a footgun that is good to know about if you are actually in control of creating the datasets yourself.
Hey 👋, first of all, thanks for this awesome project! It really makes working with large collections of data so much easier, and I greatly appreciate the effort!
Unfortunately, I am currently facing an issue using kerchunk to create an aggregated view over ERA5 reanalysis data. I have attached a script and two files to serve as a minimal reproducer. I strongly suspect I am using kerchunk wrong; however, I had a hard time figuring out exactly what I am doing wrong.
The minimal example contains two netCDF files for maximum daily temperature (tasmax, internally chunked/stored as (31, 2, 2) for the (time, lat, lon) dimensions, respectively), and I am following the example from the Quickstart. I create single-file references using `SingleHdf5ToZarr` and then merge them using `MultiZarrToZarr`. However, when I then open my merged file, the values for 2003 are correct, but the values for 2004 appear shifted or even missing (values after 2004-12-08 are all NaN, even though they are present in the original file).

There is one place where I deviate from the example code, namely using `coo_map={"time": "cf:time"}` instead of `concat_dims=["time"]`. When I use the latter, my time dimension ranges from 2003-01-01 to 2004-01-01. I suspect this happens because both source files have time units of `days since <year>-01-01`, meaning the time values look like `range(0, 365)` and `range(0, 366)` for 2003 and 2004 (a leap year), respectively. Therefore, if this axis is not parsed with cftime right away, only the underlying integer values (days since the datum) are assigned, causing my time dimension to have only 366 entries. However, when I convert the time axes to share the same start datum, i.e. both years have units of `days since 2003-01-01 00:00:00`, I can use `concat_dims=["time"]` directly, yet my underlying issue still exists, so I suppose this is not the problem.

I would greatly appreciate it if you could point me in a direction for solving my problem.
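To illustrate the `concat_dims=["time"]` behaviour described above: because both files store time as integer days relative to their own year's start datum, concatenating the raw values collapses the two years onto one. A minimal sketch of that arithmetic (plain Python, no kerchunk involved):

```python
# Raw time values as stored in the two files, each relative to its own
# "days since <year>-01-01" datum.
t2003 = list(range(365))   # 0..364
t2004 = list(range(366))   # 0..365 (leap year)

# Merging without cftime decoding operates on the raw integers,
# so the combined coordinate only spans one year's worth of values.
merged = sorted(set(t2003 + t2004))
print(len(merged))  # 366, not 731
```

This matches the observation that the merged time axis runs only from 2003-01-01 to 2004-01-01 when the units are not decoded first.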
reproducer.zip
Code from reproducer:
I am using the following libraries: