Open norlandrhagen opened 1 year ago
TiffToZarr does indeed work, but because the data is actually just an array rather than a group of arrays, it is not valid input for xarray. Zarr opens it just fine, including the attributes.
We could have TiffToZarr produce zarr groups:
--- a/kerchunk/tiff.py
+++ b/kerchunk/tiff.py
@@ -64,6 +64,9 @@ def tiff_to_zarr(urlpath, remote_options=None, target=None, target_options=None)
if isinstance(v, enum.EnumMeta):
meta[k] = v._name_
out[".zattrs"] = ujson.dumps(meta)
+ # make into group
+ out = {"data/" + k: v for k, v in out.items()}
+ out[".zgroup"] = '{"zarr_format": 2}'
if "GTRasterTypeGeoKey" in meta:
# TODO: make dataset and assign coords for geoTIFF
which makes it loadable by xarray. Whether the attributes belong to the group or the one array is another matter.
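To make the effect of the patch concrete, here is a minimal standalone sketch of the same key rewrite applied to a plain dictionary of references. The function name `wrap_array_as_group`, the member name `data`, and the toy reference entries are assumptions for illustration only; the real change lives inside `tiff_to_zarr` as shown in the diff.

```python
import json


def wrap_array_as_group(refs, name="data"):
    """Return a new mapping in which a root-level array becomes a group member.

    `refs` is assumed to be a flat dict of Zarr v2 keys (".zarray", ".zattrs",
    chunk keys) describing a single array at the root of the store.
    """
    # Move every key of the array under "<name>/".
    out = {f"{name}/{key}": value for key, value in refs.items()}
    # Mark the root of the store as a Zarr v2 group.
    out[".zgroup"] = json.dumps({"zarr_format": 2})
    return out


# Toy references (hypothetical): abbreviated metadata plus one chunk,
# given in the [url, offset, length] form.
array_refs = {
    ".zarray": json.dumps({"shape": [2, 2], "chunks": [2, 2], "dtype": "<u1"}),
    ".zattrs": json.dumps({"ImageWidth": 2}),
    "0.0": ["example.tif", 8, 4],
}
group_refs = wrap_array_as_group(array_refs)
print(sorted(group_refs))
```

Note that in this sketch, as in the patch, the attributes move with the array rather than staying on the group.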
There is also code in kerchunk.tiff to make the X/Y coordinates, but we do NOT want to save these (each would be bigger than the original data file); instead, we want to use xarray flexible indexes to generate them on demand at runtime. This would be a perfect project for you, @norlandrhagen, if you are willing.
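As a rough illustration of why the coordinates need not be stored: for a regular grid, each axis is fully determined by three numbers. The sketch below is not the code in kerchunk.tiff and does not use xarray's flexible indexes; the function name `axis_coords`, the pixel-centre convention, and the assumption of a north-up raster with no rotation are all hypothetical simplifications.

```python
def axis_coords(origin, pixel_size, count):
    """Pixel-centre coordinates along one axis of an unrotated regular grid.

    origin     -- coordinate of the outer edge of the first pixel (assumed)
    pixel_size -- signed step between pixels (negative for a north-up Y axis)
    count      -- number of pixels along the axis
    """
    return [origin + (i + 0.5) * pixel_size for i in range(count)]


# Hypothetical 4 x 3 raster with its upper-left corner at (100.0, 50.0)
# and 10-unit pixels.
x = axis_coords(100.0, 10.0, 4)
y = axis_coords(50.0, -10.0, 3)
print(x, y)
```

Only the origin, step and size have to be kept in the metadata; the full coordinate arrays can be rebuilt whenever they are needed.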
> Whether the attributes belong to the group or the one array is another matter.
I see rioxarray puts these into a separate no-data variable called "spatial_ref"
Furthermore, you see that for the "test" dataset, you already have subdirectories, because this is a multiscale pyramid; I suppose this would be loaded by xarray-datatree?
Alternatively, we could implement the ability for xarray to open a single zarr array (rather than a group) using xr.open_dataarray.
Kerchunk is happy to provide datasets that match xarray's expectations, which should provide faster turnaround in this kind of situation. Unless, of course, we are of the opinion that the restriction results in a dataset that doesn't accurately reflect the true nature of the data.
@norlandrhagen , were you planning to put my suggested fix into action?
There is no logical reason why Xarray should not be able to read a single Zarr array. It can already read a single array from a netCDF file (see open_dataarray).
> Kerchunk is happy to provide datasets that match xarray's expectations, which should provide faster turnaround in this kind of situation
I agree that Kerchunk can provide a shim around almost any type of data by massaging it into a structure compatible with Xarray. That doesn't mean that is the best software architecture. This approach causes tons of weird special cases to be implemented in Kerchunk.
I would advocate for addressing this issue upstream. A "fast turnaround" is not always the most important factor. On behalf of the xarray core devs, we would be happy to accept a PR to implement this and review it in a timely manner.
@norlandrhagen has been working on handling the coordinates using Xarray, which would work with the strategy to read single Zarr arrays with Xarray.
> On behalf of the xarray core devs, we would be happy to accept a PR to implement this and review it in a timely manner.
Do you have any expectations regarding the difficulty for this task? I'd be interested in helping out but don't want to overcommit.
> Do you have any expectations regarding the difficulty for this task?

Since open_dataarray currently calls open_dataset under the hood, this would probably involve some non-trivial refactoring of xarray's backend code. There are different design options that would need to be explored.
Perhaps 5 days of work for a dev already familiar with the xarray backend code? For someone new to xarray, probably much longer.
I have just had a look and come to the same conclusion. It may be possible to hack the code in ZarrStore.open_group to make an array look like a one-member group, but it would be ugly and I'm not immediately sure of how one might go about it without kerchunk-style indirection.
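For readers wondering what "make an array look like a one-member group" could mean at the key level, here is a minimal sketch of a read-only mapping view that does it lazily, without copying the underlying store. It does not touch `ZarrStore.open_group` or any xarray code; the class name `OneMemberGroupView`, the member name `data`, and the Zarr v2 key layout are assumptions for illustration.

```python
from collections.abc import Mapping


class OneMemberGroupView(Mapping):
    """Hypothetical read-only view presenting an array store as a one-member group."""

    def __init__(self, array_store, name="data"):
        self._store = array_store
        self._prefix = name + "/"

    def __getitem__(self, key):
        # Synthesize the group marker at the root.
        if key == ".zgroup":
            return b'{"zarr_format": 2}'
        # Redirect "<name>/<key>" to "<key>" in the wrapped array store.
        if key.startswith(self._prefix):
            return self._store[key[len(self._prefix):]]
        raise KeyError(key)

    def __iter__(self):
        yield ".zgroup"
        for key in self._store:
            yield self._prefix + key

    def __len__(self):
        return len(self._store) + 1


# Toy array store (hypothetical, abbreviated metadata).
array_store = {".zarray": b'{"shape": [2]}', "0": b"\x00\x01"}
view = OneMemberGroupView(array_store)
print(sorted(view))
```

This is the same kind of indirection as rewriting the references, only done at access time, which is part of why it would be awkward to bolt onto an existing backend.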
I guess some other backends also effectively make Datasets out of arrays even when the backend format is not really hierarchical?
> I guess some other backends also effectively make Datasets out of arrays even when the backend format is not really hierarchical?
Rioxarray comes to mind here. They have defined a convention for how to take geotiffs and represent them as Xarray Datasets. IMO that would also fit better as a DataArray, not Dataset.
https://github.com/carbonplan/xrefcoord implements @martindurant's idea to generate the coordinates on demand. https://projectpythia.org/kerchunk-cookbook/notebooks/case_studies/GeoTIFF_FMI.html shows this in action. Any feedback would be welcome!
Hi there @martindurant,
I've started playing around with Kerchunk's tiff_to_zarr functionality and ran into an issue when trying to open up the reference file. In short, the tiff_to_zarr works successfully, but I get the error: ContainsArrayError: path '' contains an array. Hopefully this is a simple user error! I've successfully used tiff_to_zarr on the kerchunk/tests/lcmap_tiny_cog_2019.tif and this Arctic DEM without incident.
Thanks in advance!
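The error message says the root path of the store holds an array where a group was expected. A quick way to check which case a set of references falls into is to look for the Zarr v2 marker keys at the root. The helper below is a hedged sketch; the name `root_kind` is hypothetical and it only inspects key names, not their contents.

```python
def root_kind(refs):
    """Classify the root of a Zarr v2 style key mapping as 'group', 'array' or 'unknown'."""
    # Version 1 reference files nest the keys under "refs"; a flat dict is used as-is.
    keys = refs.get("refs", refs)
    if ".zgroup" in keys:
        return "group"
    if ".zarray" in keys:
        return "array"
    return "unknown"


# Toy examples (hypothetical contents).
print(root_kind({".zarray": "{}", "0.0": ["example.tif", 8, 4]}))
print(root_kind({"version": 1, "refs": {".zgroup": "{}", "data/.zarray": "{}"}}))
```

A result of `array` means the references describe a bare array at the root, which is the case that group-oriented readers reject.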
I've included a code snippet below and an example of the generated reference file.
Generation script:
examples of refs for both datasets:
Hansen:
Kerchunk test tiff:
repr's of both ref datasets:
Traceback: