fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Issue using `tiff_to_zarr` #313

Open norlandrhagen opened 1 year ago

norlandrhagen commented 1 year ago

Hi there @martindurant,

I've started playing around with Kerchunk's `tiff_to_zarr` functionality and ran into an issue when trying to open the resulting reference file. In short, `tiff_to_zarr` runs successfully, but opening the references fails with `ContainsArrayError: path '' contains an array`. Hopefully this is a simple user error! I've successfully used `tiff_to_zarr` on kerchunk/tests/lcmap_tiny_cog_2019.tif and this Arctic DEM without incident.

Thanks in advance!

I've included a code snippet below and an example of the generated reference file.

Generation script:


```python
import fsspec
import xarray as xr
import ujson
import imagecodecs.numcodecs

imagecodecs.numcodecs.register_codecs()

from kerchunk.tiff import tiff_to_zarr

hansen_url = 'https://storage.googleapis.com/earthenginepartners-hansen/GFC-2021-v1.9/Hansen_GFC-2021-v1.9_gain_50N_130W.tif'
ex_path = 'https://github.com/fsspec/kerchunk/blob/main/kerchunk/tests/lcmap_tiny_cog_2019.tif?raw=true'

hansen_ds = xr.open_dataset(hansen_url, engine='rasterio')

ex_ds = xr.open_dataset(ex_path, engine='rasterio')

zt = tiff_to_zarr(hansen_url)
with open("hansen.json", "wb") as f:
    f.write(ujson.dumps(zt).encode())

m = fsspec.get_mapper("reference://", fo='hansen.json')
hansen_kerchunk_ds = xr.open_dataset(
    m, engine="zarr", backend_kwargs={"consolidated": False}
)
```

Examples of the generated refs for both datasets:

Hansen:

```json
{
    ".zattrs": "{\"_ARRAY_DIMENSIONS\":[\"Y\",\"X\"],\"KeyDirectoryVersion\":1,\"KeyRevision\":1,\"KeyRevisionMinor\":0,\"GTModelTypeGeoKey\":2,\"GTRasterTypeGeoKey\":1,\"GeographicTypeGeoKey\":4326,\"GeogCitationGeoKey\":\"WGS 84\",\"GeogAngularUnitsGeoKey\":9102,\"GeogSemiMajorAxisGeoKey\":6378137.0,\"GeogInvFlatteningGeoKey\":298.257223563,\"ModelPixelScale\":[0.00025,0.00025,0.0],\"ModelTiepoint\":[0.0,0.0,0.0,-130.0,50.0,0.0]}",
    ".zarray": "{\n \"chunks\": [\n  1,\n  40000\n ],\n \"compressor\": {\n  \"id\": \"imagecodecs_lzw\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  40000,\n  40000\n ],\n \"zarr_format\": 2\n}",
    "0.0": ["https:\/\/storage.googleapis.com\/earthenginepartners-hansen\/GFC-2021-v1.9\/Hansen_GFC-2021-v1.9_gain_50N_130W.tif", 320378, 777],
    "1.0": ["https:\/\/storage.googleapis.com\/earthenginepartners-hansen\/GFC-2021-v1.9\/Hansen_GFC-2021-v1.9_gain_50N_130W.tif", 321155, 796], ...}
```

Kerchunk test tiff:


```json
{
    ".zgroup": "{\n \"zarr_format\": 2\n}",
    ".zattrs": "{\"multiscales\":[{\"datasets\":[{\"path\":\"0\"},{\"path\":\"1\"},{\"path\":\"2\"}],\"metadata\":{},\"name\":\"\",\"version\":\"0.1\"}],\"OVR_RESAMPLING_ALG\":\"NEAREST\",\"LAYOUT\":\"IFDS_BEFORE_DATA\",\"BLOCK_ORDER\":\"ROW_MAJOR\",\"BLOCK_LEADER\":\"SIZE_AS_UINT4\",\"BLOCK_TRAILER\":\"LAST_4_BYTES_REPEATED\",\"KNOWN_INCOMPATIBLE_EDITION\":\"NO\",\"KeyDirectoryVersion\":1,\"KeyRevision\":1,\"KeyRevisionMinor\":0,\"GTModelTypeGeoKey\":1,\"GTRasterTypeGeoKey\":1,\"GTCitationGeoKey\":\"Albers\",\"GeographicTypeGeoKey\":4326,\"GeogCitationGeoKey\":\"WGS 84\",\"GeogAngularUnitsGeoKey\":9102,\"GeogSemiMajorAxisGeoKey\":6378140.0,\"GeogInvFlatteningGeoKey\":298.256999999996,\"ProjectedCSTypeGeoKey\":32767,\"ProjectionGeoKey\":32767,\"ProjCoordTransGeoKey\":11,\"ProjLinearUnitsGeoKey\":9001,\"ProjStdParallel1GeoKey\":29.5,\"ProjStdParallel2GeoKey\":45.5,\"ProjNatOriginLongGeoKey\":-96.0,\"ProjNatOriginLatGeoKey\":23.0,\"ProjFalseEastingGeoKey\":0.0,\"ProjFalseNorthingGeoKey\":0.0,\"ModelPixelScale\":[30.0,30.0,0.0],\"ModelTiepoint\":[0.0,0.0,0.0,-1801185.0,2700405.0,0.0]}",
    "0\/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"Y\",\n  \"X\"\n ]\n}",
    "0\/.zarray": "{\n \"chunks\": [\n  512,\n  512\n ],\n \"compressor\": {\n  \"id\": \"zlib\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  2048,\n  2048\n ],\n \"zarr_format\": 2\n}",
    "1\/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"Y1\",\n  \"X1\"\n ]\n}",
    "1\/.zarray": "{\n \"chunks\": [\n  128,\n  128\n ],\n \"compressor\": {\n  \"id\": \"zlib\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  1024,\n  1024\n ],\n \"zarr_format\": 2\n}",
    "2\/.zattrs": "{\n \"_ARRAY_DIMENSIONS\": [\n  \"Y2\",\n  \"X2\"\n ]\n}",
    "2\/.zarray": "{\n \"chunks\": [\n  128,\n  128\n ],\n \"compressor\": {\n  \"id\": \"zlib\"\n },\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  512,\n  512\n ],\n \"zarr_format\": 2\n}",
    "0\/0.0": ["https:\/\/github.com\/fsspec\/kerchunk\/blob\/main\/kerchunk\/tests\/lcmap_tiny_cog_2019.tif?raw=true", 114079, 13584],
    "0\/0.1": ["https:\/\/github.com\/fsspec\/kerchunk\/blob\/main\/kerchunk\/tests\/lcmap_tiny_cog_2019.tif?raw=true", 127671, 19626], ...}
```

Reprs of both reference datasets:

[screenshots of the xarray reprs for both datasets]

Traceback:

```python
---------------------------------------------------------------------------
ContainsArrayError                        Traceback (most recent call last)
Cell In[49], line 2
      1 m = fsspec.get_mapper("reference://", fo='hansen.json')
----> 2 hansen_kerchunk_ds = xr.open_dataset(
      3     m, engine="zarr", backend_kwargs={"consolidated": False}
      4 )

File ~/opt/anaconda3/envs/install/envs/test_env/lib/python3.9/site-packages/xarray/backends/api.py:526, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, backend_kwargs, **kwargs)
    514 decoders = _resolve_decoders_kwargs(
    515     decode_cf,
    516     open_backend_dataset_parameters=backend.open_dataset_parameters,
    (...)
    522     decode_coords=decode_coords,
    523 )
    525 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 526 backend_ds = backend.open_dataset(
    527     filename_or_obj,
    528     drop_variables=drop_variables,
    529     **decoders,
    530     **kwargs,
    531 )
    532 ds = _dataset_from_backend_dataset(
    533     backend_ds,
    534     filename_or_obj,
...
-> 1442     raise ContainsArrayError(path)
   1443 raise GroupNotFoundError(path)
   1445 elif mode == 'w':

ContainsArrayError: path '' contains an array
```
martindurant commented 1 year ago

`tiff_to_zarr` does indeed work, but because the output describes just a single array rather than a group of arrays, it is not valid input for xarray. Zarr opens it just fine, including the attributes.
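
For reference, a minimal sketch (assuming the hansen.json written by the script above) of opening the same references directly with zarr:

```python
# Minimal sketch: zarr accepts a bare array at the store root,
# which is exactly what these references describe.
import fsspec
import zarr
import imagecodecs.numcodecs

imagecodecs.numcodecs.register_codecs()  # provides the imagecodecs_lzw compressor

m = fsspec.get_mapper("reference://", fo="hansen.json")
arr = zarr.open(m, mode="r")   # returns a zarr.Array, not a zarr Group
print(arr.shape, arr.dtype)    # (40000, 40000), |u1, per the .zarray above
print(dict(arr.attrs))         # the GeoTIFF tags from .zattrs
```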

We could have `tiff_to_zarr` produce zarr groups:

```diff
--- a/kerchunk/tiff.py
+++ b/kerchunk/tiff.py
@@ -64,6 +64,9 @@ def tiff_to_zarr(urlpath, remote_options=None, target=None, target_options=None)
                     if isinstance(v, enum.EnumMeta):
                         meta[k] = v._name_
                 out[".zattrs"] = ujson.dumps(meta)
+    # make into group
+    out = {"data/" + k: v for k, v in out.items()}
+    out[".zgroup"] = '{"zarr_format": 2}'
     if "GTRasterTypeGeoKey" in meta:
         # TODO: make dataset and assign coords for geoTIFF
```

which makes it loadable by xarray. Whether the attributes belong to the group or the one array is another matter.
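
For anyone hitting this before a fix lands, here is a rough user-side sketch of the same idea: rewriting the references into a one-member group without patching kerchunk (assumes the hansen.json from the original post):

```python
# Sketch: prefix the single-array reference keys with "data/" and add a
# .zgroup entry, mirroring the diff above, then open the result with xarray.
import fsspec
import ujson
import xarray as xr
import imagecodecs.numcodecs

imagecodecs.numcodecs.register_codecs()

with open("hansen.json") as f:
    refs = ujson.load(f)

grouped = {"data/" + k: v for k, v in refs.items()}
grouped[".zgroup"] = '{"zarr_format": 2}'

m = fsspec.get_mapper("reference://", fo=grouped)  # fo also accepts an in-memory dict
ds = xr.open_dataset(m, engine="zarr", backend_kwargs={"consolidated": False})
```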

There is also code in kerchunk.tiff to make the X/Y coordinates, but we do NOT want to save these (each would be bigger than the original data file); instead, we want to use xarray's flexible indexes to generate them on demand at runtime. This would be a perfect project for you @norlandrhagen, if you are willing.
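
To make the idea concrete, here is a rough eager sketch, continuing from the grouped-references snippet above, that derives the coordinates from the GeoTIFF tags kerchunk stores in the attributes (ModelTiepoint/ModelPixelScale). The flexible-index version would compute these lazily rather than materialising them, and this ignores any half-pixel raster-type offset:

```python
# Sketch: build X/Y coordinates from the GeoTIFF tie point and pixel scale.
# ModelTiepoint = [i, j, k, x, y, z] anchors pixel (i, j) at model (x, y);
# ModelPixelScale = [sx, sy, sz] is the pixel size in model units.
import numpy as np

da = ds["data"]  # the single array from the grouped references above
i0, j0, _, x0, y0, _ = da.attrs["ModelTiepoint"]
sx, sy, _ = da.attrs["ModelPixelScale"]
ny, nx = da.shape

ds = ds.assign_coords(
    X=("X", x0 + (np.arange(nx) - i0) * sx),
    Y=("Y", y0 - (np.arange(ny) - j0) * sy),  # rows run north to south
)
```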

martindurant commented 1 year ago

Whether the attributes belong to the group or the one array is another matter.

I see rioxarray puts these into a separate no-data variable called "spatial_ref".

Furthermore, you see that for the "test" dataset, you already have subdirectories, because this is a multiscale pyramid; I suppose this would be loaded by xarray-datatree?

rabernat commented 1 year ago

Alternatively, we could implement the ability for xarray to open a single zarr array (rather than a group) using xr.open_dataarray.

martindurant commented 1 year ago

Kerchunk is happy to provide datasets that match xarray's expectations, which should give a faster turnaround in this kind of situation. Unless, of course, we are of the opinion that the restriction results in a dataset that doesn't accurately reflect the true nature of the data.

@norlandrhagen, were you planning to put my suggested fix into action?

rabernat commented 1 year ago

There is no logical reason why Xarray should not be able to read a single Zarr array. It can already read a single array from a netCDF file (see open_dataarray).
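
For comparison, the existing netCDF-side behaviour (the file name here is hypothetical):

```python
import xarray as xr

# open_dataarray returns a DataArray and errors if the file
# holds more than one data variable.
da = xr.open_dataarray("single_variable.nc")
```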

Kerchunk is happy to provide datasets that match xarray's expectations, which should provide faster turnaround in this kind of situation

I agree that Kerchunk can provide a shim around almost any type of data by massaging it into a structure compatible with Xarray. That doesn't mean that is the best software architecture. This approach causes tons of weird special cases to be implemented in Kerchunk.

I would advocate for addressing this issue upstream. A "fast turnaround" is not always the most important factor. On behalf of the xarray core devs, we would be happy to accept a PR to implement this and review it in a timely manner.

maxrjones commented 1 year ago

@norlandrhagen has been working on handling the coordinates using Xarray, which would fit well with the strategy of reading single Zarr arrays with Xarray.

On behalf of the xarray core devs, we would be happy to accept a PR to implement this and review it in a timely manner.

Do you have any expectations regarding the difficulty for this task? I'd be interested in helping out but don't want to overcommit.

rabernat commented 1 year ago

Do you have any expectations regarding the difficulty for this task?

Since open_dataarray currently calls open_dataset under the hood, this would probably involve some non-trivial refactoring of xarray's backend code. There are different design options that would need to be explored.

Perhaps 5 days of work for a dev already familiar with the xarray backend code? For someone new to xarray, probably much longer.

martindurant commented 1 year ago

I have just had a look and come to the same conclusion. It may be possible to hack the code in ZarrStore.open_group to make an array look like a one-member group, but it would be ugly and I'm not immediately sure of how one might go about it without kerchunk-style indirection.

I guess some other backends also effectively make Datasets out of arrays even when the backend format is not really hierarchical?

rabernat commented 1 year ago

I guess some other backends also effectively make Datasets out of arrays even when the backend format is not really hierarchical?

Rioxarray comes to mind here. They have defined a convention for how to take geotiffs and represent them as Xarray Datasets. IMO that would also fit better as a DataArray, not Dataset.
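
For context, a small sketch of the rioxarray route on the same file (hansen_url as in the original post); it already hands back a DataArray:

```python
# Sketch: rioxarray returns a DataArray (band, y, x) for a GeoTIFF, with the
# CRS carried in a "spatial_ref" coordinate rather than in plain attrs.
import rioxarray

hansen_url = "https://storage.googleapis.com/earthenginepartners-hansen/GFC-2021-v1.9/Hansen_GFC-2021-v1.9_gain_50N_130W.tif"
da = rioxarray.open_rasterio(hansen_url)
print(da.dims, da.rio.crs)
```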

maxrjones commented 1 year ago

https://github.com/carbonplan/xrefcoord implements @martindurant's idea to generate the coordinates on demand. https://projectpythia.org/kerchunk-cookbook/notebooks/case_studies/GeoTIFF_FMI.html shows this in action. Any feedback would be welcome!