developmentseed / tile-benchmarking

Repo for configuring datasets and tests for benchmarking with a dynamic tiler
https://developmentseed.org/tile-benchmarking

Investigate if it is possible to avoid reading all coordinate chunks when opening a dataset with xarray #34

Closed: abarciauskas-bgse closed this 1 year ago

abarciauskas-bgse commented 1 year ago

Right now, xarray's open_zarr and open_dataset are significantly slower when coordinates are chunked, because every coordinate chunk triggers a separate request to S3.

Is it possible to either:

  1. create a Zarr store in which the coordinates are not chunked, or
  2. open a dataset that has chunked coordinates without fetching all the chunks?

Note: I tried decode_coords=False and the same issue occurs.

Related:
https://github.com/pydata/xarray/issues/6633
https://github.com/pydata/xarray/pull/7368
https://discourse.pangeo.io/t/puzzling-s3-xarray-open-zarr-latency/1074/11

From @maxrjones:

There is a case in which the data are chunked along a dimension but the coordinates are not. This is what we did for the CMIP6-downscaling pyramids, so that the coordinates are fetched with one request while only specific chunks of the data are fetched, e.g.:

import zarr

# Open the pyramid group on S3 read-only; only metadata is read here.
store = zarr.open("s3://carbonplan-cmip6/flow-outputs/results/0.1.9/pyramid/01df7816c64b3999/0/", mode="r")
print(f'tasmin chunks: {store["tasmin"].chunks}')  # data variable: chunked
print(f'time chunks: {store["time"].chunks}')      # coordinate: one chunk

tasmin chunks: (25, 128, 128)
time chunks: (1020,)
abarciauskas-bgse commented 1 year ago

I've opened an issue about this in pangeo-forge: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/554

But for now I will probably use this notebook to produce Zarr stores: https://github.com/developmentseed/tile-benchmarking/blob/feat/dont_chunk_coordinates/profiling/cmip6_zarr/rechunking.ipynb
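The core of that workaround can be sketched as a small helper that builds an xarray encoding dict putting every 1-D coordinate into a single chunk before rewriting a store. This is a hypothetical helper written for illustration; the linked notebook may differ in its details.

```python
import numpy as np
import xarray as xr


def single_chunk_coord_encoding(ds: xr.Dataset) -> dict:
    """Return an encoding dict that stores every 1-D coordinate in one chunk.

    (Hypothetical helper, not taken from the linked notebook.)
    """
    return {
        name: {"chunks": (coord.sizes[coord.dims[0]],)}
        for name, coord in ds.coords.items()
        if coord.ndim == 1
    }


# Tiny demo dataset (names are illustrative).
demo = xr.Dataset(
    {"tasmin": (("time", "x"), np.zeros((10, 4)))},
    coords={"time": np.arange(10), "x": np.arange(4)},
)
enc = single_chunk_coord_encoding(demo)
print(enc)  # {'time': {'chunks': (10,)}, 'x': {'chunks': (4,)}}
```

Passing the result to ds.to_zarr(path, mode="w", encoding=...) would leave data-variable chunking untouched while unchunking the coordinates.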

abarciauskas-bgse commented 1 year ago

Closing this for now, as the issue lies with pangeo-forge and is being worked on in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/556