developmentseed / tile-benchmarking

Repo for configuring datasets and tests for benchmarking with a dynamic tiler
https://developmentseed.org/tile-benchmarking

Investigate if it is possible to avoid reading all coordinate chunks when opening a dataset with xarray #34

Closed: abarciauskas-bgse closed this 1 year ago

abarciauskas-bgse commented 1 year ago

Right now, xarray's open_zarr and open_dataset are significantly slower when coordinates are chunked, because every coordinate chunk triggers a separate request to S3.

Is it possible to either:

  1. create a Zarr store in which the coordinates are not chunked, or
  2. open a dataset that has chunked coordinates without fetching all the chunks?

Note: I tried decode_coords=False and the same issue occurs.

Related:
https://github.com/pydata/xarray/issues/6633
https://github.com/pydata/xarray/pull/7368
https://discourse.pangeo.io/t/puzzling-s3-xarray-open-zarr-latency/1074/11

From @maxrjones:

There is a case in which the data are chunked along a dimension but the coordinates are not. This is what we did for the CMIP6-downscaling pyramids, so that the coordinates are fetched with one request while only specific chunks of the data are fetched, e.g.:

import zarr

# Open the pyramid group on S3 read-only; only metadata is read here.
store = zarr.open("s3://carbonplan-cmip6/flow-outputs/results/0.1.9/pyramid/01df7816c64b3999/0/", mode="r")
print(f'tasmin chunks: {store["tasmin"].chunks}')  # data variable: chunked
print(f'time chunks: {store["time"].chunks}')      # coordinate: one chunk

tasmin chunks: (25, 128, 128)
time chunks: (1020,)
abarciauskas-bgse commented 1 year ago

I've opened an issue about this in pangeo-forge: https://github.com/pangeo-forge/pangeo-forge-recipes/issues/554

But for now I will probably use this notebook to produce Zarr stores: https://github.com/developmentseed/tile-benchmarking/blob/feat/dont_chunk_coordinates/profiling/cmip6_zarr/rechunking.ipynb
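The core of that workaround can be sketched as a small helper that builds an xarray encoding dict putting every 1-D coordinate into a single chunk before rewriting a store. This is a hypothetical helper written for illustration; the linked notebook may differ in its details.

```python
import numpy as np
import xarray as xr


def single_chunk_coord_encoding(ds: xr.Dataset) -> dict:
    """Return an encoding dict that stores every 1-D coordinate in one chunk.

    (Hypothetical helper, not taken from the linked notebook.)
    """
    return {
        name: {"chunks": (coord.sizes[coord.dims[0]],)}
        for name, coord in ds.coords.items()
        if coord.ndim == 1
    }


# Tiny demo dataset (names are illustrative).
demo = xr.Dataset(
    {"tasmin": (("time", "x"), np.zeros((10, 4)))},
    coords={"time": np.arange(10), "x": np.arange(4)},
)
enc = single_chunk_coord_encoding(demo)
print(enc)  # {'time': {'chunks': (10,)}, 'x': {'chunks': (4,)}}
```

Passing the result to ds.to_zarr(path, mode="w", encoding=...) would leave data-variable chunking untouched while unchunking the coordinates.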

abarciauskas-bgse commented 1 year ago

Closing this for now, as the issue lies with pangeo-forge and is being worked on in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/556