COSIMA / cosima-cookbook

Framework for indexing and querying ocean-sea ice model output.
https://cosima-recipes.readthedocs.io/en/latest/
Apache License 2.0

Possibly non-optimal chunk sizes? #213

Open angus-g opened 4 years ago

angus-g commented 4 years ago

We currently save the on-disk chunk sizes in the database, and tell xarray to use these when loading data. @jmunroe pointed out that these chunks can be quite small, leading to a large number of tasks in the graph for many computations. Some thoughts:

aidanheerdegen commented 4 years ago

From an email thread with @jmunroe and @aekiss

When discussing chunking with xarray and netCDF, it is very confusing because there is both on-disk (netCDF file level) chunking and xarray (dask) chunking. The netCDF chunk size presumably has some optimum based on Lustre, which I don't know or understand at pretty much any level beyond hand-waving.

The xarray (dask) chunking is what affects the size of the task graph. Currently the cookbook library automatically matches the dask chunking to the on-disk chunking because it is difficult to think of a better default.
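To make the on-disk/dask distinction concrete, here is a rough sketch (not the cookbook's actual implementation) of what "match the dask chunking to the on-disk chunking" amounts to; the file path and variable name are placeholders:

```python
import netCDF4
import xarray as xr

path = "output000/ocean/ocean.nc"  # hypothetical file path

# Read the on-disk (HDF5-level) chunk shape straight from the netCDF file.
with netCDF4.Dataset(path) as nc:
    var = nc.variables["temp"]      # hypothetical variable name
    disk_chunks = var.chunking()    # e.g. [1, 7, 300, 360], or "contiguous"
    dims = var.dimensions           # e.g. ("time", "st_ocean", "yt_ocean", "xt_ocean")

# Hand the same shape to xarray so each dask chunk maps onto one disk chunk.
chunks = dict(zip(dims, disk_chunks)) if disk_chunks != "contiguous" else None
ds = xr.open_dataset(path, chunks=chunks)
print(ds["temp"].data.chunksize)
```

With small on-disk chunks, one dask task per disk chunk is exactly what inflates the task graph.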

It does make some sense for the netCDF chunk size to be smaller than the optimum size for xarray calculations: different calculations prefer different xarray chunking along different dimensions (think time series analysis vs spatial analysis), and if the netCDF chunks were large in all dimensions this would lead to a lot of unnecessary IO, unless dask/xarray does some nifty caching or can re-order operations based on IO access patterns. I know less than nothing about that.
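As an illustration of how different analyses prefer different dask chunking, a hedged sketch: the dimension names (time, yt_ocean, xt_ocean) and variable temp are assumptions about MOM5-style output, and chunks={} assumes a recent xarray where it means "use the file's own chunking".

```python
import xarray as xr

# Open dask-backed, with dask chunks matching the on-disk chunks.
ds = xr.open_dataset("output000/ocean/ocean.nc", chunks={})  # hypothetical path

# Time-series analysis at a point/region: long chunks in time, small in space.
ts_view = ds["temp"].chunk({"time": -1, "yt_ocean": 100, "xt_ocean": 100})

# Spatial analysis at a single time: one time step per chunk, whole horizontal slab.
spatial_view = ds["temp"].chunk({"time": 1, "yt_ocean": -1, "xt_ocean": -1})
```

Note that rechunking in memory like this still reads the data in its on-disk chunks first, which is where the unnecessary IO concern above comes in.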

Looking into the interplay of on-disk/dask chunking and auto chunking is a great use case for standard analyses, cf. #210
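For reference, recent xarray/dask versions already offer a form of auto chunking at open time: with chunks="auto", dask picks chunk sizes up to its array.chunk-size target, generally as multiples of the file's on-disk chunks. A rough sketch (path and variable name are placeholders):

```python
import dask
import xarray as xr

# Let dask choose chunk sizes up to a target size, rather than
# mapping one dask chunk onto each (possibly tiny) disk chunk.
with dask.config.set({"array.chunk-size": "256 MiB"}):
    ds = xr.open_dataset("output000/ocean/ocean.nc", chunks="auto")

print(ds["temp"].data.chunksize)  # hypothetical variable name
```

Whether these auto chunks behave well for the standard analyses is presumably what needs testing.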

aidanheerdegen commented 4 years ago

Related #194