angus-g opened this issue 4 years ago
From an email thread with @jmunroe and @aekiss
When discussing chunking with xarray and netCDF it is easy to get confused, because there are two levels: on-disk (netCDF file-level) chunking and xarray (dask) chunking. The netCDF chunk size presumably has some optimum based on Lustre, which I don't know or understand at pretty much any level beyond hand-waving.
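For concreteness, the two levels can be inspected separately. A minimal sketch, assuming a hypothetical file `ocean.nc` with a variable `temp`:

```python
import netCDF4
import xarray as xr

# On-disk chunking is a property of the netCDF file itself.
with netCDF4.Dataset("ocean.nc") as nc:
    print(nc.variables["temp"].chunking())  # e.g. [1, 7, 300, 300], or 'contiguous'

# dask chunking is chosen when the file is opened through xarray,
# independently of (though ideally aligned with) the on-disk chunks.
ds = xr.open_dataset("ocean.nc", chunks={"time": 12})
print(ds["temp"].chunks)
```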
The xarray (dask) chunking is what determines the size of the task graph. Currently the cookbook library automatically matches the dask chunking to the on-disk chunking, because it is difficult to think of a better default.
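In effect the default amounts to something like the following sketch (not the cookbook's actual code; the chunk values and dimension names are hypothetical stand-ins for what is recorded in the database):

```python
import xarray as xr

# Hypothetical chunk sizes, standing in for the on-disk values saved in the database.
saved_chunks = {"time": 1, "st_ocean": 15, "yt_ocean": 300, "xt_ocean": 400}

# Passing them straight through means each dask chunk maps onto one on-disk chunk.
ds = xr.open_dataset("ocean.nc", chunks=saved_chunks)
print(ds["temp"].chunks)
```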
It does make some sense to have the netCDF chunk size smaller than what might be the optimum size for xarray calculations, since different calculations prefer different xarray chunking along different dimensions (think time-series analysis vs. spatial analysis). If the netCDF chunk size were large in all dimensions, this would lead to a lot of unnecessary IO, unless dask/xarray does some nifty caching or can re-order operations based on IO access patterns. I know less than nothing about that.
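As a sketch of the two access patterns (dimension names are hypothetical, as above):

```python
import xarray as xr

ds = xr.open_dataset("ocean.nc", chunks={"time": 1})
temp = ds["temp"]

# Spatial analysis: whole horizontal slices per chunk, split along time.
spatial = temp.chunk({"time": 1, "yt_ocean": -1, "xt_ocean": -1})

# Time-series analysis: keep the time axis contiguous, split horizontally instead.
timeseries = temp.chunk({"time": -1, "yt_ocean": 300, "xt_ocean": 300})
```

Rechunking only changes the dask layout; reads still happen in units of the on-disk chunks, which is why on-disk chunks that are large in every dimension would drag in a lot of data that some analyses never touch.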
Looking into the interplay of on-disk/dask chunking and `auto` chunking is a great use case for standard analyses, cf. #210.

Related: #194
We currently save the on-disk chunk sizes in the database, and tell xarray to use these when loading data. @jmunroe pointed out that these chunks can be quite small, leading to a large number of tasks in the graph for many computations. Some thoughts:

- `xarray` with `auto` chunk sizes (I'm not sure if this takes the on-disk chunk boundaries into account, which would lead to rechunking churn; see the sketch below)
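A rough way to compare the two defaults (hypothetical file and variable names; recent xarray documents `chunks={}` as "use the backend's preferred, i.e. on-disk, chunks" and `chunks="auto"` as letting dask pick sizes near its default target, reportedly taking the preferred chunks into account, but that is worth verifying against the version in use):

```python
import numpy as np
import xarray as xr

# Both opens are lazy (dask-backed).
per_disk = xr.open_dataset("ocean.nc", chunks={})      # one dask chunk per on-disk chunk
auto = xr.open_dataset("ocean.nc", chunks="auto")      # dask targets its default chunk size (~128 MiB)

for label, ds in [("on-disk", per_disk), ("auto", auto)]:
    da = ds["temp"]
    nblocks = int(np.prod(da.data.numblocks))
    print(f"{label:8s} chunks={da.chunks} -> {nblocks} blocks (tasks just to read the data)")
```

If the `auto` chunks do not line up with the on-disk boundaries, the smaller task graph comes at the cost of each dask chunk touching several on-disk chunks (the "rechunking churn" mentioned above).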