bouweandela opened 1 year ago
I looked into fixing this, but would like some input on what the desired chunks should be. Any thoughts? @SciTools/iris-devs
Some options:
1) Iris default chunks (example in #5712)
2) Dask default chunks (see the sketch after this list)
3) The chunks of the cube data variable
4) Offer the user some control over the chunks, e.g. through the (currently NetCDF-specific) iris.fileformats.netcdf.loader.CHUNK_CONTROL
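To make option 2 concrete, here is a minimal sketch of what Dask's "auto" chunking would produce. The array shape is an assumption, chosen to resemble a 4D derived coordinate; Dask aims for chunks close to its configured array.chunk-size (128 MiB by default):

```python
import numpy as np
import dask.array as da

# Assumed shape: a decade of monthly data on model levels and a
# 1-degree grid. Purely illustrative, not taken from the issue.
shape = (120, 32, 180, 288)

# Dask's default ("auto") chunking, used when no explicit chunks are
# given; it targets the configured array.chunk-size.
chunks = da.core.normalize_chunks("auto", shape=shape, dtype=np.float32)
print(chunks)
```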
@SciTools/peloton
We're concerned that this change might be detrimental to other workflows that were otherwise fine. Iris 3.8, which is due to be released soon (https://github.com/SciTools/iris/discussions/5363), includes a CHUNK_CONTROL context manager (https://scitools-iris.readthedocs.io/en/latest/further_topics/netcdf_io.html#chunk-control) which, when used to chunk the original coordinates, should help avoid this issue.
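As a rough illustration of that workaround, a minimal sketch (untested; the formula-term variable name "ps" and the chunk size are assumptions for this particular file):

```python
import iris
from iris.fileformats.netcdf.loader import CHUNK_CONTROL

# Chunk a formula-term variable (assumed here to be the surface air
# pressure variable "ps") so that the derived air_pressure coordinate
# inherits sensible chunks along the time dimension.
with CHUNK_CONTROL.set("ps", time=12):
    cube = iris.load_cube(
        "clw_Amon_FGOALS-f3-L_historical_r1i1p1f1_gr_196001-196912.nc"
    )

print(cube.coord("air_pressure").core_points().chunks)
```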
Are you referring to the example implementation in https://github.com/SciTools/iris/pull/5712?
An alternative could be to assign chunks to the input coordinates of the derivation at load time, so that the derived variable ends up with reasonably sized chunks. The input coordinates would then have rather small chunks, which would be inconvenient for anyone wanting to work with them directly, but maybe that is not a common enough scenario to be a real problem.
I prefer to fix this issue on the Iris side and not leave it to the user, as the current behaviour is unlikely to produce working results.
When loading a file that contains an auxiliary coordinate that can be computed using a formula term, the auxiliary coordinate ends up having huge chunks. This leads to memory issues when trying to use such coordinates, as the Dask workers will run out of memory and get killed.
Example
Open the file clw_Amon_FGOALS-f3-L_historical_r1i1p1f1_gr_196001-196912.nc and list the chunks of the computed coordinate:
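A minimal sketch of one way to do this (the coordinate name is taken from the text below):

```python
import iris

cube = iris.load_cube(
    "clw_Amon_FGOALS-f3-L_historical_r1i1p1f1_gr_196001-196912.nc"
)

# The derived coordinate's points are lazy; inspect their dask chunks.
coord = cube.coord("air_pressure")
print(coord.core_points().chunks)
```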
i.e. no chunking is applied along the time dimension at all, and the 'air_pressure' coordinate has a chunk size of 1.5 GB. For performance it would be best if the coordinate had the same chunks as the data of the cube.

ncdump -hs of the file: