Some introductory notes can be found in this post on Speeding Up Your Code.
One option might be to have people log in to http://pangeo.pydata.org and then work through one of the examples from https://github.com/pangeo-data/pangeo-example-notebooks by cloning that repo in the Jupyter terminal.
(To get a notebook rather than a JupyterLab environment you need to replace `lab` with `tree` in the URL, e.g. http://pangeo.pydata.org/user/damienirving/tree)
Resources: This NCI notebook from Kate Snow introduces chunking. This tutorial from Scott Wales (see recording) introduces more advanced dask usage.
Possible outline:
Lazy loading, subsetting, intermediate files, looping over depth slices (for instance).
The metadata of an xarray DataArray loaded with open_mfdataset includes the dask chunk size.
The file itself may also be chunked: on-disk chunking is available in netCDF-4 and HDF5 files. CMIP6 data should all be netCDF-4 and include some form of chunking within the file.
You can look at the .encoding attribute of an xarray variable to see information about the file storage.
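For example (a minimal sketch, assuming a hypothetical set of files at /path/to/files/*.nc containing a variable named tas):
import xarray as xr

# Lazily open multiple files; dask chunks are created rather than reading the data into memory
ds = xr.open_mfdataset('/path/to/files/*.nc')  # hypothetical path

# The dask chunk sizes appear in the DataArray repr and its .chunks attribute
print(ds['tas'].chunks)

# .encoding holds file-storage details, e.g. the on-disk (netCDF-4/HDF5) chunk sizes if present
print(ds['tas'].encoding.get('chunksizes'))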
Accessing data across chunks is slower than accessing data along chunks.
Optimal chunk sizes:
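For example, chunk sizes can be set explicitly when opening the data (a sketch only; the path and chunk size here are illustrative, not a recommendation). Aligning the dask chunks with the on-disk chunking, and keeping each chunk up to around a hundred megabytes or so, generally works well:
import xarray as xr

# Request dask chunks of 100 time steps per chunk (hypothetical path and chunk size)
ds = xr.open_mfdataset('/path/to/files/*.nc', chunks={'time': 100})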
In the notebook:
from dask.distributed import Client
c = Client()  # start a local cluster with default settings
c             # displaying the client shows cluster details and the dashboard link
From within a script:
import tempfile
import dask.distributed

if __name__ == '__main__':
    client = dask.distributed.Client(
        n_workers=8, threads_per_worker=1,
        memory_limit='4gb', local_dir=tempfile.mkdtemp())
Check if a function is dask aware by watching the progress bar:
import dask.diagnostics
dask.diagnostics.ProgressBar().register()  # a progress bar will now appear for dask computations
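For instance (a sketch, again assuming a hypothetical path and a variable named tas):
import xarray as xr
import dask.diagnostics

dask.diagnostics.ProgressBar().register()

ds = xr.open_mfdataset('/path/to/files/*.nc')  # hypothetical path
clim = ds['tas'].mean(dim='time')  # lazy: nothing is computed yet, so no progress bar
clim.compute()                     # the progress bar appears while the mean is computed in parallel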
Use the dask map_overlap and map_blocks functions to make your functions dask aware.
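For example (a minimal sketch of map_blocks; the array and function here are purely illustrative):
import dask.array as da

def standardise(block):
    # An ordinary NumPy-style function (here applied to each chunk independently)
    return (block - block.mean()) / block.std()

arr = da.random.random((10000, 10000), chunks=(1000, 1000))
result = arr.map_blocks(standardise)  # lazily applies the function to each chunk
result.compute()                      # triggers the parallel computation
# map_overlap works the same way but passes each chunk with a halo of
# neighbouring values, which is needed for rolling or stencil-style operations.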
People dealing with ocean data (due to the extra depth dimension) or high time frequency data (e.g. hourly data) tend to run into issues (like memory errors) due to the large size of their data arrays.
Some lesson content on Dask would be helpful here.