carpentries-lab / python-aos-lesson

Python for Atmosphere and Ocean Scientists
https://carpentries-lab.github.io/python-aos-lesson/
Other
87 stars 48 forks source link

Add content on dealing with large arrays? #8

Closed DamienIrving closed 3 years ago

DamienIrving commented 6 years ago

People dealing with ocean data (due to the extra depth dimension) or high time frequency data (e.g. hourly data) tend to run into issues (like memory errors) due to the large size of their data arrays.

Some lesson content on Dask would be helpful here.

DamienIrving commented 6 years ago

Some introductory notes can be found at this post on Speeding Up Your Code

DamienIrving commented 6 years ago

One option might be to have people login to http://pangeo.pydata.org and then do one of the examples from https://github.com/pangeo-data/pangeo-example-notebooks by cloning that repo in the jupyter terminal.

(To get a notebook rather than jupyter lab environment you need to replace lab with tree in the URL, e.g. http://pangeo.pydata.org/user/damienirving/tree)

DamienIrving commented 5 years ago

Resources: This NCI notebook from Kate Snow introduces chunking. This tutorial from Scott Wales (see recording) introduces more advanced dask usage.

Possible outline:

0. Simple things you can do

Lazy loading, subsetting, intermediate files, looping over depth slices (for instance).

1. Introduction to chunking

Dask chunking

The metadata of an xarray DataArray loaded with open_mfdataset includes the dask chunk size.

File chunking

The file itself may also be chunked. Filesystem chunking is available in netCDF-4 and HDF5 datasets. CMIP6 data should all be netCDF-4 and include some form of chunking on the file.

You can look at the .encoding attribute of an xarray variable to see information about the file storage.

2. Chunking best practices

Accessing data across chunks is slower than along chunks.

Optimal chunk sizes:

3. Parallelising your code

In the notebook:

from dask.distributed import Client
c = Client()
c

From within a script:

import dask.distributed

if __name__ == '__main__':
    client = dask.distributed.Client(
        n_workers=8, threads_per_worker=1,
        memory_limit='4gb', local_dir=tempfile.mkdtemp())

4. Rolling your own dask aware functions

Check if a function is dask aware by watching the progress bar:

import dask.diagnostics
dask.diagnostics.ProgressBar().register()

Use the dask map_overlap and map_blocks to make your functions dask aware.