EcohydrologyTeam / ClearWater-modules

A collection of water quality and vegetation process simulation modules designed to couple with water transport models.

Reduce memory usage: lazy writing to `dask.array` #69

Open aufdenkampe opened 4 months ago

aufdenkampe commented 4 months ago

Splitting out from #62: this issue tracks another approach to managing memory, addressing the out-of-memory failure described in https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1833765631 and complementing the other memory-management work split out from that issue.

Dask Array

We also want to consider using dask.array within Xarray, which under the hood is just a chunked collection of numpy.ndarray objects. This gives us the ability to handle arrays that are larger than memory, by accessing only certain chunks at a time (e.g., one timestep at a time). These docs (at the bottom of the section) explain how lazy writing works:

Once you’ve manipulated a Dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way... By setting the compute argument to False, to_netcdf() will return a dask.delayed object that can be computed later.

import xarray as xr
from dask.diagnostics import ProgressBar

# `ds` is assumed to be a dask-backed Dataset, e.g.:
ds = xr.open_dataset("example-data.nc", chunks={"time": 10})

# compute=False defers the write and returns a dask.delayed object
delayed_obj = ds.to_netcdf("manipulated-example-data.nc", compute=False)

with ProgressBar():
    results = delayed_obj.compute()

NOTE: We can use the Dataset.to_zarr() method the same way.
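A minimal sketch of the same deferred-write pattern with to_zarr() (the file names and chunk sizes are illustrative):

import xarray as xr
from dask.diagnostics import ProgressBar

# A dask-backed dataset, chunked along the time dimension
ds = xr.open_dataset("example-data.nc", chunks={"time": 10})

# Only metadata is written eagerly; the array data write is deferred
delayed_obj = ds.to_zarr("manipulated-example-data.zarr", mode="w", compute=False)

with ProgressBar():
    delayed_obj.compute()  # streams the data to the store chunk by chunk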

The solution near the bottom of this thread describes a smart approach to doing exactly what we need. Let's implement something similar (see the suggested code in response 8): https://discourse.pangeo.io/t/processing-large-too-large-for-memory-xarray-datasets-and-writing-to-netcdf/1724/8
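For ClearWater, that pattern could look roughly like the sketch below (variable names, dimension sizes, and the store path are all illustrative, and the random values stand in for the model's per-timestep output): initialize the Zarr store lazily from a chunked template, then write each timestep into its region of the store as soon as it is computed, so only one timestep's worth of data is ever held in memory.

import dask.array as da
import numpy as np
import xarray as xr

n_time, n_cell = 100, 10_000  # illustrative sizes

# Template dataset backed by lazy dask zeros, chunked one timestep at a time
template = xr.Dataset(
    {
        "concentration": (
            ("time", "cell"),
            da.zeros((n_time, n_cell), chunks=(1, n_cell), dtype="float32"),
        )
    },
    coords={"time": np.arange(n_time)},
)

# Write only the store layout, metadata, and coordinates; no data is computed
template.to_zarr("results.zarr", mode="w", compute=False)

# Inside the timestep loop, write each completed slice into its region
for i in range(n_time):
    values = np.random.rand(1, n_cell).astype("float32")  # stand-in for model output
    step = xr.Dataset({"concentration": (("time", "cell"), values)})
    step.to_zarr("results.zarr", region={"time": slice(i, i + 1)})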

aufdenkampe commented 1 month ago

It looks like Dask just introduced another way to reduce memory overhead: Shuffling large data at constant memory in Dask

This approach may not be necessary once we implement lazy writing (as described in the introductory comment on this issue). Also, it's possible that this approach only works on dataframes that use Arrow dtypes, which aren't yet supported for multidimensional arrays, so it might not be easy to implement at this time. Sharing it regardless, to plant seeds for this and/or similar future work.
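If that post refers to Dask's peer-to-peer (p2p) shuffle backend (an assumption on my part), enabling it is mostly a configuration switch, though it requires the distributed scheduler and applies to dataframes rather than arrays. A minimal sketch, with a toy dataframe standing in for real tabular data:

import dask
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

client = Client()  # p2p shuffling requires the distributed scheduler

# Toy dataframe standing in for real data
ddf = dd.from_pandas(
    pd.DataFrame({"key": range(10_000), "value": range(10_000)}),
    npartitions=10,
)

# Select the p2p backend, which keeps memory roughly constant during shuffles
with dask.config.set({"dataframe.shuffle.method": "p2p"}):
    shuffled = ddf.shuffle(on="key").compute()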