Open aufdenkampe opened 9 months ago
It looks like Dask just introduced another way to reduce memory overhead: "Shuffling large data at constant memory in Dask".
This approach may not be necessary once we implement lazy-writing (as described in the intro comment to this issue). Also, it's possible that this approach only works on dataframes that use arrow dtypes, which aren't yet supported for multidimensional arrays. So it might not be easy to implement at this time. Sharing regardless, to plant seeds for this and/or similar future work.
Splitting this other approach to managing memory out from #62, to address the running-out-of-memory issue described in https://github.com/EcohydrologyTeam/ClearWater-modules/issues/57#issuecomment-1833765631, complementing:
- #62
- #68
Dask Array
We also want to consider using `dask.array` within Xarray, which under the hood is just a chunked collection of `numpy.ndarray` objects. What this gets us is the ability to handle arrays that are larger than memory, by accessing only certain chunks at a time (i.e. by timestep). These docs (at the bottom of the section) explain how lazy writing works.

NOTE: We can use the `Dataset.to_zarr()` method the same way.
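As a rough sketch of that lazy read → compute → write pattern (the file name, variable name, and chunk size below are placeholders, not from our datasets):

```python
import xarray as xr

# Open with Dask-backed arrays by passing `chunks`; no data is loaded yet.
ds = xr.open_dataset("clearwater_inputs.nc", chunks={"time": 100})

# Build the computation lazily; this only records the task graph.
fahrenheit = ds["water_temp_c"] * 1.8 + 32

# Writing triggers the computation chunk by chunk, so the full result
# never has to fit in memory at once.
fahrenheit.to_dataset(name="water_temp_f").to_zarr("outputs.zarr", mode="w")
```

The same pattern works with `Dataset.to_netcdf()`.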
The solution near the bottom of this thread describes a smart approach to do exactly what we need. Let's implement something similar (see the suggested code lines in response 8): https://discourse.pangeo.io/t/processing-large-too-large-for-memory-xarray-datasets-and-writing-to-netcdf/1724/8
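A minimal sketch of that write-by-region pattern, with made-up dimensions, variable names, and store path (the per-timestep dataset here is a stand-in for an actual model step):

```python
import numpy as np
import xarray as xr

# Illustrative sizes, coordinates, and output store (not taken from the thread).
n_time, n_cell = 10, 1_000
times = np.arange(n_time)
cells = np.arange(n_cell)
store = "results.zarr"

# 1. Build a lazy "template" dataset with the final shape, coords, and dtypes,
#    chunked so that one zarr chunk corresponds to one timestep.
template = xr.Dataset(
    {"concentration": (("time", "cell"), np.zeros((n_time, n_cell)))},
    coords={"time": times, "cell": cells},
).chunk({"time": 1})

# 2. Write only the metadata; compute=False defers writing the array values.
template.to_zarr(store, mode="w", compute=False)

# 3. Compute and write one timestep at a time, so memory use stays roughly constant.
for i in range(n_time):
    # Stand-in for one timestep of model output.
    step = xr.Dataset(
        {"concentration": (("time", "cell"), np.full((1, n_cell), float(i)))},
        coords={"time": times[i : i + 1], "cell": cells},
    )
    # Variables without a dimension in the region (here the "cell" coord)
    # must be dropped before a region write.
    step.drop_vars("cell").to_zarr(store, region={"time": slice(i, i + 1)}, mode="r+")
```

Writing the metadata up front with `compute=False` and then filling one time slice per iteration keeps peak memory at roughly the size of a single timestep, which is what we need for long simulations.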