RhodiumGroup / rhg_compute_tools

Tools for using compute.rhg.com and compute.impactlab.org
MIT License

address strange memory blowup with rhg_compute_tools.xarray.dataarrays_from_delayed #84

Closed bolliger32 closed 3 years ago

bolliger32 commented 3 years ago

Gather dict(ds.coords) instead of ds.coords (see #83)

delgadom commented 3 years ago

This is awesome! Thanks @bolliger32. Dug around a bit and it seems da.coords._data is a pointer to the original array, so the whole array was getting shipped back to the notebook along with the coords. dict(da.coords) solves the problem! Thanks for the tip!
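To illustrate the mechanism, here is a minimal pure-Python sketch. FakeArray and FakeCoords are hypothetical stand-ins, not xarray classes, and the sketch assumes results are serialized with pickle on their way back to the client. It only shows why a view object that holds a back-reference to its parent drags the parent's data along, while a plain dict built from that view does not.

```python
import pickle


class FakeCoords:
    """Hypothetical coords view that keeps a back-reference to its parent."""

    def __init__(self, parent):
        self._data = parent  # pointer to the full array object

    def keys(self):
        return self._data.coord_values.keys()

    def __getitem__(self, key):
        return self._data.coord_values[key]


class FakeArray:
    """Hypothetical stand-in for a DataArray: big payload, small coords."""

    def __init__(self, n_bytes):
        self.values = bytes(n_bytes)                # large payload
        self.coord_values = {"x": list(range(10))}  # small coordinate labels
        self.coords = FakeCoords(self)


arr = FakeArray(1_000_000)

# Serializing the view follows _data and pulls in the whole payload.
view_size = len(pickle.dumps(arr.coords))

# dict(...) copies out only the key -> label mapping, dropping the back-reference.
dict_size = len(pickle.dumps(dict(arr.coords)))

print(f"coords view: {view_size} bytes, dict(coords): {dict_size} bytes")
```

In this toy version the serialized view is larger than the payload itself, while the dict stays tiny. Whether xarray's coords object behaves the same way in a given version would need to be checked against that version.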

delgadom commented 3 years ago

This would be cool to combine with the new xarray combine functions, so you could combine based on coords or auto-combine. Or we could just drop the singular dataarray and dataset from-delayed functions, provide only the plural dataarrays and datasets versions, and point users to these concat functions.

Workflow would just be:

import functools
import xarray as xr
import rhg_compute_tools.xarray as rhgx

futures = [ ... ]  # flat list of dataarray futures with arbitrary non-overlapping coordinate relationships
da = xr.combine_by_coords(rhgx.dataarrays_from_delayed(futures))

futures = [[...], [...], ...]  # nested list of dataarray futures with hierarchical structure
# combine_nested requires concat_dim: one dimension name (or None) per nesting level
da = xr.combine_nested(rhgx.dataarrays_from_delayed(futures), concat_dim=[...])

# or even, if you want terrible performance and just don't care...
futures = [ ... ]  # ordered flat list of dataarray futures with overlapping coordinate relationships
da = functools.reduce(lambda x, y: x.combine_first(y), rhgx.dataarrays_from_delayed(futures))

bolliger32 commented 3 years ago

> This is awesome! Thanks @bolliger32. Dug around a bit and it seems da.coords._data is a pointer to the original array, and that was getting shipped back to the notebook. dict(da.coords) solves the problem! Thanks for the tip!

@delgadom nice find! that was blowing my mind.

> this would be cool to combine with the new xarray combine functions so you could combine based on coords or auto-combine. or just drop the dataarray and dataset from delayed functions and just provide dataarrays and datasets functions and point the users to these concat functions.

Agreed. Maybe I'll create an issue just to keep this on the back burner. If people start using these functions more frequently, we can expose the xarray combine functions in a user-friendly way.