jhamman opened this issue 4 years ago
In https://github.com/pangeo-data/pangeo/issues/334#issuecomment-404393988, I got dask-distributed working on a synthetic regridding problem (just a sparse matrix multiply) by using `dask.array.map_blocks`. Just switching to the distributed scheduler generally won't work, because the regridding weights are relatively large (often > 100 MB); serializing them takes a very long time and often kills the cluster, as detailed here.

To make dask-distributed work, there needs to be an explicit call to broadcast the weights to all workers, `weights_future = client.scatter(weights, broadcast=True)`, and the future is then passed as an additional argument: `da.map_blocks(apply_weights, input_data, weights_future, ...)`. See the full code in this notebook.
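A condensed sketch of that pattern (the array sizes, chunking, and the `apply_weights` helper here are illustrative stand-ins for the notebook's actual code):

```python
import dask.array as da
import scipy.sparse
from dask.distributed import Client

def apply_weights(block, weights):
    # Multiply the sparse weight matrix into one chunk of flattened input data.
    # block: (n_time_chunk, n_points_in); weights: (n_points_out, n_points_in)
    return weights.dot(block.T).T

client = Client()  # distributed scheduler

# Synthetic sparse weights standing in for real regridding weights (often > 100 MB)
weights = scipy.sparse.random(10_000, 20_000, density=1e-4, format='csr')

data = da.random.random((400, 20_000), chunks=(100, 20_000))

# Broadcast the weights to every worker once, instead of embedding a copy in
# every task (which is what makes plain serialization so slow).
weights_future = client.scatter(weights, broadcast=True)

# Pass the future as an extra argument; the distributed scheduler resolves it
# to the worker-local copy when each block is computed.
out = da.map_blocks(apply_weights, data, weights_future,
                    dtype=data.dtype, chunks=(100, 10_000))
result = out.compute()
```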
I haven't tested dask-distributed on xarray DataArray/Dataset yet, and your error might be due to a different issue associated with xarray metadata rather than dask. A quick way to test this is to pass the raw data instead, i.e. `regridder(ds['air'].data)`, since xESMF also works on pure dask arrays.
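For example (assuming the `ds` and `regridder` objects from the repo's dask regridding example):

```python
# Regrid the bare dask array so that xarray metadata is not involved; if this
# still fails under the distributed scheduler, the problem is in dask
# serialization rather than in xarray.
out_raw = regridder(ds['air'].data)   # dask array in, dask array out
out_raw.compute()
```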
Hmm, with `regridder(ds['air'].data)` I got `ValueError: ctypes objects containing pointers cannot be pickled` at the end of this long trace:
Aha, problem solved. Just set `regridder._grid_in = None` and `regridder._grid_out = None` before applying the regridder to the data.

`regridder._grid_in` was linked to ESMF objects that involve `f2py` and `ctypes`, and Dask was having trouble pickling them. In the next version I will make sure that the `Regridder` class does not refer to any ESMF objects.
The explicit broadcasting of regridding weights is still TBD, though.
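Put together, the interim recipe looks roughly like this (assuming the `ds`/`regridder` from the repo's example; note that each task still carries its own copy of the weights, since the explicit broadcasting isn't wired in yet):

```python
# Drop the unpicklable ESMF handles so dask can serialize the tasks.
regridder._grid_in = None
regridder._grid_out = None

# Regrid the raw dask array; with a dask.distributed Client active this now
# runs on the cluster, but every task still ships a copy of the weights.
out = regridder(ds['air'].data)
result = out.compute()
```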
In https://github.com/pangeo-data/pangeo/issues/334#issuecomment-404003411, I was thinking about adding an explicit `regridder.set_distributed(client=client)` call to send the weights to all worker nodes. Then `regridder.weights` would become a Dask future pointing to the distributed weights on all workers.

Or maybe there is a cleverer way to hide this explicit call from users. Any suggestions & PRs are extremely welcome!
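As a rough sketch of what that proposal could look like (this is not an existing xESMF API; the standalone-function form and the `weights` attribute handling are assumptions):

```python
from dask.distributed import Client

def set_distributed(regridder, client: Client):
    """Hypothetical helper mirroring the proposed Regridder.set_distributed():
    broadcast the sparse weight matrix to every worker once and replace
    regridder.weights with the resulting distributed Future, so that the
    apply step can pass it through dask.array.map_blocks."""
    regridder.weights = client.scatter(regridder.weights, broadcast=True)
    return regridder

# Possible usage:
# client = Client()
# regridder = set_distributed(regridder, client)
```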
@JiaweiZhuang in your example notebook that you included above, what versions of xESMF and dask.distributed are you using? I am still getting the `...cannot be pickled...` error when I replicate your sample workflow.
Previous issues have discussed supporting dask-enabled parallel regridding (e.g. #3). This seems to be working for the threaded scheduler but not for the distributed scheduler. It should be doable at this point with some work to solve a few serialization problems.
Current behavior
If you run the current dask regridding example in this repo's binder setup with dask-distributed, you get a bunch of serialization errors:
From what I can tell, dask is trying to serialize some object that cannot be pickled. Has anyone looked into this to diagnose why it is happening?
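For reference, the failing pattern is roughly the following (the dataset and target grid follow the repo's dask regridding example; the exact values here are assumptions):

```python
import numpy as np
import xarray as xr
import xesmf as xe
from dask.distributed import Client

client = Client()  # switch from the default threaded scheduler to distributed

# Chunked input dataset and a coarse target grid, as in the dask example
ds = xr.tutorial.open_dataset('air_temperature', chunks={'time': 500})
ds_out = xr.Dataset({'lat': (['lat'], np.arange(16, 75, 1.0)),
                     'lon': (['lon'], np.arange(200, 330, 1.5))})
regridder = xe.Regridder(ds, ds_out, 'bilinear')

out = regridder(ds['air'])   # builds a lazy dask graph
out.compute()                # fails with serialization errors under distributed
```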