matt-long opened this issue 5 years ago
I recently wanted to generate weights to map ETOPO1 (1-minute data) to 0.1° POP. The esmlab.regrid function failed; I resorted to running ESMF_RegridWeightGen in MPI on 12 Cheyenne nodes.
I am curious to know what kind of error it was (MemoryError, etc.), or was it just too slow?
Pretty sure it was a memory error, but I don't recall the specific message. I had to use several nodes to get over the memory hurdle with MPI.
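For reference, a standalone weight-generation run with ESMF's command-line tool under MPI might look something like the following. This is illustrative only: the file names and process count are made up, not taken from the actual run.

```shell
# Illustrative sketch: generate conservative regridding weights with
# ESMF's parallel command-line tool instead of esmlab/xESMF.
# Paths and -np count are placeholders.
mpirun -np 432 ESMF_RegridWeightGen \
    --source src_grid.nc \
    --destination dst_grid.nc \
    --weight weights.nc \
    --method conserve
```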
Per xesmf documentation: https://xesmf.readthedocs.io/en/latest/limitations.html
xESMF currently only runs in serial. Parallel options are being investigated.
https://github.com/JiaweiZhuang/xESMF/issues/3
I just found out about it.
We are currently using xESMF, but don't have to. ESMPy does support MPI: http://www.earthsystemmodeling.org/esmf_releases/last_built/esmpy_doc/html/examples.html?highlight=mpi
though it's not clear how to integrate with dask.
Introducing MPI and ESMPy's complicated interface :), and integrating these with Xarray and Dask, would definitely be a conundrum.
I am curious: what is the highest priority for esmlab-regrid? Is it usability? Performance? Do we want users to be able to perform regridding with one line of code? If usability is not the highest priority, it would be worth looking into MPI and ESMPy functionality.
It looks like the Dask folks are looking into this kind of workflow: Running Dask and MPI programs together: an experiment
@matt-long, correct me if I'm wrong: this kind of parallelism is only needed when generating the weights. Once you have the weights, you don't need the ESMPy/MPI machinery anymore. Applying the weights, which is a matrix multiplication, could be done without this heavy machinery, with Scipy/Dask/Xarray, right?
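If that's right, the application step reduces to a sparse matrix-vector product over the flattened grids. A minimal sketch (the shapes and weight values here are toy numbers for illustration, not from a real weight file):

```python
# Sketch: applying precomputed regridding weights as a sparse
# matrix-vector product; no ESMPy/MPI needed at apply time.
# The weights below are made up: each destination cell simply
# averages two source cells.
import numpy as np
import scipy.sparse as sps

n_src, n_dst = 6, 3
rows = np.repeat(np.arange(n_dst), 2)   # destination cell indices
cols = np.arange(n_src)                 # source cell indices
vals = np.full(n_src, 0.5)              # conservative-style weights
weights = sps.coo_matrix((vals, (rows, cols)), shape=(n_dst, n_src)).tocsr()

src = np.arange(n_src, dtype=float)     # flattened source field
dst = weights @ src                     # regridded (flattened) field
print(dst)                              # [0.5 2.5 4.5]
```

A real weight matrix would of course be read from the generated weight file rather than constructed by hand.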
I think our focus should remain on an end-to-end workflow and usability in the near term, but keep performance through parallelism on the radar.
We could consider prototyping an MPI implementation as a standalone script, analogous to that shown here.
@andersy005, you are correct. The weights files are sparse matrices and are handled well by scipy.sparse.
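For what it's worth, a rough sketch of applying such a sparse weight matrix lazily with dask, chunked along time so each chunk carries the full spatial dimension (toy shapes and weights, just to illustrate the pattern):

```python
# Sketch: chunk-parallel weight application with dask.array.
# Shapes and weights are toy values for illustration.
import numpy as np
import dask.array as da
import scipy.sparse as sps

n_src, n_dst, n_time = 4, 2, 6
# Toy sparse weights: each destination cell averages two source cells.
W = sps.coo_matrix(
    ([0.5, 0.5, 0.5, 0.5], ([0, 0, 1, 1], [0, 1, 2, 3])),
    shape=(n_dst, n_src),
).tocsr()

# Lazy (time, space) source field, chunked along time only, so every
# chunk holds the full spatial dimension the weights need.
src = da.ones((n_time, n_src), chunks=(2, n_src))

# Apply the weights chunk-by-chunk; chunks can be computed in parallel.
dst = da.map_blocks(
    lambda block: (W @ block.T).T, src,
    chunks=(2, n_dst), dtype=float,
)
print(dst.compute().shape)  # (6, 2)
```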
@matt-long, was the work you were doing to generate WEIGHT_FILE=/glade/work/mclong/esmlab-regrid/etopo1_to_POP_tx0.1v3_conservative.nc
connected to the content of this notebook https://gist.github.com/matt-long/87630e97dc787ffc27b33e944dcd1473 ?
Yes
Since you are not using xesmf and ESMF/ESMPy, and the code deals with raw NumPy, I was thinking of exploring some optimization with numba and dask. Do you see any value in this or am I missing anything before I end up going down a rabbit hole :) ?
By "connected" I mean that that code was used in the same project. It does not compute the weight files, but rather only the grid file. It's fast enough as is, I'd say. Not a high priority for optimization.
Good point. Does this mean that the failing component is the `_gen_weights` method?
Yes.
Thank you for the clarification! Speaking of high priority, is there anything on your plate I can help with? :)
Not sure if related to JiaweiZhuang/xESMF#29. Parallel weight generation is very hard (if possible at all) to rewrite in a non-MPI way. But after the weights are generated, applying them to data using dask is much easier.
My plan is to clearly separate the "weight generation" and "weight application" phases.
Such separation will be much clearer after resolving JiaweiZhuang/xESMF#11. My plan is to have a "mini-xesmf" installation that doesn't depend on ESMPy -- it will just construct a complete regridder from existing weight files, generated by an ESMPy program running elsewhere (potentially a huge MPI run, potentially with an xesmf wrapper for better usability).
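As a sketch of what that ESMPy-free application layer might look like: the `col`/`row`/`S` names below follow the ESMF weight-file convention (which stores 1-based indices); actually reading those arrays out of the NetCDF file (e.g. with xarray) is omitted here, and the toy arrays at the bottom are made up.

```python
# Sketch of a "mini-xesmf" style regridder depending only on scipy,
# built from the col/row/S arrays stored in an ESMF weight file.
import numpy as np
import scipy.sparse as sps

def regridder_from_weights(col, row, S, n_src, n_dst):
    """Return a function mapping a flattened source field to the
    destination grid using precomputed ESMF-style weights."""
    matrix = sps.coo_matrix(
        (S, (row - 1, col - 1)),  # ESMF indices are 1-based
        shape=(n_dst, n_src),
    ).tocsr()
    return lambda field: matrix @ field

# Toy example: 2 destination cells, each averaging 2 of 4 source cells.
col = np.array([1, 2, 3, 4])
row = np.array([1, 1, 2, 2])
S = np.array([0.5, 0.5, 0.5, 0.5])
regrid = regridder_from_weights(col, row, S, n_src=4, n_dst=2)
print(regrid(np.array([1.0, 3.0, 5.0, 7.0])))  # [2. 6.]
```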