JiaweiZhuang / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License
269 stars 49 forks

Invoking xesmf with mpirun #79

Closed nichannah closed 4 years ago

nichannah commented 4 years ago

I am using mpirun to run my Python program across multiple nodes in a cluster. Each instance of the program uses MPI to determine its own rank and the number of processes, but nothing else. Each program also uses xESMF to do some regridding.
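
For context, this is essentially all the MPI each instance does (a minimal sketch, assuming mpi4py; the commented-out work split is only illustrative):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's index, 0 .. size-1
size = comm.Get_size()   # total number of launched processes

# illustrative work split: each rank handles its own slice of the inputs
# my_files = all_files[rank::size]
```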

The problem is that the underlying ESMF library then tries to decompose the regridding task across the ranks. xESMF does not handle this and fails with an error.

Since xESMF does not support parallel regridding (yet) - is there a way to ensure that the underlying library does not try to do this?

Any thoughts or work-around ideas would be much appreciated.

JiaweiZhuang commented 4 years ago

Each instance of the program uses MPI to determine its own rank and the number of processes, but nothing else.

So you don't need any inter-process communication using mpi4py? In that case I would suggest not using mpirun to launch your python script, but instead using a scheduler feature like Slurm Job Array Support and getting your job ID via os.environ['SLURM_ARRAY_TASK_ID'].
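
A rough sketch of that pattern (assuming the script is submitted with something like `sbatch --array=0-15`; `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are standard Slurm environment variables):

```python
import os

# Submitted with e.g. `sbatch --array=0-15 job.sh`, where job.sh runs this script.
task_id = int(os.environ['SLURM_ARRAY_TASK_ID'])     # plays the role of the MPI rank
n_tasks = int(os.environ['SLURM_ARRAY_TASK_COUNT'])  # plays the role of the MPI size

# illustrative work split, mirroring the MPI version
# my_files = all_files[task_id::n_tasks]
```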

Since xESMF does not support parallel regridding (yet)

Parallel weight construction is not supported yet, but the weights can be applied in parallel via Dask. See the long discussion at #3.

Does your use case actually need MPI-style parallelization? If the data can be chunked in vertical/time dimension, Dask should be sufficient. Any reason for having to chunk in the horizontal?
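
For reference, a sketch of the Dask route (the file name, chunk size and variable name are placeholders): the weights are built once in serial, but applying them to a dask-backed DataArray is lazy and runs chunk-by-chunk.

```python
import xarray as xr
import xesmf as xe

# chunk along time so the weights can be applied block-by-block
ds_in = xr.open_dataset('input.nc', chunks={'time': 10})

# 1-degree global target grid
ds_out = xe.util.grid_global(1.0, 1.0)

# weight construction is still serial...
regridder = xe.Regridder(ds_in, ds_out, 'bilinear')

# ...but applying the weights is lazy and parallelized by Dask
dr_out = regridder(ds_in['air'])   # 'air' is a placeholder variable name
dr_out.to_netcdf('output.nc')      # computation is triggered here
```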

nichannah commented 4 years ago

@JiaweiZhuang thank you for the very quick reply and useful suggestions.

Yes, that is correct; the only reason to use MPI-style parallelization is to launch across multiple nodes.

The SLURM suggestion is a good one, and this is what I'm doing on a cluster that has Slurm installed. However, I also need to get it running on a PBS cluster, which uses MPI to do task launching.

I have tried disconnecting the MPI communicator (comm.Disconnect()) after start-up but this seems to crash ESMF with a seg fault.

nichannah commented 4 years ago

OK, I think you've answered this. The best approach is probably to use job arrays. An alternative might be to use ESMF compiled without MPI support.
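
A rough sketch of picking up the job-array index on either scheduler (the PBS variable name depends on the flavour: PBS_ARRAY_INDEX for PBS Pro arrays submitted with qsub -J, PBS_ARRAYID for Torque arrays submitted with qsub -t):

```python
import os

def get_task_id():
    """Return this task's array index from whichever scheduler launched it."""
    for var in ('SLURM_ARRAY_TASK_ID',  # Slurm job arrays
                'PBS_ARRAY_INDEX',      # PBS Pro job arrays (qsub -J)
                'PBS_ARRAYID'):         # Torque job arrays (qsub -t)
        if var in os.environ:
            return int(os.environ[var])
    return 0  # fall back to a single serial task

task_id = get_task_id()
```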