geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] MAPL pFIO failure at high core counts #33

Closed lizziel closed 4 years ago

lizziel commented 5 years ago

GCHP 12.5.0 uses online ESMF regridding weights rather than external tile files for regridding. Due to ESMF domain decomposition rules, this can result in a MAPL error if an input grid is too coarse for a run's configured core count. We have seen this problem for 4°x5° input files when using more than 600 cores. GMAO has fixed this problem in a more recent version of MAPL. All 4°x5° input files have been replaced with higher-resolution files in GCHP 12.5.0 to avoid this issue. However, users may still run into problems when running with thousands of cores.

JiaweiZhuang commented 5 years ago

this can result in a MAPL error if an input grid is too coarse for a run’s configured core count.

Because the input grid cannot be horizontally decomposed into too many tiles? So I guess the new MAPL uses all the cores to perform the distributed regridding weight calculation (via ESMF)? For a low-resolution input grid, the communication cost could dominate the computation cost. In that case, it is probably faster to run the regridding in serial or to read the weights from a file...

Does 12.5.0 have an option to fall back to offline weight files?

lizziel commented 5 years ago

Because the input grid cannot be horizontally decomposed into too many tiles? So I guess the new MAPL uses all the cores to perform the distributed regridding weight calculation (via ESMF)?

Correct. 12.5.0 does not have an option to use offline weight files. My understanding is that the newer version of MAPL does have this option. If you would like to explore MAPL developments at GMAO check out https://github.com/GEOS-ESM/mapl.

lizziel commented 4 years ago

A bit more background on this issue, since there have been several questions about it. This is the explanation from Tom Clune (GMAO):

The model resolution is not the relevant parameter here. (Presuming I've correctly identified the problem.) Rather, it is a constraint on the resolution of the input data in ExtData versus the number of cores it is chopped up into. But for the moment, I can explain the issue in the abstract:

Suppose that the source lat-lon resolution is Nx x Ny, and the core layout is Npx x Npy. The decomposition is automatically determined by MAPL but essentially tries to make both Npx and Npy close to sqrt(#PEs); it is independent of how processes are split across nodes and independent of how the main model grid is decomposed. (There is a minor caveat: the algorithm knows that typically Nx ~ 2 x Ny, and thus aims for Npx/Npy to be nearly 2 as well.)
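For illustration, here is a rough sketch of that layout heuristic. This is not MAPL's actual code; it simply assumes the layout must factor #PEs exactly and prefers an Npx/Npy ratio near 2:

```python
import math

def guess_layout(n_pes):
    """Toy stand-in for MAPL's automatic decomposition (assumption: factor
    n_pes exactly, preferring an Npx/Npy aspect ratio near 2 since Nx ~ 2*Ny)."""
    best = None
    for npy in range(1, int(math.sqrt(n_pes)) + 1):
        if n_pes % npy:
            continue
        npx = n_pes // npy
        score = abs(npx / npy - 2.0)   # distance from the preferred 2:1 shape
        if best is None or score < best[1]:
            best = ((npx, npy), score)
    return best[0]

print(guess_layout(384))   # (24, 16), matching the concrete case below
print(guess_layout(576))   # (32, 18); (36, 16) is nearly as good
```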

If everything divides evenly, then the ESMF requirement is that Nx/Npx >= 2 and Ny/Npy >= 2.

Concrete case: Np = 384, decomposed as Npx x Npy = 24 x 16. Then the lat-lon source grid must be at least 48 x 32. For Np = 576, MAPL probably chooses 32x18 or 36x16, which means the grid must be at least 64x36 or 72x32.
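Putting the divisibility requirement and those layouts together (a minimal worked check, assuming the even-division case above):

```python
def min_source_grid(npx, npy):
    """Smallest lat-lon source grid satisfying Nx/Npx >= 2 and Ny/Npy >= 2."""
    return 2 * npx, 2 * npy

for npx, npy in [(24, 16), (32, 18), (36, 16)]:
    nx, ny = min_source_grid(npx, npy)
    print(f"{npx * npy} cores laid out {npx}x{npy}: source grid must be at least {nx}x{ny}")

# 384 cores laid out 24x16: source grid must be at least 48x32
# 576 cores laid out 32x18: source grid must be at least 64x36
# 576 cores laid out 36x16: source grid must be at least 72x32
```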

OK - so now looking at the data, the failing Np=384 case exceeded memory, but otherwise 384 apparently works for all of your grids. Likewise, the one failing 576 case was for other reasons. The others all consistently failed. 576 is the smallest failing case, and thus I predict you have at least one input file that is coarser than 64x36 (or maybe 72x32). Can you confirm?
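One way to confirm would be to scan the input files for grids below that threshold, along these lines (a hypothetical sketch; the data directory and the lat/lon dimension names are assumptions and vary by collection):

```python
import glob
import xarray as xr  # assumption: the inputs are NetCDF files readable by xarray

MIN_NX, MIN_NY = 64, 36  # minimum source grid implied by the 576-core layout

for path in sorted(glob.glob("ExtData/**/*.nc*", recursive=True)):  # hypothetical location
    with xr.open_dataset(path) as ds:
        nx = ds.sizes.get("lon", ds.sizes.get("longitude", 0))
        ny = ds.sizes.get("lat", ds.sizes.get("latitude", 0))
    if 0 < nx < MIN_NX or 0 < ny < MIN_NY:
        print(f"Too coarse for this core count: {path} ({nx} x {ny})")
```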

A workaround might be to either exclude those files or to interpolate them offline to something higher resolution until you can use a more recent MAPL that works around the ESMF limitation.
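For the offline-interpolation route, here is a sketch using xESMF (the file names and the 1°x1.25° target grid are purely illustrative, and this assumes an xESMF version recent enough to regrid a whole Dataset):

```python
import xarray as xr
import xesmf as xe  # assumption: xESMF is available for one-off offline regridding

ds_in = xr.open_dataset("coarse_4x5_input.nc")   # hypothetical coarse input file
ds_out = xe.util.grid_global(1.25, 1.0)          # global 1° x 1.25° target grid
regridder = xe.Regridder(ds_in, ds_out, "bilinear", periodic=True)
ds_fine = regridder(ds_in)                       # regrid every variable in the file
ds_fine.to_netcdf("fine_1x1.25_input.nc")
```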

Our current approach in the GCHP 12.x series is to regrid the offending files to higher resolution. GCHP 13.0.0 will fix this issue.

lizziel commented 4 years ago

I am closing this issue since it is fixed in the dev versions of GCHPctm 13.0.0, which uses MAPL 2.0.