E3SM-Project / scorpio

A high-level Parallel I/O Library for structured grid applications

Simplify PIOc_inq_type with fewer MPI_Bcast calls for PnetCDF IO type #493

Closed dqwu closed 1 year ago

dqwu commented 1 year ago

To inquire the type size with the PnetCDF IO type, we do not need any communication across processes. All tasks can directly call the internal helper function pioc_pnetcdf_inq_type to get the results.

This PR simplifies PIOc_inq_type for PnetCDF IO type, reducing the number of collective MPI_Bcast calls in SCORPIO.
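For illustration, here is a minimal, self-contained sketch of why no broadcast is needed: with the PnetCDF IO type the atomic type sizes are fixed, so every task can resolve them locally. The names below are hypothetical stand-ins for the internal helper pioc_pnetcdf_inq_type, not the actual SCORPIO code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the PnetCDF atomic types;
 * not the real nc_type values used by SCORPIO. */
enum example_nc_type { EX_BYTE = 1, EX_CHAR, EX_SHORT, EX_INT, EX_FLOAT, EX_DOUBLE };

/* Every task can compute the type size locally, so no MPI_Bcast from
 * the I/O root is required. This mirrors the role of the internal
 * helper pioc_pnetcdf_inq_type, but is only an illustrative sketch. */
static int example_pnetcdf_type_size(int xtype, size_t *sizep)
{
    switch (xtype)
    {
    case EX_BYTE:
    case EX_CHAR:   *sizep = 1; break;
    case EX_SHORT:  *sizep = 2; break;
    case EX_INT:
    case EX_FLOAT:  *sizep = 4; break;
    case EX_DOUBLE: *sizep = 8; break;
    default:        return -1;  /* unknown/unsupported type */
    }
    return 0;  /* success, with no inter-process communication */
}
```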

It also fixes two potential issues:

dqwu commented 1 year ago

Flooding can occur when the MPI library applies no flow control and the root process calls MPI_Bcast many times in a row, generating a large number of unexpected messages on the other ranks and hence causing all kinds of problems (memory consumption, slowdown, hanging, ...).

A specific E3SM MMF case (compset F2010-MMF1, res ne30pg2_ne30pg2) run on the NERSC machine Perlmutter with 128 tasks (16 compute nodes) has a confirmed hang caused by MPI_Bcast flooding.

Some environment variables, such as OMPI_MCA_coll_sync_barrier_before (Open MPI) or MPICH_COLL_SYNC (Cray MPI), can automatically add an MPI_Barrier before each MPI_Bcast call.

For that run, there are 136,893 MPI_Bcast calls inside SCORPIO, and this PR reduces that number to 121,008 (an 11.6% reduction). It also avoids the hang without setting MPICH_COLL_SYNC (which does not always work for some other E3SM cases anyway).
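For background, the barrier-before-broadcast behavior those variables enable can be written out by hand as a small wrapper. This is a sketch only; bcast_with_sync is a hypothetical name, not a SCORPIO or MPI function.

```c
#include <mpi.h>

/* The barrier ensures every rank has finished the previous broadcast
 * before the root starts the next one, so unexpected messages cannot
 * pile up on slower ranks. */
static int bcast_with_sync(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm comm)
{
    MPI_Barrier(comm);
    return MPI_Bcast(buf, count, type, root, comm);
}
```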

worleyph commented 1 year ago

Note that this issue has arisen numerous times over many years, and motivated the inclusion of the explicit flow control logic in pio_swapm, for example.

rljacob commented 1 year ago

Pat is correct. This is a known problem and Pat's solution has been used in many places in E3SM (coupler, EAM). I don't think there's a C/C++ version?

worleyph commented 1 year ago

Jim Edwards reimplemented the Fortran routines from SCORPIO_classic in C in the current SCORPIO code. I imagine it is still being used for gathers and in the box_rearranger. Broadcasts are more iffy, though swapm could still be used. The throttle controls would be needed here, since the handshaking messages could swamp the source even more than the destination processes are swamped without flow control. Just adding barriers periodically might be the easiest solution for now, while other options are explored. A toy sketch of the handshaking pattern follows below.
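The sketch below shows the general handshaking pattern in the spirit of pio_swapm's flow control, but it is not the actual SCORPIO code; all names are hypothetical, and the real routine also throttles the number of outstanding requests.

```c
#include <mpi.h>

#define EX_TAG_READY 100
#define EX_TAG_DATA  101

/* Sender side: wait for a short "ready" token from the destination
 * before sending the data, so the receiver is never buried under
 * unexpected messages. */
static void ex_send_with_handshake(const void *buf, int count,
                                   MPI_Datatype type, int dest,
                                   MPI_Comm comm)
{
    int token;
    MPI_Recv(&token, 1, MPI_INT, dest, EX_TAG_READY, comm,
             MPI_STATUS_IGNORE);
    MPI_Send(buf, count, type, dest, EX_TAG_DATA, comm);
}

/* Receiver side: post the data receive first, then tell the source
 * to go ahead. */
static void ex_recv_with_handshake(void *buf, int count,
                                   MPI_Datatype type, int src,
                                   MPI_Comm comm)
{
    int token = 0;
    MPI_Request req;
    MPI_Irecv(buf, count, type, src, EX_TAG_DATA, comm, &req);
    MPI_Send(&token, 1, MPI_INT, src, EX_TAG_READY, comm);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```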

rljacob commented 1 year ago

I went looking:
MCT: https://github.com/MCSclimate/MCT/blob/master/mct/m_SPMDutils.F90#L620
EAM: https://github.com/E3SM-Project/E3SM/blob/master/components/eam/src/utils/spmd_utils.F90#L301

I guess that's it.

worleyph commented 1 year ago

Away from my computer, but look in pio_spmd.c in scorpio/src/C (or something like that) for what Jim imported. Would have to look at spmd_utils to see if there is anything there that would be useful and has not already been imported into SCORPIO.


rljacob commented 1 year ago

Thanks for the reminder, Pat.
SCORPIO: https://github.com/E3SM-Project/scorpio/blob/master/src/clib/pio_spmd.c#L76
SCORPIO-classic: https://github.com/E3SM-Project/scorpio/blob/scorpio_classic/pio/pio_spmd_utils.F90.in#L88

dqwu commented 1 year ago

@worleyph @rljacob To be clear, in this PR we are just trying to optimize the specific SCORPIO function PIOc_inq_type, and it is not related to flow control. Inquiring the type size with the PnetCDF IO type does not require any communication across processes, so this reduces the number of collective MPI_Bcast calls in SCORPIO.

The hanging issue (related to MPI_Bcast calls) on Perlmutter is discussed in https://github.com/E3SM-Project/scream/issues/1920

worleyph commented 1 year ago

My comment was just for background, in case you were not familiar with the existing infrastructure for dealing with this issue. Sounds like this PR is a clear win. If you feel that my comment would be useful in https://github.com/E3SM-Project/scream/issues/1920, please move it there, or I can. Or just keep it in mind in the future.