jedwards4b opened this issue 1 year ago

Using the CESM model in a coupler test configuration PFS.ne120_t12.2000_XATM_XLND_XICE_XOCN_XROF_SGLC_SWAV.derecho_intel, we are observing very poor performance of mct_rearrange_rearr on the machines perlmutter (NERSC) and derecho (NCAR); both use a Slingshot 11 network and AMD processors.

Using 512 tasks on derecho with GPTL timing we see:

"mct_rearrange_rearr" - 512 512 4.426752e+06 1.391128e+05 277.198 ( 268 0) 263.345 ( 505 0)

Comparing to the NCAR cheyenne system:

"mct_rearrange_rearr" - 512 512 4.426752e+06 3.399975e+04 73.911 ( 414 0) 60.767 ( 384 0)

That is a max wall time of roughly 277 s on derecho versus 74 s on cheyenne, about 3.7x slower for the same timer and task count.
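For reference, a minimal sketch of reproducing this with a stock CIME checkout (the scripts path and test driver behavior are assumptions that vary by model version):

```shell
# Hypothetical reproduction, assuming a standard CIME checkout on derecho;
# exact paths and options depend on the CIME version installed.
cd cime/scripts
./create_test PFS.ne120_t12.2000_XATM_XLND_XICE_XOCN_XROF_SGLC_SWAV.derecho_intel

# After the run completes, the GPTL timing summary containing the
# mct_rearrange_rearr timer quoted above is written under the case's
# timing/ directory.
```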
Noting that a similar performance difference is seen between perlmutter and chrysalis (an AMD machine with an InfiniBand network) for E3SM cases. (I haven't tried the exact case above yet.)
I just tried the X case on derecho with the Cray compiler and I am not seeing the poor performance: rearrange_rearr max 46.8, min 40.4 (Cray compiler 15.0.1) versus max 642.257, min 445.713 (Intel compiler 2023.0.0).
Is the MPI library different?
It's the same MPI library, cray-mpich/8.1.25; however, I note that there is a different build of this library for each compiler flavor.
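(For anyone checking this elsewhere: the compiler-specific build that actually gets linked can usually be seen through the Cray programming environment modules. A sketch only; module and environment names are assumptions that vary by site.)

```shell
# Illustrative only: inspect which cray-mpich build each compiler
# environment selects (module names vary by system).
module load PrgEnv-cray
module show cray-mpich/8.1.25    # note the library path for the Cray (cce) build

module swap PrgEnv-cray PrgEnv-intel
module show cray-mpich/8.1.25    # the path now points at the Intel build
```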
Updating this issue: some hardware updates at NERSC made a lot of the observed behavior go away. @ndkeen can say more.
During the Sep 28th maintenance there were some updates (BIOS, network, software), and indeed I see improvements in several places -- mostly in communication at higher node counts on pm-cpu.
In the accompanying plot (not reproduced here), c1 refers to the normal/default PSTRID of 1, and c8 is the workaround we had been using, CPL_PSTRID=8.
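For context, PSTRID controls the stride at which a component's MPI tasks are placed in the global rank list, so a coupler stride of 8 spreads the coupler ranks out rather than packing them contiguously. A minimal sketch of applying the workaround in a CIME case, assuming the setting is exposed as PSTRID_CPL in env_mach_pes.xml (the exact variable name may differ by model version):

```shell
# In the case directory: place coupler tasks with a stride of 8.
# PSTRID_CPL is an assumption; check env_mach_pes.xml for the exact name.
./xmlchange PSTRID_CPL=8
./case.setup --reset    # regenerate the PE layout with the new stride
./pelayout              # optional: print the resulting task placement
```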