ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime

Fail in Cmip6 case with gnu on cheyenne with cime5.6.24 #3259

Closed: ekluzek closed this issue 5 years ago

ekluzek commented 5 years ago

I'm getting a fail in MPI running this case on cheyenne: SMS_Ld1.f19_g17.I1850Clm50BgcCropCmip6.cheyenne_gnu.clm-default. It fails with cime5.6.24, but it works with at least cime5.6.21. All other aux_clm tests pass; only this one fails.

cesm.log file dies with:

> (seq_infodata_Init) 
>  read seq_infodata_inparm namelist from: drv_in
> (shr_orb_params) ------ Computed Orbital Parameters ------
> (shr_orb_params) Eccentricity      =   1.676429E-02
> (shr_orb_params) Obliquity (deg)   =   2.345928E+01
> (shr_orb_params) Obliquity (rad)   =   4.094416E-01
> (shr_orb_params) Long of perh(deg) =   1.003269E+02
> (shr_orb_params) Long of perh(rad) =   4.892627E+00
> (shr_orb_params) Long at v.e.(rad) =  -3.290978E-02
> (shr_orb_params) -----------------------------------------
> [r1i0n0:55376] *** An error occurred in MPI_Bcast
> [r1i0n0:55376] *** reported by process [58720257,35]
> [r1i0n0:55376] *** on communicator MPI_COMM_WORLD
> [r1i0n0:55376] *** MPI_ERR_COMM: invalid communicator
> [r1i0n0:55376] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [r1i0n0:55376] ***    and potentially your MPI job)
> [r1i0n0:55335] 35 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [r1i0n0:55335] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The coupler doesn't finish initialization and never gets to initializing the subcomponents, so cpl.log ends with...

> WAV :  pio_root =  1
> WAV :  pio_iotype =  6
> 8 MB memory alloc in MB is 8.00
> 8 MB memory dealloc in MB is 7.83
> Memory block size conversion in bytes is 4017.53
> (seq_flux_readnl_mct) : read seq_flux_mct_inparm namelist from: drv_in
> (seq_mct_drv) : ------------------------------------------------------------
> (seq_mct_drv) : Common Infrastructure for Modeling the Earth (CIME) CPL7
> (seq_mct_drv) : ------------------------------------------------------------
> (seq_mct_drv) : (Online documentation is available on the CIME
> (seq_mct_drv) : github: http://esmci.github.io/cime/)
> (seq_mct_drv) : License information is available as a link from above
> (seq_mct_drv) : ------------------------------------------------------------
> (seq_mct_drv) : MODEL cesm
> (seq_mct_drv) : ------------------------------------------------------------
> (seq_mct_drv) : DATE 10/10/19 TIME 16:01:19
> (seq_mct_drv) : ------------------------------------------------------------

ekluzek commented 5 years ago

The only thing that's in the user_nl_cpl for this case is histaux_l2x1yrg = .true. I've tried setting that to .false., and it still fails. There isn't anything else unusual about this case. Here are the xml settings for it...

./xmlchange --force CLM_BLDNML_OPTS="-fire_emis" --append
./xmlchange --force BFBFLAG="TRUE"

There are other gnu tests that set BFBFLAG and are passing for me, though:

SMS_Ld5.f10_f10_musgs.ISSP245Clm50BgcCrop.cheyenne_gnu.clm-ciso_dec2050Start.GC.rl-clm528chgnua/shell_commands:./xmlchange --force BFBFLAG="TRUE"
SMS_Ld5.f10_f10_musgs.ISSP370Clm50BgcCrop.cheyenne_gnu.clm-ciso_dec2050Start.GC.rl-clm528chgnua/shell_commands:./xmlchange --force BFBFLAG="TRUE"
SMS_Lm1.f10_f10_musgs.I1850Clm50BgcCropCmip6waccm.cheyenne_gnu.clm-basic.GC.rl-clm528chgnua/shell_commands:./xmlchange --force BFBFLAG="TRUE"
SMS_Ly1_Mmpi-serial.1x1_brazil.IHistClm50BgcQianGs.cheyenne_gnu.clm-output_bgc_highfreq.GC.rl-clm528chgnua/shell_commands:./xmlchange --force BFBFLAG="TRUE"
SMS_Ly1_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpGs.cheyenne_gnu.clm-output_sp_highfreq.GC.rl-clm528chgnua/shell_commands:./xmlchange --force BFBFLAG="TRUE"
SMS_Ly3_Mmpi-serial.1x1_numaIA.I2000Clm50BgcDvCropQianGs.cheyenne_gnu.clm-cropMonthOutput.GC.rl-clm528chgnua/shell_commands:./xmlchange --force BFBFLAG="TRUE"

ekluzek commented 5 years ago

OK, so it works in cime5.6.22, but fails in cime5.6.23.

jedwards4b commented 5 years ago

This has something to do with the PE layout. If I set NTASKS=36, ROOTPE=0, it passes.

jedwards4b commented 5 years ago

The problem is in seq_flux_mct.F90, subroutine seq_flux_readnl_mct. The routine is called with the cplid, but the mpi_bcast at the end of the routine is not limited to the coupler tasks, so the communicator is invalid on the tasks outside the coupler. The solution is either to call this routine with GLOID or to add

```fortran
if (seq_comm_iamin(ID)) then
   ! ... existing mpi_bcast call ...
endif
```

around the mpi_bcast.
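To illustrate the failure mode described above, here is a minimal sketch (plain Python with a toy communicator class, not CIME or real MPI code; all names in it are made up for illustration): tasks outside the coupler group effectively hold a null/invalid communicator, so an unguarded broadcast fails on them with the equivalent of MPI_ERR_COMM, while a seq_comm_iamin-style guard simply skips those tasks.

```python
class InvalidCommunicatorError(Exception):
    """Stands in for MPI_ERR_COMM: invalid communicator."""

class Comm:
    """Toy stand-in for an MPI communicator over a subset of tasks."""
    def __init__(self, members):
        self.members = set(members)

    def bcast(self, value, task):
        # A real MPI_Bcast on a communicator the task is not part of
        # (MPI_COMM_NULL) is an error; model that here.
        if task not in self.members:
            raise InvalidCommunicatorError(f"task {task} not in communicator")
        return value

def readnl_unguarded(comm, task, value):
    # Every task calls bcast, as in the failing routine: tasks outside
    # the coupler hit the invalid communicator.
    return comm.bcast(value, task)

def readnl_guarded(comm, task, value):
    # Guarded version: only member tasks participate, analogous to
    # wrapping the mpi_bcast in "if (seq_comm_iamin(ID)) then ... endif".
    if task in comm.members:
        return comm.bcast(value, task)
    return None

# Coupler runs on tasks 0-3 only; task 5 is outside the coupler group.
cpl_comm = Comm(members=range(0, 4))

try:
    readnl_unguarded(cpl_comm, task=5, value=42)
except InvalidCommunicatorError as err:
    print("unguarded:", err)            # fails, like the MPI_Bcast abort

print("guarded:", readnl_guarded(cpl_comm, task=5, value=42))   # skipped
print("member :", readnl_guarded(cpl_comm, task=2, value=42))   # gets value
```

This also shows why the NTASKS=36, ROOTPE=0 layout masks the bug: when every task is a member of the coupler communicator, the unguarded call never sees an invalid communicator.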

ekluzek commented 5 years ago

The suggested fix works for me.