NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

MPT issues on cheyenne #368

Closed rgknox closed 6 years ago

rgknox commented 6 years ago

Looks like people (myself included) are experiencing MPT errors running fates on Cheyenne.

Something like this:

0: memory_write: model date =   181206       0 memory =      87.40 MB (highwater)     129431.40 MB (usage)  (pe=    0 comps= ATM)
117:MPT ERROR: Rank 117(g:117) received signal SIGBUS(7).
117:    Process ID: 66993, Host: r2i0n23, Program: /glade2/scratch2/rgknox/fates-clm-tests/iclm45fates-finit-s1.9.0-a3.1.0-2trop-f45/bld/cesm.exe
117:    MPT Version: SGI MPT 2.15  09/03/16 04:15:54
117:
117:MPT: --------stack traceback-------

The errors are fairly uninformative, so it's hard to determine what is going on. I do know that in most scenarios where they are tripped, the simulations are gridded and multiple years in.

Please use this space to post MPT errors if you encounter them, and explain the run-time conditions, so we can start collecting a log of how and when these things crop up. Please indicate the driver you are using as well (I believe I am encountering them in both the ctsm and fates-clm contexts).

Things to note: ctsm vs. clm-fates, compset, grid, param file, etc.

ekluzek commented 6 years ago

@rgknox this is good. We want to make sure we also pass this information on to CISL, so we should think about how we are going to do that. We could collect several reports here and then pass them on to CISL as a group. I just want to make sure we get that part done.

I'm also assuming you are talking about cases where you resubmit and it works fine, right? If it stays stuck, it's likely a problem in the code.

rosiealice commented 6 years ago

@rgknox: it's probably unrelated, but @billsacks and @ekluzek were discussing some reproducibility errors that they occasionally encounter on Cheyenne yesterday. This made me wonder whether you have situations like that too (e.g., one test failing out of a whole load, and then randomly working again the next time). Given that these appear to be stochastic, it's difficult to document them, but I just wanted to mention it in case these things could be related.

rosiealice commented 6 years ago

Sorry- that was cross-posted with @ekluzek 's reply.

ckoven commented 6 years ago

All- Just wanted to chime in that I was seeing very similar errors when I was trying to run the ninst ensembles, https://github.com/NGEET/fates/issues/313. Not sure if they are related, but since they both show up as MPT crashes, seems like they might be.

ckoven commented 6 years ago

In response to @ekluzek's questions: yes, if I resubmit, then it passes the previous crash point. The number of timesteps between subsequent restarts seems pretty random. It's also happening on a different node each time, and the crashes don't coincide with restart writes.

ckoven commented 6 years ago

I should say that I am running into this using a regional grid over California (14 longitude x 22 latitude gridcells, spread across 144 CPUs, so 4 nodes). My log files look slightly different from what @rgknox posted; e.g., here are a couple of different excerpts from multiple runs of the same case:

77: nstep         =       825841
77: errsol        =  -1.199485950564849E-007
90: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
90: nstep         =       825841
90: errsol        =  -1.101323050534120E-007
7: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
7: nstep         =       825841
7: errsol        =  -1.084824816643959E-007
-1:MPT ERROR: MPI_COMM_WORLD rank 28 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 7

92: nstep         =       688801
92: errsol        =  -1.477645810155082E-007
106: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
106: nstep         =       688801
106: errsol        =  -1.020242734739440E-007
102:MPT ERROR: Rank 102(g:102) received signal SIGBUS(7).
102:    Process ID: 63545, Host: r11i2n20, Program: /glade2/scratch2/charlie/fates_clm5_fullmodel_california_test2_3pfts_nohydromort_respthrott_nospitfire_storage1pt8_3061dd9_8e96aef/bld/cesm.exe
102:    MPT Version: SGI MPT 2.15  12/18/16 02:58:06
102:
102:MPT: --------stack traceback-------
79: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
101: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
79: nstep         =       688801
101: nstep         =       688801
79: errsol        =  -1.060196836988325E-007
101: errsol        =  -1.395325170960859E-007
72: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
72: nstep         =       688801
72: errsol        =  -1.031516490002105E-007
77: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
77: nstep         =       688801
77: errsol        =  -1.187656835099915E-007
-1:MPT ERROR: MPI_COMM_WORLD rank 83 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 7

22: nstep         =       594862
22: errsol        =  -1.065974402081338E-007
84: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
84: nstep         =       594862
84: errsol        =  -1.009568109111569E-007
128: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
140: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
128: nstep         =       594862
140: nstep         =       594862
128: errsol        =  -1.057986196428828E-007
140: errsol        =  -1.074839133252681E-007
89: maxcohorts exceeded  5.500000081956387E-002
122:MPT ERROR: Rank 122(g:122) received signal SIGBUS(7).
122:    Process ID: 2793, Host: r12i6n10, Program: /glade2/scratch2/charlie/fates_clm5_fullmodel_california_test2_3pfts_nohydromort_respthrott_nospitfire_storage1pt8_3061dd9_8e96aef/bld/cesm.exe
122:    MPT Version: SGI MPT 2.15  12/18/16 02:58:06
122:
122:MPT: --------stack traceback-------
MPT: shepherd terminated: r12i6n10.ib0.cheyenne.ucar.edu - job aborting

rgknox commented 6 years ago

Note also that there may be a memory leak; see #370.

ckoven commented 6 years ago

OK, I rebuilt the model with debug on, and got it to crash again. The tail of the cesm log follows:

27: nstep         =       778658
27: errsol        =  -1.075301270248019E-007
57:forrtl: error (73): floating divide by zero
57:Image              PC                Routine            Line        Source             
57:cesm.exe           0000000003EDE7A1  Unknown               Unknown  Unknown
57:cesm.exe           0000000003EDC8DB  Unknown               Unknown  Unknown
57:cesm.exe           0000000003E8E3D4  Unknown               Unknown  Unknown
57:cesm.exe           0000000003E8E1E6  Unknown               Unknown  Unknown
57:cesm.exe           0000000003E0DCC9  Unknown               Unknown  Unknown
57:cesm.exe           0000000003E1A2F9  Unknown               Unknown  Unknown
57:libpthread-2.19.s  00002AAAAFAC1870  Unknown               Unknown  Unknown
57:cesm.exe           0000000002B7D757  dynpatchstateupda         189  dynPatchStateUpdaterMod.F90
57:cesm.exe           0000000000A1700C  dynsubgriddriverm         284  dynSubgridDriverMod.F90
57:cesm.exe           000000000087E555  clm_driver_mp_clm         306  clm_driver.F90
57:cesm.exe           000000000084B5B9  lnd_comp_mct_mp_l         451  lnd_comp_mct.F90
57:cesm.exe           000000000046BD2D  component_mod_mp_         688  component_mod.F90
57:cesm.exe           000000000043C474  cime_comp_mod_mp_        2652  cime_comp_mod.F90
57:cesm.exe           00000000004543B7  MAIN__                     68  cime_driver.F90
57:cesm.exe           0000000000415A5E  Unknown               Unknown  Unknown
57:libc-2.19.so       00002AAAB190AB25  __libc_start_main     Unknown  Unknown
57:cesm.exe           0000000000415969  Unknown               Unknown  Unknown
-1:MPT ERROR: MPI_COMM_WORLD rank 57 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 6

ckoven commented 6 years ago

For reference, the above run used fates hash 8e96aef and ctsm commit https://github.com/ESCOMP/ctsm/commit/3061dd9.

rgknox commented 6 years ago

It looks like the error is being triggered by a zero patch weight, maybe?

On the fates side of things the flow of information is like this:

patch%total_canopy_area defines the area [m2] that the canopy of each fates patch takes up. This is calculated in EDCanopyStructureMod, canopy_summarization().

That subroutine is always called before update_hlm_dynamics(), also in the same module, where we calculate the output boundary condition bc_out(s)%canopy_fraction_pa(ifp). Here, the total canopy area is converted into a fractional area.

This boundary condition is then used in the interface routine that calls those two routines, clmfates_interfaceMod.F90: wrap_update_hlmfates_dyn().

In that routine:

1. patch%wt_ed(bounds_clump%begp:bounds_clump%endp) is first zeroed.
2. The bare-ground patch weight is then set to 1 minus the sum of the canopy fractions from fates.
3. The canopy fractions are then transferred into the patch%wt_ed structure.

I've been looking through these routines to see if anything looks awry; a rough sketch of the hand-off follows.
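
For illustration, here is a minimal standalone sketch of that hand-off. The site area, the example numbers, and the final check are made up for the sketch; the real logic lives in EDCanopyStructureMod and clmfates_interfaceMod.F90, so treat this only as a picture of the flow, not the actual code:

program patch_weight_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: npatch = 3              ! fates patches on one site (illustrative)
  real(r8) :: total_canopy_area(npatch)         ! [m2] per patch, as from canopy_summarization()
  real(r8) :: canopy_fraction_pa(npatch)        ! fractional areas, as in bc_out(s)%canopy_fraction_pa
  real(r8) :: wt_ed(0:npatch)                   ! HLM-side patch weights; index 0 = bare ground
  real(r8) :: site_area

  site_area         = 10000._r8                 ! [m2] nominal site area (made up)
  total_canopy_area = (/ 2500._r8, 1500._r8, 0._r8 /)

  ! convert canopy area [m2] into fractional area (the update_hlm_dynamics() step)
  canopy_fraction_pa = total_canopy_area / site_area

  ! zero the weights, set the bare-ground patch, transfer the fractions
  ! (the wrap_update_hlmfates_dyn() steps listed above)
  wt_ed           = 0._r8
  wt_ed(0)        = 1._r8 - sum(canopy_fraction_pa)
  wt_ed(1:npatch) = canopy_fraction_pa

  ! a weight that is exactly zero, or slightly negative from round-off, is the
  ! kind of value a downstream per-area normalization cannot safely divide by
  if (any(wt_ed <= 0._r8)) then
    print *, 'zero or negative patch weight: ', wt_ed
  end if
end program patch_weight_sketch

The point of the sketch: if the canopy fractions coming out of fates ever sum to 1 or slightly more, the bare-ground weight lands at zero or goes negative, which could be the sort of value behind the divide by zero in the dynPatchStateUpdaterMod.F90 traceback above.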

ckoven commented 6 years ago

An update to say that I am cautiously optimistic that #372 did indeed fix this. I am re-running the case that was giving me problems before, and it has gone 30 years so far with no problem, which is more than twice as long as any of the runs in that configuration before...

rgknox commented 6 years ago

you just jinxed it

ckoven commented 6 years ago

so far, so good... closing.