@rgknox this is good. We want to make sure we also pass this information on to CISL, so we should think about how we are going to do that. We could collect several reports here and then pass the group of them on to CISL. I just want to make sure that part gets done.
I'm also assuming you are talking about cases where you resubmit and it works fine, right? If it stays stuck, it's likely a problem in the code.
@rgknox: it's probably unrelated, but @billsacks and @ekluzek were discussing yesterday some reproducibility errors that they occasionally encounter on Cheyenne. This made me wonder whether you have situations like that too (e.g. one test failing out of a whole load, and then randomly working again the next time). Given these appear to be stochastic, it's difficult to document them, but I just wanted to mention it in case these things could be related.
Sorry, that was cross-posted with @ekluzek's reply.
All: just wanted to chime in that I was seeing very similar errors when trying to run the ninst ensembles (https://github.com/NGEET/fates/issues/313). Not sure if they are related, but since both show up as MPT crashes, it seems like they might be.
In response to @ekluzek's questions: yes, if I resubmit, it passes the previous crash point. The number of timesteps between subsequent restarts seems pretty random. It's also happening on a different node each time, and the crashes don't coincide with restart writes.
I should say that I am running into this using a regional grid over California (14 longitude x 22 latitude gridcells, spread across 144 CPUs, i.e. 4 nodes). My log files look slightly different from what @rgknox posted; e.g., here are a couple of different ones from multiple runs of the same case:
77: nstep = 825841
77: errsol = -1.199485950564849E-007
90: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
90: nstep = 825841
90: errsol = -1.101323050534120E-007
7: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
7: nstep = 825841
7: errsol = -1.084824816643959E-007
-1:MPT ERROR: MPI_COMM_WORLD rank 28 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 7
92: nstep = 688801
92: errsol = -1.477645810155082E-007
106: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
106: nstep = 688801
106: errsol = -1.020242734739440E-007
102:MPT ERROR: Rank 102(g:102) received signal SIGBUS(7).
102: Process ID: 63545, Host: r11i2n20, Program: /glade2/scratch2/charlie/fates_clm5_fullmodel_california_test2_3pfts_nohydromort_respthrott_nospitfire_storage1pt8_3061dd9_8e96aef/bld/cesm.exe
102: MPT Version: SGI MPT 2.15 12/18/16 02:58:06
102:
102:MPT: --------stack traceback-------
79: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
101: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
79: nstep = 688801
101: nstep = 688801
79: errsol = -1.060196836988325E-007
101: errsol = -1.395325170960859E-007
72: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
72: nstep = 688801
72: errsol = -1.031516490002105E-007
77: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
77: nstep = 688801
77: errsol = -1.187656835099915E-007
-1:MPT ERROR: MPI_COMM_WORLD rank 83 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 7
22: nstep = 594862
22: errsol = -1.065974402081338E-007
84: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
84: nstep = 594862
84: errsol = -1.009568109111569E-007
128: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
140: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
128: nstep = 594862
140: nstep = 594862
128: errsol = -1.057986196428828E-007
140: errsol = -1.074839133252681E-007
89: maxcohorts exceeded 5.500000081956387E-002
122:MPT ERROR: Rank 122(g:122) received signal SIGBUS(7).
122: Process ID: 2793, Host: r12i6n10, Program: /glade2/scratch2/charlie/fates_clm5_fullmodel_california_test2_3pfts_nohydromort_respthrott_nospitfire_storage1pt8_3061dd9_8e96aef/bld/cesm.exe
122: MPT Version: SGI MPT 2.15 12/18/16 02:58:06
122:
122:MPT: --------stack traceback-------
MPT: shepherd terminated: r12i6n10.ib0.cheyenne.ucar.edu - job aborting
Note also that there may be a memory leak; see #370.
OK, I rebuilt the model with debug on, and got it to crash again. The tail of the cesm log follows:
27: nstep = 778658
27: errsol = -1.075301270248019E-007
57:forrtl: error (73): floating divide by zero
57:Image PC Routine Line Source
57:cesm.exe 0000000003EDE7A1 Unknown Unknown Unknown
57:cesm.exe 0000000003EDC8DB Unknown Unknown Unknown
57:cesm.exe 0000000003E8E3D4 Unknown Unknown Unknown
57:cesm.exe 0000000003E8E1E6 Unknown Unknown Unknown
57:cesm.exe 0000000003E0DCC9 Unknown Unknown Unknown
57:cesm.exe 0000000003E1A2F9 Unknown Unknown Unknown
57:libpthread-2.19.s 00002AAAAFAC1870 Unknown Unknown Unknown
57:cesm.exe 0000000002B7D757 dynpatchstateupda 189 dynPatchStateUpdaterMod.F90
57:cesm.exe 0000000000A1700C dynsubgriddriverm 284 dynSubgridDriverMod.F90
57:cesm.exe 000000000087E555 clm_driver_mp_clm 306 clm_driver.F90
57:cesm.exe 000000000084B5B9 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
57:cesm.exe 000000000046BD2D component_mod_mp_ 688 component_mod.F90
57:cesm.exe 000000000043C474 cime_comp_mod_mp_ 2652 cime_comp_mod.F90
57:cesm.exe 00000000004543B7 MAIN__ 68 cime_driver.F90
57:cesm.exe 0000000000415A5E Unknown Unknown Unknown
57:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown
57:cesm.exe 0000000000415969 Unknown Unknown Unknown
-1:MPT ERROR: MPI_COMM_WORLD rank 57 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 6
For reference, the above run used FATES hash 8e96aef and CTSM https://github.com/ESCOMP/ctsm/commit/3061dd9
It looks like the error is being triggered by a zero patch weight, maybe?
On the FATES side of things, the flow of information is like this:
patch%total_canopy_area defines the area [m2] that the canopy of each fates patch takes up. This is calculated in EDCanopyStructureMod, canopy_summarization().
That subroutine is always called before update_hlm_dynamics(), also in the same module. There we calculate the output boundary condition bc_out(s)%canopy_fraction_pa(ifp); this is where the total area of the canopy is converted into a fractional area (see the sketch just below).
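A minimal sketch of that conversion, assuming the fraction is simply canopy area divided by a reference ground area. This is not the actual FATES source; the subroutine name, ground_area_m2, and the flat array arguments are illustrative placeholders:

```fortran
! Illustrative only: area-to-fraction conversion as described above.
! ground_area_m2 is a hypothetical stand-in for whatever reference area
! FATES normalizes against; real(8) stands in for the model's r8 kind.
subroutine canopy_area_to_fraction(npatches, ground_area_m2, &
                                   total_canopy_area, canopy_fraction_pa)
  implicit none
  integer, intent(in)  :: npatches                      ! number of FATES patches
  real(8), intent(in)  :: ground_area_m2                ! reference ground area [m2]
  real(8), intent(in)  :: total_canopy_area(npatches)   ! per-patch canopy area [m2]
  real(8), intent(out) :: canopy_fraction_pa(npatches)  ! per-patch fractional area [-]
  integer :: ifp

  do ifp = 1, npatches
     ! cap at 1 so roundoff cannot push the fraction above full coverage
     canopy_fraction_pa(ifp) = min(1.0d0, total_canopy_area(ifp) / ground_area_m2)
  end do
end subroutine canopy_area_to_fraction
```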
This boundary condition is then used in the interface routine that calls those two routines, wrap_update_hlmfates_dyn() in clmfates_interfaceMod.F90, which does the following:
patch%wt_ed(bounds_clump%begp:bounds_clump%endp) is first zeroed.
Then we set the bare-ground patch weight to 1 minus the sum of the fractions from FATES.
Then we transfer the fractions into the patch%wt_ed structure.
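A condensed sketch of those three steps, assuming one bare-ground patch at the start of the clump's patch range followed by the FATES patches. Again, this is a simplification, not the CTSM code: the real routine works through filters and the column/patch hierarchy, and the names here just mirror the description above.

```fortran
! Illustrative only: the zero / bare-ground / transfer sequence described above.
! Assumes nfates <= endp - begp so the FATES patches fit after the bare-ground
! patch at index begp.
subroutine update_wt_ed_sketch(begp, endp, nfates, canopy_fraction_pa, wt_ed)
  implicit none
  integer, intent(in)    :: begp, endp                  ! HLM patch bounds for the clump
  integer, intent(in)    :: nfates                      ! number of FATES patches
  real(8), intent(in)    :: canopy_fraction_pa(nfates)  ! fractions from FATES
  real(8), intent(inout) :: wt_ed(begp:endp)            ! HLM-side patch weights
  integer :: ifp

  ! 1) zero the weights over the clump bounds
  wt_ed(begp:endp) = 0.0d0

  ! 2) bare ground carries whatever the FATES patches do not cover
  wt_ed(begp) = 1.0d0 - sum(canopy_fraction_pa(1:nfates))

  ! 3) transfer the FATES fractions onto the remaining patches
  do ifp = 1, nfates
     wt_ed(begp + ifp) = canopy_fraction_pa(ifp)
  end do
end subroutine update_wt_ed_sketch
```

Note that if the fractions ever sum to exactly 1, the bare-ground weight in step 2 lands at zero, which would be the kind of zero patch weight speculated about above; I have not confirmed that this is what line 189 of dynPatchStateUpdaterMod.F90 divides by.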
I've been looking through these routines to see if anything looks awry.
An update to say that I am cautiously optimistic that #372 did indeed fix this. I am re-running the case that was giving me problems before, and it has gone 30 years so far with no problem, which is more than twice as long as any of the runs in that configuration before...
you just jinxed it
so far, so good... closing.
Looks like people (myself included) are experiencing MPT errors running fates on cheyenne.
Something like this:
The errors are fairly uninformative, so it's hard to determine what is going on. I do know that in most scenarios where they are tripped, the simulations are gridded and multiple years in.
Please use this space to post MPT errors if you encounter them, and explain the run-time conditions so we can start collecting a log of how and when these things crop up. Please indicate the driver that you are using as well (I believe I am encountering them in both CTSM and FATES-CLM contexts).
Things to note: CTSM vs. CLM-FATES, compset, grid, parameter file, etc.