Open JoshuaRady opened 4 years ago
@JoshuaRady is there anything informative in the cesm or lnd run logs in the run directory?
The stack trace above is from the CESM log of the crashing process (9 in this case). The land logs all end uninformatively wherever they happened to be with no error messages.
Digging back through the history of this issue, I (re)found a case where the CESM log file provided a more informative stack trace. It makes clear what is happening, but I don't know why it only happens with some instances and not others.
340268 4: NetCDF: Index exceeds dimension bound
340269 4: pio_support::pio_die:: myrank= -1 : ERROR:
340270 4: pionfwrite_mod::write_nfdarray_int: 250 :
340271 4: NetCDF: Index exceeds dimension bound
340272 4:Image PC Routine Line Source
340273 4:cesm.exe 00000000015C74FD Unknown Unknown Unknown
340274 4:cesm.exe 0000000000EE6191 pio_support_mp_pi 118 pio_support.F90
340275 4:cesm.exe 0000000000EE43BE pio_utils_mp_chec 74 pio_utils.F90
340276 4:cesm.exe 0000000000FE95FA pionfwrite_mod_mp 250 pionfwrite_mod.F90.in
340277 4:cesm.exe 0000000000FAF46F piodarray_mp_writ 650 piodarray.F90.in
340278 4:cesm.exe 0000000000FB17C4 piodarray_mp_writ 221 piodarray.F90.in
340279 4:cesm.exe 00000000005E887C ncdio_pio_mpncd 1657 ncdio_pio.F90.in
340280 4:cesm.exe 0000000000618C81 restutilmod_mp_re 344 restUtilMod.F90.in
340281 4:cesm.exe 000000000052221F clmfatesinterface 1103 clmfates_interfaceMod.F90
340282 4:cesm.exe 0000000000509429 clm_instmod_mp_cl 543 clm_instMod.F90
340283 4:cesm.exe 000000000060BDE6 restfilemod_mp_re 119 restFileMod.F90
340284 4:cesm.exe 00000000004FEC23 clm_driver_mp_clm 1168 clm_driver.F90
340285 4:cesm.exe 00000000004EBFD0 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90
340286 4:cesm.exe 0000000000425A58 component_modmp 724 component_mod.F90
340287 4:cesm.exe 0000000000409C2A cime_comp_modmp 2447 cime_comp_mod.F90
340288 4:cesm.exe 00000000004256EC MAIN__ 133 cime_driver.F90
340289 4:cesm.exe 0000000000407EDE Unknown Unknown Unknown
340290 4:libc.so.6 00002B6D6D07B6E5 __libc_start_main Unknown Unknown
340291 4:cesm.exe 0000000000407DE9 Unknown Unknown Unknown
340292 4:MPT ERROR: Rank 4(g:4) is aborting with error code 1.
340293 4: Process ID: 12893, Host: r14i7n7, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_KingAndQueen_AllPlots_LLpftP_2/bld/cesm.exe
340294 4: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
340295 4:
340296 4:MPT: --------stack traceback-------
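Since the failing rank differs between cases (9 in the first trace, 4 here), it may help to scan the CESM log for which ranks report an MPT error and what they printed beforehand. The following is a minimal sketch, not part of CESM or CIME. It assumes each log line has the shape seen above: an optional leading line number (presumably added by the log viewer), then `<rank>:`, then the message. The function name `summarize_failures` and both regular expressions are hypothetical.

```python
import re
from collections import defaultdict

# Assumed line shape: optional viewer line number, "<rank>:", then the message.
LINE_RE = re.compile(r"^\s*(?:\d+\s+)?(\d+):\s*(.*)$")
# Matches the two MPT failure notices that appear in the traces in this issue.
ABORT_RE = re.compile(r"MPT ERROR: Rank (\d+)\(g:\d+\) (?:is aborting|received signal)")


def summarize_failures(lines):
    """Return {rank: [error messages]} for ranks that report an MPT error."""
    messages = defaultdict(list)
    failed = set()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        rank, text = int(match.group(1)), match.group(2)
        # Keep only lines that look like error reports, not every stack frame.
        if "NetCDF:" in text or "ERROR" in text:
            messages[rank].append(text)
        abort = ABORT_RE.search(text)
        if abort:
            failed.add(int(abort.group(1)))
    return {rank: messages[rank] for rank in sorted(failed)}


if __name__ == "__main__":
    sample = [
        "340268 4: NetCDF: Index exceeds dimension bound",
        "340274 4:cesm.exe 0000000000EE6191 pio_support_mp_pi 118 pio_support.F90",
        "340292 4:MPT ERROR: Rank 4(g:4) is aborting with error code 1.",
    ]
    for rank, msgs in summarize_failures(sample).items():
        print(rank, msgs)
```

In multi-instance mode this would only identify the rank that aborted first. Instances that were terminated as a side effect may not print anything useful.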
When running CLM-FATES version fates_s1.31.0_api.8.0.0 with an increased number of size bins (fates_history_sizeclass_bin_edges & fates_history_height_bin_edges, 302 values each), simulations sometimes crash while writing the restart files.
Whether or not a crash occurs depends in some way on parameter file differences. I have not been able to determine which parameters are associated with the failure as I am running a single point in multi-instance mode and the offending process results in the early termination of some other simulations before they try to write their restart files. Most simulations that do finish write their restart files successfully.
These are one-off simulations, so I have been working around the crash by setting the XML variable REST_OPTION=never. However, I imagine this could become a problem in the future. This seems like a low priority issue for the community, but I wanted people to be aware of it.
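For anyone scripting the same workaround, here is a small sketch that applies the REST_OPTION=never setting through the case's `xmlchange` tool. The helper names `xmlchange_command` and `disable_restarts` are hypothetical, and the case directory is whatever your own case root is. By default the function only returns the command it would run.

```python
import subprocess
from pathlib import Path


def xmlchange_command(variable, value):
    """Build the xmlchange invocation for one XML variable."""
    return ["./xmlchange", f"{variable}={value}"]


def disable_restarts(case_dir, dry_run=True):
    """Set REST_OPTION=never in the given case directory.

    With dry_run=True (the default) nothing is executed and the
    command is simply returned for inspection.
    """
    cmd = xmlchange_command("REST_OPTION", "never")
    if not dry_run:
        # xmlchange lives in the case root, so run it from there.
        subprocess.run(cmd, cwd=Path(case_dir), check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical case path, shown as a dry run only.
    print(disable_restarts("/path/to/case"))
```

Note that with restarts disabled the run cannot be continued later, which is acceptable only for one-off simulations like the ones described here.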
An example stack trace, which I don't find very informative, is:
...
260551 9:MPT: #2 MPI_SGI_stacktraceback (
260552 9:MPT: header=header@entry=0x7ffcf009da40 "MPT ERROR: Rank 9(g:9) received signal SIGSEGV(11).\n\tProcess ID: 24926, Host: r11i4n21, Program: /glade/scratch/jmrady/FATES_VTSpacingTrial_HalifaxCoNC_AllPlots_LLpftP_2/bld/cesm.exe\n\tMPT Version: HPE"...) at sig.c:340
260553 9:MPT: #3 0x00002b68eed07fb2 in first_arriver_handler (signo=signo@entry=11,
260554 9:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b68f9300080) at sig.c:489
260555 9:MPT: #4 0x00002b68eed0834b in slave_sig_handler (signo=11,
260556 9:MPT: siginfo=, extra=) at sig.c:564
260557 9:MPT: #5
260558 9:MPT: #6 0x000000000081844a in fatesrestartinterfacemod_mp_set_restartvectors ()
260559 9:MPT: at /glade/work/jmrady/ClmVersions/fates_s1.31.0_api.8.0.0/cime/../src/fates/main/FatesRestartInterfaceMod.F90:1808
260560 9:MPT: #7 0x0000000000521981 in clmfatesinterfacemod_mprestart ()
...