NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

4x5 default simulation crashes after 510 days #26

Closed rgknox closed 8 years ago

rgknox commented 8 years ago

Summary of Issue:

Simulation with default namelist options, at res 4x5, crashes after 510 days

Steps to reproduce the problem (should include create_newcase or create_test command along with any user_nl or xml changes):

#!/bin/bash

COMP=ICLM45ED
MACHINE=lawrencium-lr2
COMPILER=intel
GITHASH=$(git log -n 1 --format=%h)
CASE=bugbranch_4x55yr${GITHASH}
CROOT=$GSCRATCH/clmed-tests/
WORKDIR=$(pwd)
export CASEROOT=${CROOT}/${CASE}
echo "CREATING NEW CASE IN "${CASEROOT}
rm -rf ${CASEROOT}

./create_newcase -case ${CASEROOT} -res 4x5_4x5 -compset ${COMP} -mach ${MACHINE} -compiler ${COMPILER}

cd ${CASEROOT}

./xmlchange -file env_run.xml -id STOP_OPTION -val nyears
./xmlchange -file env_run.xml -id STOP_N -val 5
./xmlchange -file env_build.xml -id CESMSCRATCHROOT -val ${CASEROOT}

./xmlchange -file env_build.xml -id DEBUG -val TRUE

./xmlchange -file env_build.xml -id SUPPORTED_BY -val 'clm-ed test case'
./xmlchange -file env_build.xml -id EXEROOT -val ${CASEROOT}/bld
./xmlchange -file env_run.xml -id REST_N -val 1
./xmlchange -file env_run.xml -id DOUT_S_SAVE_INTERIM_RESTART_FILES -val TRUE
./xmlchange -file env_run.xml -id DOUT_S_SAVE_EVERY_NTH_RESTART_FILE_SET -val 1
./xmlchange -file env_run.xml -id DOUT_S -val TRUE
./xmlchange -file env_run.xml -id DOUT_S_ROOT -val '$CASEROOT/restarts'
./xmlchange -file env_run.xml -id RUNDIR -val ${CASEROOT}/run

./cesm_setup

cat >> user_nl_clm << \EOF
finidat = ''
hist_mfilt = 1
hist_nhtfrq = -8760
EOF
cat >> user_nl_datm << EOF
EOF

./${CASE}.build

What is the changeset ID of the code, and the machine you are using:

8740a1a

have you modified the code? If so, it must be committed and available for testing:

no

Screen output or output files showing the error message and context:

lnd.log reports nothing fishy, it appears as though it completed day 510 (redundant modelday reports are from different tasks):

modelday 510.000000000000
modelday 510.000000000000
modelday 510.000000000000
modelday 510.000000000000
modelday 510.000000000000
modelday 510.000000000000
clm: leaving ED model 1 184 510 end run_mct

from cesm.log.xxxxx-xxxxx:

...
...
in run_mct
in run_mct
in run_mct
in run_mct
trimming patch area - is too big 1.818989403545856E-012
trimming patch area - is too big 1.818989403545856E-012
trimming patch area - is too big 1.818989403545856E-012
trimming patch area - is too big 1.818989403545856E-012
trimming patch area - is too big 1.818989403545856E-012
trimming patch area - is too big 1.818989403545856E-012
end run_mct
end run_mct
end run_mct
end run_mct
end run_mct
end run_mct
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image            PC                Routine            Line     Source
cesm.exe         0000000000FC8231  Unknown            Unknown  Unknown
cesm.exe         0000000000FC66E7  Unknown            Unknown  Unknown
cesm.exe         0000000000F68774  Unknown            Unknown  Unknown
cesm.exe         0000000000F68586  Unknown            Unknown  Unknown
cesm.exe         0000000000EFD916  Unknown            Unknown  Unknown
cesm.exe         0000000000F08C4D  Unknown            Unknown  Unknown
libpthread.so.0  00002AF41A08D710  Unknown            Unknown  Unknown
cesm.exe         0000000000AFF5F5  edcohortdynamicsm  727      EDCohortDynamicsMod.F90
cesm.exe         00000000007CEE49  edphysiologymod_m  1033     EDPhysiologyMod.F90
cesm.exe         00000000007B6B5E  edmainmod_mp_ed_d  169      EDMainMod.F90
cesm.exe         00000000005056BA  clm_driver_mp_clm  1025     clm_driver.F90
cesm.exe         00000000004F58A9  lnd_comp_mct_mp_l  437      lnd_comp_mct.F90
cesm.exe         000000000042AA8D  component_modmp    1044     component_mod.F90
cesm.exe         00000000004143AF  cesm_comp_modmp    2415     cesm_comp_mod.F90
cesm.exe         000000000042838D  MAIN__             93       cesm_driver.F90
cesm.exe         000000000041208E  Unknown            Unknown  Unknown
libc.so.6        00002AF41A2BAD5D  Unknown            Unknown  Unknown
cesm.exe         0000000000411F99  Unknown            Unknown  Unknown

Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[45070,1],3]

Exit code: 174

rgknox commented 8 years ago

I believe we have a working solution to this error. To summarize, a situation was occurring in the 4x5 simulation where new cohort recruits were being created with extremely low number densities. Since there is no cohort termination filter immediately following that call, these low density cohorts are passed to the fusion routine where their small numbers screw up math (especially the divisions in that routine). The correction is to simply apply a number density filter during recruitment, and only add the new cohort to the patch if its numbers are above a minimum threshold.

Corrections to this behavior are passing some early checks and tests. I still need to make sure that this very small amount of biomass is passed to the coarse woody debris pool (or perhaps litter flux?)

Note: terminate_cohort() does perform the transfer of carbon I am talking about, I just want to make sure that using this machinery or a call to that routine will work at the current location in the code
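
The filter described above can be sketched roughly as follows. This is an illustration in Python, not the model's Fortran: the `Cohort` class, the `fuse` and `recruit` functions, and the `MIN_N` value are all hypothetical stand-ins. It only shows why a number-weighted average is fragile when cohort numbers approach zero, and how rejecting such recruits up front keeps them away from that division.

```python
from dataclasses import dataclass

MIN_N = 1e-6  # assumed minimum number of individuals; not the model's actual value


@dataclass
class Cohort:
    n: float    # number of individuals in the cohort
    dbh: float  # diameter at breast height, a stand-in for cohort state


def fuse(a: Cohort, b: Cohort) -> Cohort:
    """Merge two cohorts using a number-weighted average of their state."""
    total = a.n + b.n
    # Fragile step: this division fails when the combined count is zero
    return Cohort(n=total, dbh=(a.n * a.dbh + b.n * b.dbh) / total)


def recruit(patch: list, n_new: float, dbh_new: float) -> bool:
    """Add a recruit cohort only if its numbers are above the threshold."""
    if n_new <= MIN_N:
        # Rejected: in a real model its carbon would still have to go somewhere
        return False
    patch.append(Cohort(n=n_new, dbh=dbh_new))
    return True
```

With this guard, a recruit with a vanishing count never reaches `fuse`, so the weighted average always divides by a count larger than `MIN_N`.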

rosiealice commented 8 years ago

Nice job Ryan etc.

One thing I've generally stumbled on with these kinds of 'termination' criteria is this: should the number density cutoff be a density (individuals per area), to avoid penalizing small patches, or should it be the absolute value of cohort%n, so that the small numbers can't be inadvertently caused by very small patch areas? (Following this, should the whole fuse_cohorts scheme be changed to somehow use number density rather than absolute values?)



rgknox commented 8 years ago

Changing my approach. I noted that recruitment is called inside edmain. After recruitment, edmain calls terminate_cohort and then fuses and sorts.

So the order of operations was:

recruit
fuse (in the recruit subroutine) <---- removing
sort (in the recruit subroutine) <---- removing
terminate (in edmain)
fuse (in edmain)
sort (in edmain)

So all I have to do is remove the calls to fuse and sort inside the recruitment subroutine, because they are redundant and run before termination has been applied. This is better because it removes redundant calls, and uses terminate_cohorts() (rightfully) to check those recruits and handle carbon appropriately.
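
The revised ordering can be sketched as a small pipeline. This is a Python illustration only: the functions below are hypothetical simplifications (cohorts as dictionaries, fusion by similar `dbh`, tallest-first sorting, an assumed `min_n`), not the model's routines. What it shows is that once recruitment no longer fuses, every recruit passes through termination before it can take part in a fusion.

```python
def recruit(cohorts):
    # Hypothetical recruit with a vanishingly small number of individuals
    return cohorts + [{"n": 1e-12, "dbh": 0.5}]


def terminate(cohorts, min_n=1e-6):
    # Drop cohorts whose numbers are too small to track (assumed threshold)
    return [c for c in cohorts if c["n"] > min_n]


def fuse(cohorts, dbh_tol=0.1):
    # Merge cohorts of similar size using a number-weighted average
    fused = []
    for c in sorted(cohorts, key=lambda c: c["dbh"]):
        if fused and abs(fused[-1]["dbh"] - c["dbh"]) <= dbh_tol:
            last = fused[-1]
            total = last["n"] + c["n"]
            last["dbh"] = (last["n"] * last["dbh"] + c["n"] * c["dbh"]) / total
            last["n"] = total
        else:
            fused.append(dict(c))
    return fused


def sort(cohorts):
    # Assumed ordering: largest cohorts first
    return sorted(cohorts, key=lambda c: c["dbh"], reverse=True)


def daily_step(cohorts):
    cohorts = recruit(cohorts)    # no fuse or sort inside recruitment
    cohorts = terminate(cohorts)  # tiny recruits are removed here
    cohorts = fuse(cohorts)       # fusion only sees surviving cohorts
    return sort(cohorts)
```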

rgknox commented 8 years ago

@rosiealice: I was thinking and playing around with this too. I think the termination criterion is per area right now, and that makes sense to me. One criterion (absolute number) is better for well-behaved math, and the other (density) is probably more scientifically meaningful.

I think the existing threshold stated that fewer than 1e-4 cohorts per m2 were not worth tracking. At least for terminating cohorts, I don't see any reason we can't have both criteria, so that if there is a very, very small patch and also a very small number of cohorts on it, we don't run into numerical weirdness.
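
A check that applies both criteria might look like the sketch below. It is written in Python for illustration and is not the model's code: `should_terminate` is a hypothetical name, `n` is assumed to be the absolute number of individuals on the patch, `MIN_DENSITY` reuses the 1e-4 per m2 figure recalled above, and `MIN_ABSOLUTE` is an invented floor.

```python
MIN_DENSITY = 1e-4   # per m2, the value recalled in the comment above
MIN_ABSOLUTE = 1e-6  # assumed absolute floor on individuals; purely illustrative


def should_terminate(n: float, patch_area: float) -> bool:
    """Return True if a cohort fails either the density or the absolute test."""
    if patch_area <= 0.0:
        return True  # no area left to hold the cohort
    density = n / patch_area
    # Density test: scientifically meaningful, independent of patch size.
    # Absolute test: protects later arithmetic on very small patches.
    return density < MIN_DENSITY or n < MIN_ABSOLUTE
```

On a tiny patch, `should_terminate(5e-7, 1e-3)` passes the density test but is caught by the absolute floor; on a large one, `should_terminate(0.5, 1e4)` is caught by the density test instead.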

rgknox commented 8 years ago

PR submitted #28