NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

Balance Check failure in fire runs #378

Closed: jkshuman closed this issue 6 years ago

jkshuman commented 6 years ago

Getting a fail in fire runs; it seems to be due to a Balance Check. This happens in both CLM45 and CLM5 runs at year 5 with 2 PFTs (tropical tree and grass). Non-fire runs haven't failed through year 10, but I will resubmit longer runs. ctsm git hash: 2dba074; fates git hash: f8d7693. Here is the create case statement:

./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported

From within cesm.log (end of cesm.log below):

396: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 396: nstep = 96934 396: errsol = -1.031027636599902E-007 529: Large Dir Radn consvn error 87346.4733653322 1 2 529: diags 46218.1932574409 -0.338494232152740 589450.614042712
529: -394259.718697869
529: lai_change 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: 6.38062653664038 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 529: elai 0.000000000000000E+000 0.000000000000000E+000 0.961064260932761
529: 0.000000000000000E+000 0.000000000000000E+000 0.958469792135196
529: 0.000000000000000E+000 0.000000000000000E+000 0.122722763358372
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: esai 0.000000000000000E+000 0.000000000000000E+000 3.893573906723917E-002 529: 0.000000000000000E+000 0.000000000000000E+000 3.883117669682943E-002 529: 0.000000000000000E+000 0.000000000000000E+000 4.984874625802597E-003 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: ftweight 1.00000000000000 0.000000000000000E+000 529: 0.000000000000000E+000 1.00000000000000 0.000000000000000E+000 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 529: cp 9.580078716659667E-011 1 529: bc_in(s)%albgr_dir_rb(ib) 0.557730205770928
529: >5% Dif Radn consvn error -2474470293.77894 1 2 529: diags 639144447.809849 -10366553911.8306 6420139512.41898
529: lai_change 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: 6.38062653664038 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 529: elai 0.000000000000000E+000 0.000000000000000E+000 0.961064260932761
529: 0.000000000000000E+000 0.000000000000000E+000 0.958469792135196
529: 0.000000000000000E+000 0.000000000000000E+000 0.122722763358372
529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: esai 0.000000000000000E+000 0.000000000000000E+000 3.893573906723917E-002 529: 0.000000000000000E+000 0.000000000000000E+000 3.883117669682943E-002 529: 0.000000000000000E+000 0.000000000000000E+000 4.984874625802597E-003 529: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 529: ftweight 0.000000000000000E+000 0.000000000000000E+000 529: 37.4271707468345 0.000000000000000E+000 0.000000000000000E+000 529: 37.4271707468345 0.000000000000000E+000 0.000000000000000E+000 529: 31.0465442101942 0.000000000000000E+000 0.000000000000000E+000 529: 0.000000000000000E+000 529: cp 9.580078716659667E-011 1 529: bc_in(s)%albgr_dif_rb(ib) 0.557730205770928
529: rhol 0.100000001490116 0.100000001490116 0.100000001490116
529: 0.449999988079071 0.449999988079071 0.349999994039536
529: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000 529: 0.000000000000000E+000 529: present 1 0 0 529: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000 465: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 465: nstep = 96935 465: errsol = -1.048202307174506E-007 433: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 433: nstep = 96935 433: errsol = -1.017730255625793E-007 358: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 358: nstep = 96936 358: errsol = -1.278503987123258E-007 432: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 432: nstep = 96936 432: errsol = -1.040576194100140E-007 431: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 431: nstep = 96936 431: errsol = -1.129041606873216E-007 466: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 466: nstep = 96936 466: errsol = -1.248336616299639E-007 433: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 433: nstep = 96936 433: errsol = -1.003071474769968E-007 529: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 529: nstep = 96936 529: errsol = 1.383552742595384E-005 529: clm model is stopping - error is greater than 1e-5 (W/m2) 529: fsa = 12787101170.2958
529: fsr = -12787101148.9356
529: forc_solad(1) = 2.30644280577964
529: forc_solad(2) = 3.71261017842798
529: forc_solai(1) = 8.37364785641270
529: forc_solai(2) = 6.96748048376436
529: forc_tot = 21.3601813243847
529: clm model is stopping 529: calling getglobalwrite with decomp_index= 39670 and clmlevel= pft 529: local patch index = 39670 529: global patch index = 15897 529: global column index = 8008 529: global landunit index = 2104 529: global gridcell index = 494 529: gridcell longitude = 290.000000000000
529: gridcell latitude = -15.5497382198953
529: pft type = 1 529: column type = 1 529: landunit type = 1 529: ENDRUN: 529: ERROR in BalanceCheckMod.F90 at line 543
529: ERROR: Unknown error submitted to shr_abort_abort. 413: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 413: nstep = 96936 413: errsol = -1.288894111439731E-007 397: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 397: nstep = 96937 397: errsol = -1.022812625706138E-007 319: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 319: nstep = 96937 319: errsol = -1.036731305248395E-007 395: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 395: nstep = 96937 395: errsol = -1.211479911944480E-007 432: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 432: nstep = 96937 432: errsol = -1.264885440832586E-007 464: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 464: nstep = 96937 464: errsol = -1.101450379792368E-007 431: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 431: nstep = 96937 431: errsol = -1.387476800118748E-007 433: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 433: nstep = 96937 433: errsol = -1.261905708815902E-007 529:Image PC Routine Line Source
529:cesm.exe 0000000001237DAD Unknown Unknown Unknown 529:cesm.exe 0000000000D1B432 shr_abort_modmp 114 shr_abort_mod.F90 529:cesm.exe 0000000000503CD5 abortutils_mp_end 77 abortutils.F90 529:cesm.exe 0000000000677E2D balancecheckmod_m 543 BalanceCheckMod.F90 529:cesm.exe 000000000050AF77 clm_driver_mp_clm 924 clm_driver.F90 529:cesm.exe 00000000004F9516 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90 529:cesm.exe 0000000000430E14 component_modmp 688 component_mod.F90 529:cesm.exe 0000000000417D59 cime_comp_modmp 2652 cime_comp_mod.F90 529:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90 529:cesm.exe 0000000000415C5E Unknown Unknown Unknown 529:libc-2.19.so 00002AAAB190AB25 libc_start_main Unknown Unknown 529:cesm.exe 0000000000415B69 Unknown Unknown Unknown 529:MPT ERROR: Rank 529(g:529) is aborting with error code 1001. 529: Process ID: 53637, Host: r12i2n18, Program: /glade2/scratch2/jkshuman/Fire0504_Obrienh_Saldaa_Saldal_agb1zero_2PFT_1x1_2dba074_f8d7693/bld/cesm.exe 529: MPT Version: SGI MPT 2.15 12/18/16 02:58:06 529: 529:MPT: --------stack traceback------- 0: memory_write: model date = 60715 0 memory = 65749.16 MB (highwater) 102.04 MB (usage) (pe= 0 comps= ATM ESP) 529:MPT: Attaching to program: /proc/53637/exe, process 53637 529:MPT: done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=3d290be00d48b823d3b71df2249e80d881bc473d" 529:MPT: (no debugging symbols found)...done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=5409c48fdb15e90649c1407e444fbe31d6dc8ec1" 529:MPT: (no debugging symbols found)...done. 529:MPT: [Thread debugging using libthread_db enabled] 529:MPT: Using host libthread_db library "/glade/u/apps/ch/os/lib64/libthread_db.so.1". 529:MPT: Try: zypper install -C "debuginfo(build-id)=e97cfdb062d6f0c41073f2109a7605d0ae991c03" 529:MPT: (no debugging symbols found)...done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=f43d7754940a14ffe3d9bd8fc9472ffbbfead544" 529:MPT: (no debugging symbols found)...done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=0ea764119690f32c98faae9a63a73f35ed8b1099" 529:MPT: (no debugging symbols found)...done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=15916519d9dbaea26ec88427460b4cedb9c0a6ab" 529:MPT: (no debugging symbols found)...done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=79264652a62453da222372a430cd9351d4bbcbde" 529:MPT: (no debugging symbols found)...done. 529:MPT: Try: zypper install -C "debuginfo(build-id)=68682e9ac223d269cbecb94315fcec5e16b32bfb" 529:MPT: (no debugging symbols found)...done. 529:MPT: 0x00002aaaafac141c in waitpid () from /glade/u/apps/ch/os/lib64/libpthread.so.0 529:MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-35.1.x86_64 529:MPT: (gdb) #0 0x00002aaaafac141c in waitpid () 529:MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0 529:MPT: #1 0x00002aaab16215d6 in mpi_sgi_system ( 529:MPT: #2 MPI_SGI_stacktraceback ( 529:MPT: header=header@entry=0x7ffffffeeb70 "MPT ERROR: Rank 529(g:529) is aborting with error code 1001.\n\tProcess ID: 53637, Host: r12i2n18, Program: /glade2/scratch2/jkshuman/Fire0504_Obrienh_Saldaa_Saldal_agb1zero_2PFT_1x1_2dba074_f8d7693/bld"...) 
at sig.c:339 529:MPT: #3 0x00002aaab1574d6f in print_traceback (ecode=ecode@entry=1001) 529:MPT: at abort.c:227 529:MPT: #4 0x00002aaab1574fda in PMPI_Abort (comm=, errorcode=1001) 529:MPT: at abort.c:66 529:MPT: #5 0x00002aaab157528d in pmpi_abort () 529:MPT: from /opt/sgi/mpt/mpt-2.15/lib/libmpi.so 529:MPT: #6 0x0000000000e191a9 in shr_mpi_mod_mp_shr_mpiabort () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/share/util/shr_mpi_mod.F90:2132 529:MPT: #7 0x0000000000d1b4d8 in shr_abort_mod_mp_shr_abortabort () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/share/util/shr_abort_mod.F90:69 529:MPT: #8 0x0000000000503cd5 in abortutils_mp_endrunglobalindex () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/main/abortutils.F90:77 529:MPT: #9 0x0000000000677e2d in balancecheckmod_mpbalancecheck () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/biogeophys/BalanceCheckMod.F90:543 529:MPT: #10 0x000000000050af77 in clm_driver_mp_clmdrv () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/main/clm_driver.F90:924 529:MPT: #11 0x00000000004f9516 in lnd_comp_mct_mp_lnd_runmct () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/src/cpl/lnd_comp_mct.F90:451 529:MPT: #12 0x0000000000430e14 in component_mod_mp_componentrun () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/drivers/mct/main/component_mod.F90:688 529:MPT: #13 0x0000000000417d59 in cime_comp_mod_mp_cimerun () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/drivers/mct/main/cime_comp_mod.F90:2652 529:MPT: #14 0x0000000000430b3d in MAIN__ () 529:MPT: at /glade/p/work/jkshuman/git/ctsm/cime/src/drivers/mct/main/cime_driver.F90:68 529:MPT: #15 0x0000000000415c5e in main () 529:MPT: (gdb) A debugging session is active. 529:MPT: 529:MPT: Inferior 1 [process 53637] will be detached. 529:MPT: 529:MPT: Quit anyway? (y or n) [answered Y; input not from terminal] 529:MPT: Detaching from program: /proc/53637/exe, process 53637 529: 529:MPT: -----stack traceback ends----- -1:MPT ERROR: MPI_COMM_WORLD rank 529 has terminated without calling MPI_Finalize() -1: aborting job
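
For reference, the check that aborts the run here (BalanceCheckMod.F90, around line 543) compares absorbed plus reflected solar radiation against the total incident forcing. The arithmetic can be reconstructed from the numbers in the log above; the sketch below is built from that printout, not from the CLM source, so treat the variable handling as illustrative:

program solar_balance
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: fsa, fsr, forc_solad(2), forc_solai(2), forc_tot, errsol

  fsa        = 12787101170.2958_r8    ! absorbed solar from the failing patch
  fsr        = -12787101148.9356_r8   ! reflected solar (a negative value is already suspect)
  forc_solad = (/ 2.30644280577964_r8, 3.71261017842798_r8 /)  ! direct vis/nir forcing
  forc_solai = (/ 8.37364785641270_r8, 6.96748048376436_r8 /)  ! diffuse vis/nir forcing

  forc_tot = sum(forc_solad) + sum(forc_solai)  ! = 21.3601813243847 W/m2, as logged
  errsol   = fsa + fsr - forc_tot               ! ~1.38E-5 W/m2; the abort threshold is 1e-5
  print *, 'forc_tot =', forc_tot, 'errsol =', errsol
end program solar_balance

The balance only closes because two ~1.3e10 W/m2 terms nearly cancel, which is why the huge radiation conservation errors upstream matter even when errsol itself stays small.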

jkshuman commented 6 years ago

@ekluzek @rosiealice @rgknox @ckoven I am getting a balance check error in the fire runs. This is using the latest fates version, which incorporates the memory leak fix, merged with an added history variable from my branch. The error that is written out is within the CLM BalanceCheckMod.F90. The system is down, and I can't get more information at the moment. When I was looking at it last night, I submitted the run with a switch from nyears to nmonths. As I watched the file list in the case/run folder, the cesm.log would pop up and then disappear; I was not able to see whether it finally appeared last night. I haven't seen that behavior before (an inability to write the cesm.log). I did cancel the run and restart, and it was the same behavior, where the cesm.log would appear and disappear. I will try resubmitting with stop_option set to ndays - maybe it isn't completing the month? Any advice/help on what to look for would be appreciated.

Erik - does this look at all similar to the balance check error we saw in the past?

rgknox commented 6 years ago

Some things I'm noticing: the radiation solution errors are quite large, and errors that large could easily generate NaNs or cause havoc anywhere downstream in the code. These errors appear to be triggered over and over again in the same patch. The patch area is ~e-11 in size, which seems like maybe it should be culled? In the arrays that are printed out (lai_change, elai, ftweight, etc.), I'm surprised that there are some lai_change values (which is change in light level per change in LAI, maybe...) where I see no tai, but it's hard to tell why this is so.
I'm wondering if perhaps the "ftweight" variable is being filled incorrectly, maybe because there is something special about the grasses. I can't really tell exactly what is happening, though; also, the diagnostic that writes this stuff uses canopy layer 1 for ftweight but ncl_p for the others...

Do these runs have grasses with some structural biomass, or are they 0 structure/sap?
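
As a minimal sketch of the tiny-patch cull suggested above: scan for patches whose area is effectively zero before they reach the radiation solver. The threshold and areas here are hypothetical/illustrative, not FATES names or values (the last area mirrors the ~9.6e-11 "cp" printed in the log):

program cull_check
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: min_patch_area = 1.e-8_r8  ! hypothetical cutoff, not a FATES parameter
  real(r8) :: patch_area(3)
  integer  :: i

  patch_area = (/ 0.95_r8, 4.9e-2_r8, 9.580078716659667e-11_r8 /)
  do i = 1, size(patch_area)
     if (patch_area(i) < min_patch_area) then
        print *, 'patch', i, 'area', patch_area(i), '-> candidate for fusion/termination'
     end if
  end do
end program cull_check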

jkshuman commented 6 years ago

allom_latosa_int = zero, but I had variants with allom_agb1 = zero and allom_agb1 = 0.0001 (both variants failed).

I will try a variant with allom_latosa_int set to default and allom_agb1 = 0.0001.



jkshuman commented 6 years ago

The run which uses allom_latosa_int = default and allom_agb1 = 0.0001 for grass also fails in year 5 with fire. (This is a bad case name, as it uses default allometry; will fix that...) /glade2/scratch2/jkshuman/Fire0507_Obrienh_Saldaa_Saldal_latosa_int_default_2PFT_1x1_2dba074_f8d7693/run Similar failure message in year 5. In the cesm.log there is a set of "NetCDF: invalid dimension ID or name" statements, followed by patch trimming, then solar radiation balance check errors, more patch trimming, and more radiation balance check errors. It then again points to CLM BalanceCheckMod.F90 line 543.

WARNING:: BalanceCheck, solar radiation balance error (W/m2) 334: nstep = 96938 334: errsol = -1.311063329012541E-007 330: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 330: nstep = 96938 330: errsol = -1.427682150278997E-007 529:Image PC Routine Line Source
529:cesm.exe 0000000001237DAD Unknown Unknown Unknown 529:cesm.exe 0000000000D1B432 shr_abort_modmp 114 shr_abort_mod.F90 529:cesm.exe 0000000000503CD5 abortutils_mp_end 77 abortutils.F90 529:cesm.exe 0000000000677E2D balancecheckmod_m 543 BalanceCheckMod.F90 529:cesm.exe 000000000050AF77 clm_driver_mp_clm 924 clm_driver.F90 529:cesm.exe 00000000004F9516 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90 529:cesm.exe 0000000000430E14 component_modmp 688 component_mod.F90 529:cesm.exe 0000000000417D59 cime_comp_modmp 2652 cime_comp_mod.F90 529:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90 529:cesm.exe 0000000000415C5E Unknown Unknown Unknown 529:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown 529:cesm.exe 0000000000415B69 Unknown Unknown Unknown

jkshuman commented 6 years ago

That is the right case name after all - Obrien Salda is the default allometry... too many iterations on this.

jkshuman commented 6 years ago

@rgknox @rosiealice I did another set of runs for single and 2 PFTs for a regional run in South America. Both failures show the same set of solar radiation balance check errors. I include pieces of the cesm.log for the failed runs.

general case statement: ./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported

1 PFT (no fire) for Grass and Trop Tree completed to year 21 with reasonable biomass and distribution. 1 PFT (Fire) for Trop Tree completed through year 21. 1 PFT (Fire) for Grass failed at year 11. (cesm.log piece below)

2 PFT (Fire) for Trop Tree and Grass failed at year 5. (cesm.log piece after the fire grass log)

/glade2/scratch2/jkshuman/Fire_Grass_1x1_2dba074_f8d7693/run Errors: clmfates_interfaceMod.F90:: reading froz_q10 217: NetCDF: Invalid dimension ID or name 217: NetCDF: Invalid dimension ID or name 217: NetCDF: Invalid dimension ID or name 217: NetCDF: Invalid dimension ID or name 217: NetCDF: Invalid dimension ID or name 217: NetCDF: Variable not found 217: NetCDF: Variable not found 0:(seq_domain_areafactinit) : min/max mdl2drv 1.00000000000000 1.00000000000000 areafact_a_ATM 0:(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00000000000000 areafact_a_ATM 102: trimming patch area - is too big 1.818989403545856E-012 109: trimming patch area - is too big 1.818989403545856E-012 467: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 467: nstep = 192742 467: errsol = -1.090609771381423E-007

(and from further within the cesm.log...) WARNING:: BalanceCheck, solar radiation balance error (W/m2) 202: nstep = 195723 202: errsol = -1.013256678561447E-007 180:Image PC Routine Line Source
180:cesm.exe 0000000001237DAD Unknown Unknown Unknown 180:cesm.exe 0000000000D1B432 shr_abort_modmp 114 shr_abort_mod.F90 180:cesm.exe 0000000000503D97 abortutils_mp_end 43 abortutils.F90 180:cesm.exe 000000000050329C lnd_import_export 419 lnd_import_export.F90 180:cesm.exe 00000000004F9557 lnd_comp_mct_mp_l 457 lnd_comp_mct.F90 180:cesm.exe 0000000000430E14 component_modmp 688 component_mod.F90 180:cesm.exe 0000000000417D59 cime_comp_modmp 2652 cime_comp_mod.F90 180:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90 180:cesm.exe 0000000000415C5E Unknown Unknown Unknown 180:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown 180:cesm.exe 0000000000415B69 Unknown Unknown Unknown 180:MPT ERROR: Rank 180(g:180) is aborting with error code 1001. 180: Process ID: 70276, Host: r2i2n9, Program: /glade2/scratch2/jkshuman/Fire_Grass_1x1_2dba074_f8d7693/bld/cesm.exe 180: MPT Version: SGI MPT 2.15 12/18/16 02:58:06

/glade2/scratch2/jkshuman/Fire0507_Obrienh_Saldaa_Saldal_2PFT_1x1_2dba074_f8d7693/run

WARNING:: BalanceCheck, solar radiation balance error (W/m2) 330: nstep = 96938 330: errsol = -1.427682150278997E-007 529:Image PC Routine Line Source
529:cesm.exe 0000000001237DAD Unknown Unknown Unknown 529:cesm.exe 0000000000D1B432 shr_abort_modmp 114 shr_abort_mod.F90 529:cesm.exe 0000000000503CD5 abortutils_mp_end 77 abortutils.F90 529:cesm.exe 0000000000677E2D balancecheckmod_m 543 BalanceCheckMod.F90 529:cesm.exe 000000000050AF77 clm_driver_mp_clm 924 clm_driver.F90 529:cesm.exe 00000000004F9516 lnd_comp_mct_mp_l 451 lnd_comp_mct.F90 529:cesm.exe 0000000000430E14 component_modmp 688 component_mod.F90 529:cesm.exe 0000000000417D59 cime_comp_modmp 2652 cime_comp_mod.F90 529:cesm.exe 0000000000430B3D MAIN__ 68 cime_driver.F90 529:cesm.exe 0000000000415C5E Unknown Unknown Unknown 529:libc-2.19.so 00002AAAB190AB25 __libc_start_main Unknown Unknown 529:cesm.exe 0000000000415B69 Unknown Unknown Unknown 529:MPT ERROR: Rank 529(g:529) is aborting with error code 1001. 529: Process ID: 47973, Host: r5i4n34, Program: /glade2/scratch2/jkshuman/Fire0507_Obrienh_Saldaa_Saldal_2PFT_1x1_2dba074_f8d7693/bld/cesm.exe 529: MPT Version: SGI MPT 2.15 12/18/16 02:58:06 529: 529:MPT: --------stack traceback------- 0: memory_write: model date = 60715 0 memory = 129228.42 MB (highwater) 102.11 MB (usage) (pe= 0 comps= ATM ESP) 529:MPT: Attaching to program: /proc/47973/exe, process 47973 529:MPT: done.

529: gridcell longitude = 290.000000000000
529: gridcell latitude = -15.5497382198953

rgknox commented 6 years ago

@jkshuman, can you provide a link to the branch you are using? I can't find hash f8d7693.

jkshuman commented 6 years ago

It is a merge between the memory leak commit and my added crown area history field. Here is a link, but it may not have the memory leak commit - I don't recall if I pushed those changes. Cheyenne is still down, so I can't update at the moment. https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft_sync

jkshuman commented 6 years ago

Cheyenne is still down, so I am putting the link to my crown area history variable branch in this issue as well. The failing runs were on a merge branch created from the master branch #372 memory leak fix and my crown area branch (link below). https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft

jkshuman commented 6 years ago

I updated the sync branch with the failing branch code. https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft_sync

rosiealice commented 6 years ago

Did you try the run with just the new master branch? That way we can see whether the issues are caused by stuff on the branch.


jkshuman commented 6 years ago

Running 1 PFT grass, 1 PFT trop tree, and 2 PFT runs, all with fire, on CLM4.5 (paths below). A new set of runs is being created with this branch (crown area history merged with the #379 canopy photosynthesis fix): https://github.com/jkshuman/fates/tree/hio_crownarea_si_pft_379canopy_photo_fix

./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported

/glade2/scratch2/jkshuman/Fire_Grass_1x1_2dba074_5dda57b /glade2/scratch2/jkshuman/Fire_Obrien_Salda_TropTree_1x1_2dba074_5dda57b /glade2/scratch2/jkshuman/Fire_Obrienh_Saldaa_Saldal_2PFT_1x1_2dba074_5dda57b

jkshuman commented 6 years ago

The crown area stuff is just a history variable, so it is unlikely to cause this failure? But I can run with master to test that as well.



rgknox commented 6 years ago

looks like my single site run at:

gridcell longitude = 290.000000000000 gridcell latitude = -15.5497382198953

did not generate the error after 30 years.

I will try to look through and see if I added some configuration that was different.

Run directory:

/glade2/scratch2/rgknox/jkstest-1pt-v0/run

Uses this parameter file:

/glade/u/home/rgknox/param_file_2PFT_Obrienh_Saldaa_Saldal_05042018.nc

jkshuman commented 6 years ago

This was with fire for CLM4.5?

rgknox commented 6 years ago

I noticed this in the parameter file:

fates_leaf_xl = 0.1, 0.1, -0.3

This may be fine; it just caught my eye. xl is the leaf orientation index, which I think I recall allows negatives, but we should double-check whether our formulation does.
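
For what it's worth, the CLM-style two-stream scheme does allow negative orientation indices but constrains them before use; per the CLM technical note, the valid Sellers (1985) range is -0.4 to 0.6. A small illustrative sketch of that clamp (this is not the FATES code path):

program check_xl
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: xl, chil, phi1, phi2

  xl   = -0.3_r8                          ! grass value from the parameter file above
  chil = min( max(xl, -0.4_r8), 0.6_r8 )  ! clamp to the two-stream validity range
  phi1 = 0.5_r8 - 0.633_r8*chil - 0.330_r8*chil*chil
  phi2 = 0.877_r8 * (1._r8 - 2._r8*phi1)
  print *, 'chil, phi1, phi2 =', chil, phi1, phi2
end program check_xl

By that clamp, -0.3 sits inside the valid range, consistent with the follow-up below.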

rgknox commented 6 years ago

yeah, that parameter seems fine, false alarm

jkshuman commented 6 years ago

My runs are a 1-degree regional subset for South America. Surface and domain files are here:

/glade2/scratch2/jkshuman/sfcdata



rgknox commented 6 years ago

ok, thanks. A new single-site run on Cheyenne is going, now using SPITFIRE.

My current guess as to what is happening is that we are running into a problem with near-zero biomass or leaves, which is the product of fire turning over an all-grass patch. It's possible the recent bug fix addressed this, but we will see.

jkshuman commented 6 years ago

@rgknox another set of runs is going with pull request #382. The 1 PFT runs with fire are still going (tree at year 21, grass at year 2 - slow in the queue?). The 2 PFT run (trop tree and grass) failed in year 6 with a similar set of errors: BalanceCheckMod.F90 line 543, BalanceCheck, solar radiation balance error. /glade/scratch/jkshuman/archive/Fire_Obrienh_Saldaa_Saldal_2PFT_SA1x1_2dba074_0f0c41c/ New location: gridcell longitude = 305.000000000000
gridcell latitude = -23.0890052356021

From cesm.log WARNING:: BalanceCheck, solar radiation balance error (W/m2) 235: nstep = 119564 235: errsol = -1.108547849071329E-007 252: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 252: nstep = 119565 252: errsol = -1.065200194716454E-007 0: memory_write: model date = 71029 0 memory = 128919.57 MB (highwater) 101.85 MB (usage) (pe= 0 comps= ATM ESP) 467: trimming patch area - is too big 1.818989403545856E-012 545: trimming patch area - is too big 1.818989403545856E-012 353: trimming patch area - is too big 1.818989403545856E-012 390: trimming patch area - is too big 1.818989403545856E-012 513: trimming patch area - is too big 1.818989403545856E-012 506: trimming patch area - is too big 1.818989403545856E-012 535: trimming patch area - is too big 1.818989403545856E-012 446: trimming patch area - is too big 1.818989403545856E-012 469: trimming patch area - is too big 1.818989403545856E-012 477: trimming patch area - is too big 1.818989403545856E-012 326: trimming patch area - is too big 1.818989403545856E-012 403: trimming patch area - is too big 1.818989403545856E-012 69: trimming patch area - is too big 1.818989403545856E-012 239: trimming patch area - is too big 1.818989403545856E-012 70: trimming patch area - is too big 1.818989403545856E-012 218: trimming patch area - is too big 1.818989403545856E-012 257: trimming patch area - is too big 1.818989403545856E-012 75: trimming patch area - is too big 1.818989403545856E-012 330: trimming patch area - is too big 1.818989403545856E-012 170: trimming patch area - is too big 1.818989403545856E-012 200: trimming patch area - is too big 1.818989403545856E-012 198: trimming patch area - is too big 1.818989403545856E-012 255: trimming patch area - is too big 1.818989403545856E-012 80: trimming patch area - is too big 1.818989403545856E-012 219: trimming patch area - is too big 1.818989403545856E-012 118: trimming patch area - is too big 1.818989403545856E-012 119: trimming patch area - is too big 1.818989403545856E-012 202: >5% Dif Radn consvn error -1.05825538715178 1 2 202: diags 7.96359955072742 -54.6696896639910 38.3301532002546
202: lai_change 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 202: elai 0.796415587611356 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.234465085324267
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: esai 9.096157657329497E-002 0.000000000000000E+000 3.849099849370675E-002 202: 0.000000000000000E+000 0.000000000000000E+000 3.849099849370675E-002 202: 0.000000000000000E+000 0.000000000000000E+000 9.398288976575598E-003 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: ftweight 1.267302001703947E-002 0.000000000000000E+000 202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000 202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000 202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 202: cp 6.405767903805394E-010 1 202: bc_in(s)%albgr_dif_rb(ib) 0.190858817093915
202: rhol 0.100000001490116 0.100000001490116 0.100000001490116
202: 0.449999988079071 0.449999988079071 0.349999994039536
202: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000 202: 0.000000000000000E+000 202: present 1 0 0 202: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000 331: Large Dir Radn consvn error 87300236774.1395 1 2 331: diags 35545013833.8197 -1.718567028306606E-002 -793747809365.306
331: 496278040697.993
331: lai_change 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 331: elai 0.776682425289442 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.227539226615268
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: esai 9.093202219977818E-002 0.000000000000000E+000 3.843043064440077E-002 331: 0.000000000000000E+000 0.000000000000000E+000 3.843043064440077E-002 331: 0.000000000000000E+000 0.000000000000000E+000 9.101385150350671E-003 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: ftweight 0.143517787345916 0.000000000000000E+000 331: 0.856482212654084 0.000000000000000E+000 0.000000000000000E+000 331: 0.856482212654084 0.000000000000000E+000 0.000000000000000E+000 331: 0.856482212654084 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 331: cp 2.006325586387992E-009 1 331: bc_in(s)%albgr_dir_rb(ib) 0.220000000000000
331: dif ground absorption error 1 1 -2.968510966153521E+017 331: -2.968510966153521E+017 2 2 1.00000000000000
331: >5% Dif Radn consvn error 4.270016056591235E+016 1 2 331: diags 1.669646990961853E+016 -3.805783289940412E+017 2.374544661398212E+017 331: lai_change 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 331: elai 0.776682425289442 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.961569569355599
331: 0.000000000000000E+000 0.000000000000000E+000 0.227539226615268
331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: esai 9.093202219977818E-002 0.000000000000000E+000 3.843043064440077E-002 331: 0.000000000000000E+000 0.000000000000000E+000 3.843043064440077E-002 331: 0.000000000000000E+000 0.000000000000000E+000 9.101385150350671E-003 331: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 331: ftweight 7.801052745940848E-002 0.000000000000000E+000 331: 143.470563918829 0.000000000000000E+000 0.000000000000000E+000 331: 143.470563918829 0.000000000000000E+000 0.000000000000000E+000 331: 143.470563918829 0.000000000000000E+000 0.000000000000000E+000 331: 0.000000000000000E+000 331: cp 2.006325586387992E-009 1 331: bc_in(s)%albgr_dif_rb(ib) 0.220000000000000
331: rhol 0.100000001490116 0.100000001490116 0.100000001490116
331: 0.449999988079071 0.449999988079071 0.349999994039536
331: ftw 1.00000000000000 0.143517787345916 0.000000000000000E+000 331: 0.856482212654084
331: present 1 0 1 331: CAP 0.143517787345916 0.000000000000000E+000 0.856482212654084
331: there is still error after correction 1.00000000000000 1 331: 2 202: >5% Dif Radn consvn error -1.07307654594231 1 2 202: diags 8.03407121904317 -55.1147964199711 38.6409503555679
202: lai_change 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 202: elai 0.796415587611356 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.961509001506293
202: 0.000000000000000E+000 0.000000000000000E+000 0.234465085324267
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: esai 9.096157657329497E-002 0.000000000000000E+000 3.849099849370675E-002 202: 0.000000000000000E+000 0.000000000000000E+000 3.849099849370675E-002 202: 0.000000000000000E+000 0.000000000000000E+000 9.398288976575598E-003 202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000 202: ftweight 1.267302001703947E-002 0.000000000000000E+000 202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000 202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000 202: 29.1624152220974 0.000000000000000E+000 0.000000000000000E+000 202: 0.000000000000000E+000 202: cp 6.405767903805394E-010 1 202: bc_in(s)%albgr_dif_rb(ib) 0.190744628923151
202: rhol 0.100000001490116 0.100000001490116 0.100000001490116
202: 0.449999988079071 0.449999988079071 0.349999994039536
202: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000 202: 0.000000000000000E+000 202: present 1 0 0 202: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000 331: energy balance in canopy 26844 , err= -11.9593662381158
331: WARNING:: BalanceCheck, solar radiation balance error (W/m2) 331: nstep = 119588 331: errsol = -1323.30638249407
331: clm model is stopping - error is greater than 1e-5 (W/m2) 331: fsa = -7.745702732785249E+017 331: fsr = 7.745702732785236E+017 331: forc_solad(1) = 5.51145480639649
331: forc_solad(2) = 8.61256572561393
331: forc_solai(1) = 16.1417364406403
331: forc_solai(2) = 13.0406255214228
331: forc_tot = 43.3063824940735
331: clm model is stopping 331: calling getglobalwrite with decomp_index= 26844 and clmlevel= pft 331: local patch index = 26844 331: global patch index = 9516 331: global column index = 4795 331: global landunit index = 1267 331: global gridcell index = 296 331: gridcell longitude = 305.000000000000
331: gridcell latitude = -23.0890052356021
331: pft type = 1 331: column type = 1 331: landunit type = 1 331: ENDRUN: 331: ERROR in BalanceCheckMod.F90 at line 543
331:
331:

rosiealice commented 6 years ago

I feel like ftweight should not ever be >1, but here it's like 93, 143, etc. I've got a bunch of slides to do for tomorrow morning still, but that's the thing that strikes me most about this. Maybe worth checking the ftweight calculations...


rgknox commented 6 years ago

Agreed @rosiealice - whatever is wrong seems to be mediated by ftweight.

rgknox commented 6 years ago

I will try to reproduce errors in that last post.

@jkshuman , could you post your create_case execution and any environment modifiers?

relevant parameters:

fates_paramfile = '/glade/p/work/jkshuman/FATES_data/parameter_files/param_file_2PFT_Obrienh_Saldaa_Saldal_05072018.nc'
 use_fates = .true.
 use_fates_ed_prescribed_phys = .false.
 use_fates_ed_st3 = .false.
 use_fates_inventory_init = .false.
 use_fates_logging = .false.
 use_fates_planthydro = .false.
 use_fates_spitfire = .true.
fsurdat = '/glade/scratch/jkshuman/sfcdata/surfdata_0.9x1.25_16pfts_Irrig_CMIP6_simyr2000_SA.nc'

jkshuman commented 6 years ago

OK, I have it down to days. It seems to be hung up, but I will restart from this case in debug mode and take a close look at ftweight. Going to use the 2PFT case, as the 1 PFT trop tree run made it out to 51 years with fire; this seems to be a grass-and-fire issue. But I may try the grass single PFT as well... /glade2/scratch2/jkshuman/archive/Fire_Obrienh_Saldaa_Saldal_2PFT_SA1x1_2dba074_0f0c41c/

/glade2/scratch2/jkshuman/archive/Fire_Grass_SA_1x1_2dba074_0f0c41c/

jkshuman commented 6 years ago

path to restart files for 2PFT case: /glade/scratch/jkshuman/archive/Fire_Obrienh_Saldaa_Saldal_2PFT_SA1x1_2dba074_0f0c41c/rest

path to my script for creating the case, and relevant params below: /glade/p/work/jkshuman/FATES_data/case_fire_TreeGrass_tropics

./create_newcase --case ${casedir}${CASE_NAME} --res f09_f09 --compset 2000_DATM%GSWP3v1_CLM45%FATES_SICE_SOCN_RTM_SGLC_SWAV --run-unsupported
./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=1
./xmlchange REST_OPTION=ndays
./xmlchange RESUBMIT=50

./xmlchange JOB_WALLCLOCK_TIME=1:00

./xmlchange DATM_MODE=CLMGSWP3v1
./xmlchange DATM_CLMNCEP_YR_ALIGN=1985
./xmlchange DATM_CLMNCEP_YR_START=1985
./xmlchange DATM_CLMNCEP_YR_END=2004

./xmlchange RTM_MODE=NULL
./xmlchange ATM_DOMAIN_FILE=domain.lnd.fv0.9x1.25_gx1v6.SA.nc
./xmlchange ATM_DOMAIN_PATH=/glade/scratch/jkshuman/sfcdata
./xmlchange LND_DOMAIN_FILE=domain.lnd.fv0.9x1.25_gx1v6.SA.nc
./xmlchange LND_DOMAIN_PATH=/glade/scratch/jkshuman/sfcdata
./xmlchange CLM_USRDAT_NAME=SAmerica

./xmlchange NTASKS_ATM=-1
./xmlchange NTASKS_CPL=-15
./xmlchange NTASKS_GLC=-15
./xmlchange NTASKS_OCN=-15
./xmlchange NTASKS_WAV=-15
./xmlchange NTASKS_ICE=-15
./xmlchange NTASKS_LND=-15
./xmlchange NTASKS_ROF=-15
./xmlchange NTASKS_ESP=-15

jkshuman commented 6 years ago

Relevant parameters in user_nl_clm are as you have them listed above.

rosiealice commented 6 years ago

I think we need to look at why ftweight is >1. ftweight is the same as canopy_area_profile, which is set on: https://github.com/NGEET/fates/blob/e522527035c0061f0d31c265e4ccc4dc94b7d3cb/biogeochem/EDCanopyStructureMod.F90#L1337

I'd put a write statement there to catch anything going over 1... (or a slightly bigger number, so we don't get all these 10^-12 edge cases), and then print out the c_area, total_canopy_area, etc. if that happens. If you've got the runs down to days, it shouldn't take long to find the problem there. I'd be quite surprised if ftweight wasn't the culprit.
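Something like the following, for instance (a minimal sketch of the suggested write statement; the variable names are assumed from the linked EDCanopyStructureMod.F90 context, and the 1.0001 tolerance is just one way to skip the 1e-12 edge cases):

if ( sum(currentPatch%canopy_area_profile(cl,:,iv)) > 1.0001_r8 ) then
   ! flag any layer whose summed canopy area fraction meaningfully exceeds 1,
   ! then dump the quantities that feed it
   write(fates_log(),*) 'canopy_area_profile exceeds 1: cl = ', cl, ' iv = ', iv
   write(fates_log(),*) ' c_area            = ', currentCohort%c_area
   write(fates_log(),*) ' total_canopy_area = ', currentPatch%total_canopy_area
   write(fates_log(),*) ' patch area        = ', currentPatch%area
end if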

rgknox commented 6 years ago

So I was able to trigger an error using just cell -20.09N 305E, and your 2PFT case. The fail happens on April 17th of the 7th year.

FATES Dynamics:    7-04-17
0:forrtl: error (73): floating divide by zero
0:Image              PC                Routine            Line        Source             
0:cesm.exe           0000000003E1CF91  Unknown               Unknown  Unknown
0:cesm.exe           0000000003E1B0CB  Unknown               Unknown  Unknown
0:cesm.exe           0000000003DCCBC4  Unknown               Unknown  Unknown
0:cesm.exe           0000000003DCC9D6  Unknown               Unknown  Unknown
0:cesm.exe           0000000003D4C4B9  Unknown               Unknown  Unknown
0:cesm.exe           0000000003D58AE9  Unknown               Unknown  Unknown
0:libpthread-2.19.s  00002AAAAFAC1870  Unknown               Unknown  Unknown
0:cesm.exe           0000000002B8581B  dynpatchstateupda         189  dynPatchStateUpdaterMod.F90
0:cesm.exe           0000000000A1CCCC  dynsubgriddriverm         284  dynSubgridDriverMod.F90
0:cesm.exe           000000000087E555  clm_driver_mp_clm         306  clm_driver.F90
0:cesm.exe           000000000084B5B9  lnd_comp_mct_mp_l         451  lnd_comp_mct.F90
0:cesm.exe           000000000046BD2D  component_mod_mp_         688  component_mod.F90
0:cesm.exe           000000000043C474  cime_comp_mod_mp_        2652  cime_comp_mod.F90
0:cesm.exe           00000000004543B7  MAIN__                     68  cime_driver.F90
0:cesm.exe           0000000000415A5E  Unknown               Unknown  Unknown
0:libc-2.19.so       00002AAAB190AB25  __libc_start_main     Unknown  Unknown
0:cesm.exe           0000000000415969  Unknown               Unknown  Unknown
-1:MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
-1: aborting job
MPT: Received signal 6
jkshuman commented 6 years ago

That’s interesting. My run with the rest option set to days is still going, into month 9, day 18, last I checked...

Progress


jkshuman commented 6 years ago

Got it to the day of failure (October 30, year 7). Will kick it off in debug mode to see if I get the same error as you did @rgknox (similar error as previous, and same location: lon = 305, lat = -23.089). From cesm.log:

bc_in(s)%albgr_dif_rb(ib) 0.220000000000000
331: rhol 0.100000001490116 0.100000001490116 0.100000001490116
331: 0.449999988079071 0.449999988079071 0.349999994039536
331: ftw 1.00000000000000 0.143517787251814 0.000000000000000E+000
331: 0.856482212748186
331: present 1 0 1
331: CAP 0.143517787251814 0.000000000000000E+000 0.856482212748186
331: there is still error after correction 1.00000000000000 1
331: 2
202: >5% Dif Radn consvn error -1.07341422635010 1 2
202: diags 8.03574910457470 -55.1258110560189 38.6485853190346
202: lai_change 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: elai 0.796415126488024 0.000000000000000E+000 0.961509014797645
202: 0.000000000000000E+000 0.000000000000000E+000 0.961509014797645
202: 0.000000000000000E+000 0.000000000000000E+000 0.234466930897031
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: esai 9.096157669642455E-002 0.000000000000000E+000 3.849098520235514E-002
202: 0.000000000000000E+000 0.000000000000000E+000 3.849098520235514E-002
202: 0.000000000000000E+000 0.000000000000000E+000 9.398356961483976E-003
202: 0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
202: ftweight 1.267295049486910E-002 0.000000000000000E+000
202: 29.1628509591272 0.000000000000000E+000 0.000000000000000E+000
202: 29.1628509591272 0.000000000000000E+000 0.000000000000000E+000
202: 29.1628509591272 0.000000000000000E+000 0.000000000000000E+000
202: 0.000000000000000E+000
202: cp 6.410821458268472E-010 1
202: bc_in(s)%albgr_dif_rb(ib) 0.190743513017422
202: rhol 0.100000001490116 0.100000001490116 0.100000001490116
202: 0.449999988079071 0.449999988079071 0.349999994039536
202: ftw 1.00000000000000 1.00000000000000 0.000000000000000E+000
202: 0.000000000000000E+000
202: present 1 0 0
202: CAP 1.00000000000000 0.000000000000000E+000 0.000000000000000E+000
331: energy balance in canopy 26844 , err= -11.9601284804630
331: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
331: nstep = 119588
331: errsol = 724.693617505926
331: clm model is stopping - error is greater than 1e-5 (W/m2)
331: fsa = -7.745702333124070E+017
331: fsr = 7.745702333124078E+017
331: forc_solad(1) = 5.51145480639649
331: forc_solad(2) = 8.61256572561393
331: forc_solai(1) = 16.1417364406403
331: forc_solai(2) = 13.0406255214228
331: forc_tot = 43.3063824940735
331: clm model is stopping
331: calling getglobalwrite with decomp_index= 26844 and clmlevel= pft
331: local patch index = 26844
331: global patch index = 9516
331: global column index = 4795
331: global landunit index = 1267
331: global gridcell index = 296
331: gridcell longitude = 305.000000000000
331: gridcell latitude = -23.0890052356021
331: pft type = 1
331: column type = 1
331: landunit type = 1
331: ENDRUN:
331: ERROR in BalanceCheckMod.F90 at line 543

rgknox commented 6 years ago

Here is a print message at the time of the failure; this is from subroutine set_new_weights() in dynPatchStateUpdaterMod.F90.

The problem is triggered because, from the second-to-last step to the last, that bare-ground patch goes to a weight of zero, and somehow its old (previous) weight was negative?

print*,bounds%begp,bounds%endp,p,this%pwtgcell_old(p),this%pwtgcell_new(p)

0:           1          32           3  0.998904682346343     0.998904682346344     
0:           1          32           3  0.998904682346344     0.998904682346344     
0:           1          32           3  0.998904682346344     0.998904682346344     
0:           1          32           1 -2.218013955499719E-016  0.000000000000000E+000
 subroutine set_new_weights(this, bounds)
    !                                                                                                                                                                                        
    ! !DESCRIPTION:                                                                                                                                                                          
    ! Set subgrid weights after dyn subgrid updates                                                                                                                                          
    !                                                                                                                                                                                        
    ! !USES:                                                                                                                                                                                 
    !                                                                                                                                                                                        
    ! !ARGUMENTS:                                                                                                                                                                            
    class(patch_state_updater_type), intent(inout) :: this
    type(bounds_type), intent(in) :: bounds
    !                                                                                                                                                                                        
    ! !LOCAL VARIABLES:                                                                                                                                                                      
    integer :: p

    character(len=*), parameter :: subname = 'set_new_weights'
    !-----------------------------------------------------------------------                                                                                                                 

    do p = bounds%begp, bounds%endp
       this%pwtgcell_new(p) = patch%wtgcell(p)
       this%dwt(p) = this%pwtgcell_new(p) - this%pwtgcell_old(p)
       if (this%dwt(p) > 0._r8) then
          print*,bounds%begp,bounds%endp,p,this%pwtgcell_old(p),this%pwtgcell_new(p)
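          ! NOTE (annotation): if pwtgcell_old(p) is slightly negative and
          ! pwtgcell_new(p) is exactly zero, then dwt(p) > 0 and the two
          ! divisions below are divisions by zero; this matches the
          ! dynPatchStateUpdaterMod.F90 line 189 traceback above.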
          this%growing_old_fraction(p) = this%pwtgcell_old(p) / this%pwtgcell_new(p)
          this%growing_new_fraction(p) = this%dwt(p) / this%pwtgcell_new(p)
       else
          ! These values are unused in this case, but set them to something reasonable for                                                                                                   
          ! safety. (We could set them to NaN, but that requires a more expensive                                                                                                            
          ! subroutine call, using the shr_infnan_mod infrastructure.)                                                                                                                       
          this%growing_old_fraction(p) = 1._r8
          this%growing_new_fraction(p) = 0._r8
       end if
    end do

  end subroutine set_new_weights
rgknox commented 6 years ago

The interface call wrap_update_hlmfates_dyn(), in clmfates_interfaceMod.F90, is responsible for calculating these weights.

We sum up the canopy fractions, via this output boundary condition:

this%fates(nc)%bc_out(s)%canopy_fraction_pa(1:npatch)

But if this sum is above 1, which it shouldn't be, we will have problems, and calculate a negative bare-patch size. Somehow that is happening in this run. I put a break-point where this endrun used to be:

https://github.com/ESCOMP/ctsm/blob/master/src/utils/clmfates_interfaceMod.F90#L830
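For reference, the arithmetic in question reduces to something like this (a sketch; names other than the boundary condition above are assumptions, not the exact interface code):

! sum of the FATES patch canopy fractions passed to the HLM
total_canopy_frac = sum(this%fates(nc)%bc_out(s)%canopy_fraction_pa(1:npatch))
! the bare-ground patch is assigned the remainder; if the sum exceeds 1,
! this goes negative, and set_new_weights() later divides by a zero new weight
bare_frac = 1.0_r8 - total_canopy_frac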

rgknox commented 6 years ago

I think one bug is that we are not zeroing out bc_out(s)%canopy_fraction_pa(1:npatch) in the subroutine that fills it, update_hlm_dynamics(). So if we shrink the total number of patches, we have an extra stale index contributing to the total patch area. I will test this.
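The fix would look something like this (a sketch, assuming the fill site in update_hlm_dynamics()):

! zero the full output array before refilling, so a stale entry left over
! from a previously larger patch count cannot contribute to the total
bc_out(s)%canopy_fraction_pa(:) = 0._r8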

rgknox commented 6 years ago

Actually, that probably wasn't the problem... although zeroing would have been better, we should only be passing the used indexes in that array...

rosiealice commented 6 years ago

Are we sure that the bug is fire-specific? Has it shown up in any non-fire runs @jkshuman? If it is fire, my suspicion would be that it has to do with how the model handles completely burned patches.


jkshuman commented 6 years ago

I have been focusing on the fire runs. With the updates to master and continued testing, the fail still occurs for grass and for tree/grass runs with fire. I had a tree fire run which completed through year 51 with reasonable biomass. My 2PFT debug fire run is still in the queue, so no update there.

With grass the difference is that when it burns, it burns completely. So this could be a response to grass flammability specifically and, as @rosiealice said, to completely burned patches.

rgknox commented 6 years ago

For the problem I'm currently working through (which may or may not be related to what is ultimately killing Jackie's runs), one issue is that total_canopy_area is exceeding patch area. We currently don't force total_canopy_area to be less than or equal to patch area.
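The missing constraint would be something along these lines (a sketch; placement and names assumed):

! clamp the summed crown area so it cannot exceed the patch footprint
currentPatch%total_canopy_area = min(currentPatch%total_canopy_area, currentPatch%area)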

I'm also noticing that when we do canopy promotion/demotion, we have a fairly relaxed tolerance on layer area exceedance of patch area: 1e-4.

I'm wondering if grasses give the canopy demotion/promotion scheme a particularly challenging time with layering. Maybe in this specific case we are left with a not-so-precise canopy area, which is creating weirdness?

rgknox commented 6 years ago

Here is an error log that I think corroborates the ftweight issue. During leaf_area_profile(), we construct several canopy-layer x pft x leaf-layer arrays. cpatch%canopy_area_profile(cl,ft,iv) is converted directly into ftweight. We have a few checks in the scheme, which can be switched on, one of which fails gracefully if canopy_area_profile exceeds 1.0 for any given layer.

FATES: A canopy_area_profile exceeded 1.0
 cl:            1
 iv:            1
 sum(cpatch%canopy_area_profile(cl,:,iv)):    1.65653669059244     
 FATES: cohorts in layer cl =            1  0.376936443831203     
  7.401777278905496E-009  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           3  0.274264111110705     
 FATES: cohorts in layer cl =            1   4.47710468466018     
  1.069014260600514E-009  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1  3.961106027654241E-002
 FATES: cohorts in layer cl =            1   4.79421520149869     
  5.313109854499176E-010  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1  1.968710076741488E-002
 FATES: cohorts in layer cl =            1   5.13024998876371     
  6.459332537834644E-010  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1  2.393429348254634E-002
 FATES: cohorts in layer cl =            1   5.79933797252383     
  3.505819861862652E-008  2.698777192878076E-008  2.698777192878076E-008
 ED: fracarea           1   1.29904012495523     

In this case, we have a few cohorts contributing crown area to the offending layer, layer 1. Layer 1 is also the top layer, and it should be assumed there is an understory layer as well. The cohorts appear to be normal: no NaNs, no garbage values... It is a small patch in terms of area, and it has a combination of PFT 1 and PFT 3 in that layer.

Note that the area fraction of the last cohort is 130% of the area. I'm not sure why the other cohorts are sharing the top layer (cl==1) with it, if this cohort, which is the largest, is filling that layer completely. This is particularly strange/wrong because we have grasses sharing that layer with a couple of 5 cm cohorts.

I'm wondering if this is a precision problem, as indicated in a post above. The area of this patch is very small, but large enough to keep. The promotion/demotion precision, though, is about 4 orders of magnitude larger than the size of the patch...

jkshuman commented 6 years ago

New runs using 1) rgknox promotion/demotion updates (PR #388), 2) updated API 4.0.0, and 3) updated CTSM changes. Two runs: one with CLM45 and one with CLM5, each with 2 PFTs (TropTree and Grass) and active fire.

clm45 completed to year 63 and still running, in queue at the moment. /glade2/scratch2/jkshuman/archive/Fire_rgknox_area_fixes_clm45_2PFT_1x1_692ba82_992e968/lnd/hist

clm5 failed in year 6 with an error in EDPatchDynamicsMod.F90 associated with high fire area and patch trimming. /glade2/scratch2/jkshuman/Fire_rgknox-area-fixes_2PFT_1x1_692ba82_992e968/run

From cesm.log (very high fire areas: 0.983208971507476 0.983208971507476):

413: Projected Canopy Area of all FATES patches
413: cannot exceed 1.0
517: trimming patch area - is too big 1.818989403545856E-012
570: trimming patch area - is too big 1.818989403545856E-012
533: trimming patch area - is too big 1.818989403545856E-012
110: trimming patch area - is too big 1.818989403545856E-012
110: patch area correction produced negative area 10000.0000000000
110: 1.818989403545856E-012 -4.939832763539551E-013
61: trimming patch area - is too big 1.818989403545856E-012
443: trimming patch area - is too big 1.818989403545856E-012
110: ENDRUN:
110: ERROR in EDPatchDynamicsMod.F90 at line 722
110:
110: ERROR: Unknown error submitted to shr_abort_abort.
431: Projected Canopy Area of all FATES patches
431: cannot exceed 1.0

rgknox commented 6 years ago

@jkshuman, that new fail is from an error check that I put into the branch you are currently testing.

What happened is that the model determined that the total patch area exceeded 10,000 m2, and so it simply removes the excess from one of its patches. We have been removing it from the oldest patch; however, up until now, we have never checked whether that patch actually has the area to donate.

This can be solved by removing the area from the largest patch, instead of the oldest patch.
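The gist of that fix would be something like this (a sketch only, not the actual commit; the patch-list pointer and area names are assumptions):

! walk the patch list and find the largest patch, the safest donor of area
largestPatch => currentSite%oldest_patch
currentPatch => currentSite%oldest_patch
do while (associated(currentPatch))
   if (currentPatch%area > largestPatch%area) largestPatch => currentPatch
   currentPatch => currentPatch%younger
end do
! remove the excess over the fixed 10,000 m2 site area (areatot is the
! summed patch area) from the largest patch instead of the oldest
largestPatch%area = largestPatch%area - (areatot - area_site)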

I will make a correction and update the branch.

rgknox commented 6 years ago

Updated the branch. Here is the change:

https://github.com/NGEET/fates/pull/388/commits/e85b681462529e20406a210a67e25325669cb1cf

@jkshuman, I will fire off some tests.

rgknox commented 6 years ago

Hold a moment before testing, though; it needs a quick tweak (I forgot to declare "nearzero").

rosiealice commented 6 years ago

Hi Ryan,

Thanks for this. Should we have a call, or hold off until the tests go?


rgknox commented 6 years ago

@jkshuman @rosiealice and I had a review and discussion of changes in PR #388. Added some updates to code per our discussion. @jkshuman I'm going to pass it through the regression tests now.

jkshuman commented 6 years ago

Revising this to correct my mistaken runs from earlier. Confirmed that the branch code pulled in the correct changes from the rgknox repo. Updated code with more rgknox-area-fixes (commit 658064e) and ctsm changes. Similar setup: CLM45 and CLM5 with active fire and 2 PFTs (trop tree and grass) for the South America region. CLM5 is successfully running into year 18, and still going... CLM45 is successfully running into year 20, and still going...

clm5: /glade/scratch/jkshuman/archive/Fire_rgknox_areafixes_0607_2PFT_1x1_fdce2b2_26542ea/
clm45: /glade/scratch/jkshuman/archive/Fire_rgknox_areafixes_0607_clm45_2PFT_1x1_fdce2b2_26542ea/

jkshuman commented 6 years ago

Runs are up to year 92 for clm5 and year 98 for clm45. I am going to call this closed, and will open a new issue if anything else comes up, as the code has diverged since opening this... To summarize: fixes included pull requests PR #382 and PR #388, plus @rgknox fixes in repo branches for fates and ctsm: ctsm branch from rgknox_ctsm_repo-protectbaresoilfrac, fates branch from rgknox-area-fix merged with master sci.1.14.0_api.4.0.0.

branch details for ctsm and fates below.

fates git log details:

26542ea (HEAD, rgknox-areafix-0607_api4.0.0) Merge branch 'rgknox-area-fixes' into rgknox-areafix-0607_api4.0.0
ce689da (rgknox-area-fixes) Merge branch 'rgknox-area-fixes' of https://github.com/rgknox/fates into rgknox-area-fixes
658064e (rgknox_repo/rgknox-area-fixes) Updated some comments, added back protections on patch canopy areas exceeding 1 during the output boundary condition preparations.
c357399 Merge branch 'rgknox-area-fixes' of github.com:rgknox/fates into rgknox-area-fixes
e85b681 Fixed area checking logic on their sum to 10k
0f2003b Merge remote-tracking branch 'rgknox_repo/rgknox-area-fixes' into rgknox-area-fixes
34bfcdb Resolved conflict in EDCanopyStructureMod, used HEAD over master
5e92e69 (master) Merge remote-tracking branch 'ngeet_repo/master'
14aeb4f (tag: sci.1.14.0_api.4.0.0, ngeet_repo/master) Merge pull request #381 from rgknox/rgknox-soildepth-clm5

ctsm git log details:

fdce2b2 (HEAD, rgknox_ctsm_repo/rgknox-fates-protectbaresoilfrac, rgknox-fates-protectbaresoilfrac, fates_next_api_rgknox_protectbaresoilfrac) Protected fates calculation of bare-soil area to not go below 0
692ba82 (origin/fates_next_api, fates_next_api) Merge pull request #375 from rgknox/rgknox-fates-varsoildepth
1cdd0e6 Merge pull request #390 from ckoven/fateshistdims
8eb90b1 (rgknox_ctsm_repo/rgknox-fates-varsoildepth) Changed a 1.0 r4 to r8
e9b7b68 Updating fates external to sci.1.14.0_api.4.0.0

rosiealice commented 6 years ago

Great!
