ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

New CN matrix fails with single point sites with the new ctsm5.3 datasets. #2780

Open ekluzek opened 2 months ago

ekluzek commented 2 months ago

Since ctsm5.2.dev175 through ctsm5.3.0 we've been running a test with MIMICS and the above-ground CN matrix that had been passing. The test is SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn. It has the soil CN matrix off (because MIMICS is non-linear) but the above-ground CN matrix on (use_soil_matrixcn = .false., use_matrixcn = .true.).
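
In user_nl_clm terms, that combination is just the following (a sketch of the two settings named above; the clm-mimics_matrixcn testmod also turns on MIMICS itself):

    use_matrixcn = .true.        ! above-ground (vegetation) CN matrix on
    use_soil_matrixcn = .false.  ! soil CN matrix off; MIMICS is non-linear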

There are two reasons for doing this test:

  1. Hopefully get MIMICS to spin up faster with the above-ground matrix on
  2. More extensive testing of the matrix code for an edge case where it might fail more easily

The hope for reason 1 was especially strong, as we hadn't found other methods to speed up the MIMICS spinup. The test passed for 30 tags and only started failing in ctsm5.3.0, with the following type of error in the log files:

lnd.log:

 hist_htapes_wrapup : Closing local history file ./SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn.20240923_125029_ialh14.clm2.h1.0001-01-01-28800.nc at nstep =           16

(shr_strdata_readstrm) reading file ub: /glade/campaign/cesm/cesmdata/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_c160825.nc       7
 ERROR: ERROR in /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90 at line 1246

cesm.log:

dec0996.hsn.de.hpc.ucar.edu 0:  ERROR: ERROR in /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90 at line 1246
dec0996.hsn.de.hpc.ucar.edu 0: #0  0x12c3b50 in __shr_abort_mod_MOD_shr_abort_backtrace
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_abort_mod.F90:104
dec0996.hsn.de.hpc.ucar.edu 0: #1  0x12c3c13 in __shr_abort_mod_MOD_shr_abort_abort
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_abort_mod.F90:61
dec0996.hsn.de.hpc.ucar.edu 0: #2  0x131f9c8 in __shr_assert_mod_MOD_shr_assert
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_assert_mod.F90.in:95
dec0996.hsn.de.hpc.ucar.edu 0: #3  0xe38814 in __sparsematrixmultiplymod_MOD_spmp_abc
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90:1246
dec0996.hsn.de.hpc.ucar.edu 0: #4  0x8e97db in __cnvegmatrixmod_MOD_cnvegmatrix
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNVegMatrixMod.F90:1509
dec0996.hsn.de.hpc.ucar.edu 0: #5  0x10466ef in __cndrivermod_MOD_cndriverleaching
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNDriverMod.F90:1098
dec0996.hsn.de.hpc.ucar.edu 0: #6  0x92a6b2 in __cnvegetationfacade_MOD_ecosystemdynamicspostdrainage
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNVegetationFacade.F90:1125
dec0996.hsn.de.hpc.ucar.edu 0: #7  0x5d7ed6 in __clm_driver_MOD_clm_drv
dec0996.hsn.de.hpc.ucar.edu 0:  at /glade/work/erik/ctsm_worktrees/answer_changes/src/main/clm_driver.F90:1119

The failing line in the traceback above is the SHR_ASSERT_FL in this section of SparseMatrixMultiplyMod.F90:

    if(present(num_actunit_C))then
       if(num_actunit_C < 0)then
          write(iulog,*) "error: num_actunit_C cannot be less than 0"
          call endrun( subname//" ERROR: bad value for num_actunit_C" )
          return
       end if
       if(.not. present(filter_actunit_C))then
          write(iulog,*) "error: num_actunit_C is presented but filter_actunit_C is missing"
          call endrun( subname//" ERROR: missing required optional arguments" )
          return
       end if
       SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
    end if
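
For context on the traceback: SHR_ASSERT_FL is a CPP macro from the share code's shr_assert.h, and when its condition is false it aborts through shr_assert and shr_abort (frames #2 through #0 in the cesm.log above). Roughly (a sketch, not the exact macro text; note these checks are typically active only in DEBUG builds):

    ! Approximate behavior of SHR_ASSERT_FL(cond, file, line):
    !   call shr_assert(cond, file=file, line=line)
    ! On a .false. condition, shr_assert writes
    !   "ERROR in <file> at line <line>"
    ! and aborts via shr_abort -- the message seen in lnd.log above.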

The call in CNVegMatrixMod.F90 is here:

         if(num_actfirep .eq. 0 .and. nthreads < 2)then
            call AKallvegc%SPMP_AB(num_soilp,filter_soilp,AKphvegc,AKgmvegc,list_ready_phgmc,list_A=list_phc_phgm,list_B=list_gmc_phgm,&
                 NE_AB=NE_AKallvegc,RI_AB=RI_AKallvegc,CI_AB=CI_AKallvegc)
         else
            call AKallvegc%SPMP_ABC(num_soilp,filter_soilp,AKphvegc,AKgmvegc,AKfivegc,list_ready_phgmfic,list_A=list_phc_phgmfi,&
                 list_B=list_gmc_phgmfi,list_C=list_fic_phgmfi,NE_ABC=NE_AKallvegc,RI_ABC=RI_AKallvegc,CI_ABC=CI_AKallvegc,&
                 use_actunit_list_C=.True.,num_actunit_C=num_actfirep,filter_actunit_C=filter_actfirep)
         end if
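
For readers unfamiliar with the pattern: num_actfirep and filter_actfirep follow CTSM's standard filter convention, in which the filter is an index array and only its first num_* entries are meaningful. A minimal standalone sketch of that convention (hypothetical values, not CTSM code):

    program filter_convention
      implicit none
      integer, parameter :: maxpatch = 5
      integer :: filter(maxpatch)  ! index array; may be allocated larger than needed
      integer :: num, p

      ! Build a filter of "active" patches (here, arbitrarily, the odd ones).
      num = 0
      do p = 1, maxpatch
         if (mod(p, 2) == 1) then
            num = num + 1
            filter(num) = p
         end if
      end do

      ! Consumers loop over only the first num entries:
      do p = 1, num
         print *, 'active patch index:', filter(p)
      end do
    end program filter_convention

Note that nothing in this convention requires the filter array to be allocated larger than the count, so size(filter) == num can occur when every element is active.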

Definition of done:

ekluzek commented 2 months ago

This is the only test we have for mimics_matrixcn. It's also possible that the tests that passed would fail if run out far enough.

ekluzek commented 2 months ago

Here's the note about this test when it was added.

https://github.com/ESCOMP/CTSM/pull/640#issuecomment-1074302305

I'm also doing some longer and different tests in ctsm5.2.028 to see whether the test just happened to pass because it was too short, as well as making sure the same test works without MIMICS.

ekluzek commented 2 months ago

Longer tests and tests at f10 in ctsm5.2.028 seem to be fine.

SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D_Lm1.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
SMS_Ly2.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn

So maybe there is something specific about the ctsm5.3.0 datasets here.

We'll mark this as an expected fail for now though.

ekluzek commented 2 months ago

The other test that fails in the same way is:

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn

slevis-lmwg commented 2 months ago

...and SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-default--clm-NEON-HARV--clm-matrixcnOn

slevis-lmwg commented 2 months ago

My gut feeling is that these tests need new finidat files, based on past experiences where CNmatrix has crashed with one finidat and not with another (#2592).

E.g. the nearest neighbor from the finidat may not contain the right pft combinations needed for these single-point simulations.

slevis-lmwg commented 2 months ago

In one of the failing tests, I changed finidat from ctsm52026_f09_pSASU.clm2.r.0421-01-01-00000.nc to clmi.f19_interp_from.I1850Clm50BgcCrop-ciso.1366-01-01.0.9x1.25_gx1v7_simyr1850_c240223.nc, and the test failed at a different timestep.

Next I want to try setting finidat to the interpolated file saved in .../tests_0923-141750de/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn.GC.0923-141750de_gnu/run/init_generated_files/. Hmm, but that may do nothing to help; I may need to generate a new finidat for this point starting from a cold-start simulation.
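
For reference, swapping the initial-conditions file is a one-line user_nl_clm change (a sketch; the full inputdata path is elided):

    finidat = '.../clmi.f19_interp_from.I1850Clm50BgcCrop-ciso.1366-01-01.0.9x1.25_gx1v7_simyr1850_c240223.nc'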

ekluzek commented 1 month ago

A broader question that we (@slevis-lmwg and I) would like the group to assess (discussed at the CTSM SE meeting on Oct 10, 2024):

wwieder commented 1 month ago

Maybe matrix tests always need to start from a cold start? If you're running matrix, then by definition you're doing a spinup.

ekluzek commented 1 month ago

I updated the questions above following this morning's discussion.

slevis-lmwg commented 1 month ago

Troubleshooting suggests that my gut feeling was wrong.

SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn started cold all along and failed regardless, so I tried the following: I turned matrixcn off and ran the case to generate a restart file, then turned matrixcn on and set finidat to that restart file. The simulation failed at the same line as before.

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn never started cold. I turned matrix off and generated a restart file, then turned SASU on and set finidat to that restart file. The simulation failed at the same line as before.

1x1 matrix tests that pass:

ERS_Lm54_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly--clm-matrixcnOn
ERS_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianRs.izumi_intel.clm-cropMonthOutput--clm-matrixcnOn_ignore_warnings
ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_intel.clm-cropMonthlyNoinitial--clm-matrixcnOn

slevis-lmwg commented 1 month ago

Trying a Clm6 version and Clm6 DEBUG version of the first in the above list of already passing tests:

PASS ERS_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
PASS ERS_D_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup

and non-DEBUG versions of the failing tests:

PASS SMS_Ld10_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
PASS SMS.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn

So DEBUG must be uncovering a problem in these two. I will think about what I want to try next...

slevis-lmwg commented 1 month ago

I added diagnostic write-statements just before the error gets triggered at SparseMatrixMultiplyMod.F90 line 1246 (the SHR_ASSERT_FL quoted above). Both failing tests fail when they encounter size(filter_actunit_C) = num_actunit_C. This seems like a non-dealbreaker to me, so I changed the ASSERT to ">=". Allowing the equality gets the currently failing tests to pass without triggering other problems.
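
The one-character change, for clarity (before and after; the original line is quoted from the code excerpt earlier in this issue):

    ! Before: aborts when the filter array is exactly full
    SHR_ASSERT_FL((size(filter_actunit_C) >  num_actunit_C), sourcefile, __LINE__)
    ! After: allows size(filter_actunit_C) == num_actunit_C
    SHR_ASSERT_FL((size(filter_actunit_C) >= num_actunit_C), sourcefile, __LINE__)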

slevis-lmwg commented 1 month ago

@ekluzek I will run this by you before I open a PR with this code change.

My branch is in this directory: /glade/work/slevis/git/LMWG_dev8. I will open the PR after pushing with git push -u slevis-lmwg fix_1x1_matrix_fails.

ekluzek commented 1 month ago

@slevis-lmwg that's correct, the inequality should be >= rather than just >. The point there is just to make sure the array size isn't too small. The array must have been larger than the count every time previously; I'd have to think about why that's the case...

I'm glad you were able to figure that out.