ekluzek opened this issue 2 months ago
This is the only test we have for mimics_matrixcn. It's also possible that the tests that passed would fail if run out far enough.
Here's the note about this test when it was added.
https://github.com/ESCOMP/CTSM/pull/640#issuecomment-1074302305
I'm also doing some longer and different tests in ctsm5.2.028 to see whether the test just happened to pass because it was too short, as well as making sure the same test works without MIMICS.
Longer tests and tests at f10 in ctsm5.2.028 seem to be fine.
SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D_Lm1.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
SMS_Ly2.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
So maybe there is something specific about this test with the ctsm5.3.0 datasets.
We'll mark this as an expected fail for now though.
The other tests that fail in the same way are:
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-default--clm-NEON-HARV--clm-matrixcnOn
My gut feeling is that these tests need new finidat files, based on past experiences where CNmatrix has crashed with one finidat and not with another (#2592).
E.g., the nearest neighbor in the finidat file may not contain the right PFT combinations needed for these single-point simulations.
In one of the failing tests, I changed finidat from
ctsm52026_f09_pSASU.clm2.r.0421-01-01-00000.nc
to
clmi.f19_interp_from.I1850Clm50BgcCrop-ciso.1366-01-01.0.9x1.25_gx1v7_simyr1850_c240223.nc
and the test failed in a different timestep.
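For reference, a swap like that goes in user_nl_clm; a minimal sketch is below, where the directory is a placeholder and the use_init_interp line is my assumption about what a coarse-grid file would need for a single-point case:

```fortran
! user_nl_clm sketch -- the directory below is a placeholder, not the real path
finidat = '/path/to/inputdata/clmi.f19_interp_from.I1850Clm50BgcCrop-ciso.1366-01-01.0.9x1.25_gx1v7_simyr1850_c240223.nc'
use_init_interp = .true.   ! assumed: interpolate the coarse-grid file to the single point
```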
Next I want to try setting finidat to the interpolated file saved in
.../tests_0923-141750de/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn.GC.0923-141750de_gnu/run/init_generated_files/
Hmm, but that may do nothing to help. I may need to generate a new finidat for this point starting from a cold start simulation.
A broader question that @slevis-lmwg and I would like the group to assess (discussed at the CTSM SE meeting on Oct 10, 2024):
Maybe matrix tests always need to start from a cold start? If you're running matrix, then by definition you're doing a spinup.
I updated the questions above based on the morning's discussion.
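For concreteness, "starting cold" here means not pointing finidat at any file; a minimal sketch of what that looks like in user_nl_clm (treat this as illustrative only, since in practice the xml variable CLM_FORCE_COLDSTART may also need to be set):

```fortran
! user_nl_clm sketch -- request arbitrary (cold-start) initial conditions;
! in practice this may also require CLM_FORCE_COLDSTART=on in the xml settings
finidat = ' '
```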
Troubleshooting suggests that my gut feeling was wrong.
SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
started cold all along and it failed regardless, so I tried the following:
I turned off matrixcn and ran the case to generate a restart file. Then I turned on matrixcn and set finidat to this restart file. The simulation failed on the same line as before.
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
never started cold. I turned off matrix and generated a restart file. Then I turned on SASU and set finidat to this restart file. The simulation failed on the same line as before. Both experiments followed the same two-step recipe, sketched below.
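A sketch of the step-2 namelist for that recipe; the restart file name is hypothetical, and spinup_matrixcn is my guess at the SASU switch:

```fortran
! Step 1 (not shown): run with use_matrixcn = .false. and keep a restart file.
! Step 2 sketch: start from that restart with the matrix/SASU options back on.
finidat          = './step1_no_matrix.clm2.r.0001-01-11-00000.nc'  ! hypothetical file name
use_matrixcn     = .true.
spinup_matrixcn  = .true.   ! SASU on -- this flag name is an assumption
```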
1x1 matrix tests that pass:
ERS_Lm54_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly--clm-matrixcnOn
ERS_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianRs.izumi_intel.clm-cropMonthOutput--clm-matrixcnOn_ignore_warnings
ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_intel.clm-cropMonthlyNoinitial--clm-matrixcnOn.GC.1014-115134iz_int
Trying a Clm6 version and a Clm6 DEBUG version of the first test in the above list of already-passing tests:
PASS ERS_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
PASS ERS_D_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
and non-DEBUG versions of the failing tests:
PASS SMS_Ld10_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
PASS SMS.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
So DEBUG must be uncovering a problem in these two. I will think about what I want to try next...
I added diagnostic write-statements just before the error gets triggered in SparseMatrixMultiplyMod.F90 line 1246:
SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
and both failing tests fail when they encounter
size(filter_actunit_C) = num_actunit_C
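The exact write-statements aren't reproduced in this issue; a minimal sketch of the kind of diagnostic meant, using the variable names from the assertion (iulog is CLM's log unit):

```fortran
! Hypothetical diagnostic placed just before the SHR_ASSERT_FL call;
! iulog comes from clm_varctl via a use statement in the declarations.
write(iulog,*) 'size(filter_actunit_C) = ', size(filter_actunit_C), &
     ' num_actunit_C = ', num_actunit_C
```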
This seems like a non-dealbreaker to me, so I changed the ASSERT to ">=".
Allowing equality gets the currently failing tests to pass without triggering other problems.
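Concretely, the change is just the comparison operator in that assertion (a sketch of the edited line; the surrounding code in SparseMatrixMultiplyMod.F90 is unchanged):

```fortran
! was: SHR_ASSERT_FL((size(filter_actunit_C) >  num_actunit_C), sourcefile, __LINE__)
SHR_ASSERT_FL((size(filter_actunit_C) >= num_actunit_C), sourcefile, __LINE__)
```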
@ekluzek I will run this by you before I open a PR with this code change.
My branch is in this directory: /glade/work/slevis/git/LMWG_dev8. I will push it with git push -u slevis-lmwg fix_1x1_matrix_fails and open the PR.
@slevis-lmwg that's correct, the inequality should be >= rather than just >. The point there is just to make sure the array size isn't too small. The array must have been larger every time previously. I'd have to think about why that's the case...
I'm glad you were able to figure that out.
From ctsm5.2.dev175 to ctsm5.3.0 we've been running a test with MIMICS plus the above-ground CN matrix that has been passing. The test is SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn. This has the soil CN matrix off (because MIMICS is non-linear) but the above-ground CN matrix on (use_soil_matrixcn = .false., use_matrixcn = .true.).
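In namelist terms the testmod amounts to something like the following sketch (the soil_decomp_method value is my recollection of the MIMICS setting and may not match the testmod exactly):

```fortran
! user_nl_clm sketch of the mimics_matrixcn configuration
soil_decomp_method = 'MIMICSWieder2015'   ! MIMICS -- exact value is an assumption
use_matrixcn       = .true.               ! above-ground CN matrix on
use_soil_matrixcn  = .false.              ! soil CN matrix off (MIMICS is non-linear)
```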
There are two reasons for doing this test:
The hope for "1" was especially strong, as we weren't finding methods to speed up the spinup of MIMICS. The test did pass for 30 tags and just started failing in ctsm5.3.0, with the following type of error in the log files:
lnd.log:
cesm.log:
The line it fails on from above is the SHR_ASSERT_FL in this section of code in SparseMatrixMultiplyMod.F90:
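That section isn't reproduced here, but the assertion in question is the one quoted in the troubleshooting comments above (a sketch, not the full code block):

```fortran
! SparseMatrixMultiplyMod.F90, near line 1246 (surrounding code omitted)
SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
```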
The call in CNVegMatrixMod.F90 is here:
Definition of done: