Open rljacob opened 2 years ago
Hmm, I'm not familiar with the land model and I don't have access to Chrysalis. Looks like we might have gotten a text format issue when specifying initial conditions for ELM?
I've been running the TSC test on Anvil (but TSC.ne4_ne4.F2010-CICE) without any trouble. I just ran this exact test (TSC.ne4_oQU240.F2010) on Chrysalis, via "create_test" at the command line and reproduced this error.
For this test, the tsc.py script creates user_nl_eam_* files (one for each instance). For some reason, with this particular grid/compset, the files look like this (extraneous characters, "he form of", end up in the file):
finidat = '/lcrc/group/e3sm/data/inputdata/lnd/clm2/initdata/ne4_oQU240_v2_init/20210915.v2.ne4_oQU240.F2010.elm.r.0002-12-01-00000.nc'
dtime = 2
he form of
! namelist_var = new_namelist_value
!
! Include namelist variables for drv_flds_in ONLY if -megan and/or -drydep options
! are set in the CLM_NAMELIST_OPTS env variable.
!
The run last night has a different error:
2022-07-20 05:11:07: Exception during BASELINE:
ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
multiprocessing.pool.RemoteTraceback:
This seems to be a strange file-writing glitch on Chrysalis. The user_nl_elm_???? files (multi-instance user_nl_elm files) get corrupted during the run stage by the TSC Python code, tsc.py.
During ./create_test, before the build phase, CIME creates the user_nl_elm_???? files, each containing a block of comments (see below).
The job is then submitted to the queue. When case.run starts running (on a compute node), it runs the tsc.py code. This code opens each user_nl_elm_???? file in "w" mode and writes two lines to it. This should erase all the old content and leave a file containing just those two lines. Instead, it seems to overwrite only the first part of the file with those two lines, leaving the rest of the old content in place and a corrupted file as the final result.
Then, the elm buildnml script chokes and errors out when processing these bad files.
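For anyone who wants to check a case directory for this kind of corruption before buildnml runs, here is a minimal sketch of a line checker. It is not the ELM build-namelist parser, only a simplified stand-in that flags lines that are neither blank, comments (starting with "!"), nor of the form name = value or name += value. The function name find_bad_namelist_lines is hypothetical, and the pattern deliberately ignores more complex namelist syntax such as array indices.

```python
import re

# Simplified assumption: a valid entry starts with an identifier followed by
# "=" or "+=". This is NOT the real build-namelist grammar.
_ASSIGNMENT = re.compile(r"^[A-Za-z_]\w*\s*\+?=")


def find_bad_namelist_lines(text):
    """Return (line_number, line) pairs that are not blank, comments, or assignments."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and "!" comment lines are always acceptable.
        if not stripped or stripped.startswith("!"):
            continue
        if not _ASSIGNMENT.match(stripped):
            bad.append((lineno, line))
    return bad
```

Run against the corrupted file contents shown in this thread, this would flag the leftover "he form of" fragment, which is the same text the build-namelist error complains about.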
Here's a workaround (it has worked consistently in 5 tests):
diff --git a/CIME/SystemTests/tsc.py b/CIME/SystemTests/tsc.py
index 330677383..2032a33cb 100644
--- a/CIME/SystemTests/tsc.py
+++ b/CIME/SystemTests/tsc.py
@@ -111,6 +111,7 @@ class TSC(SystemTestsCommon):
f"user_nl_{self.lndmod}_" + str(iinst).zfill(4), "w"
) as lndnlfile:
+ lndnlfile.truncate(0)
fatm_in = os.path.join(
csmdata_atm,
INIT_COND_FILE_TEMPLATE.format(self.atmmodIC, "i", iinst),
@@ -139,6 +140,8 @@ class TSC(SystemTestsCommon):
"".join(["'{}',".format(s) for s in VAR_LIST])[:-1]
)
)
+ atmnlfile.close()
+ lndnlfile.close()
# Force rebuild namelists
self._skip_pnl = False
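A different way to avoid ever exposing a partially overwritten file is to write the new contents to a temporary file in the same directory and rename it over the original. The sketch below illustrates that pattern only; it is not what the diff above does, it has not been tested on Chrysalis, and whether it would sidestep the glitch seen there is an open question. The helper name write_namelist_atomically is hypothetical.

```python
import os
import tempfile


def write_namelist_atomically(path, lines):
    """Write lines to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".user_nl_tmp_")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write("\n".join(lines) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # old contents are dropped in one step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

One caveat with this approach: tempfile.mkstemp creates the file readable and writable only by the owner, so the replaced file ends up with those permissions unless they are reset afterwards.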
user_nl_elm_???? files before the build phase (and after the build phase):
!----------------------------------------------------------------------------------
! Users should add all user specific namelist changes below in the form of
! namelist_var = new_namelist_value
!
! Include namelist variables for drv_flds_in ONLY if -megan and/or -drydep options
! are set in the CLM_NAMELIST_OPTS env variable.
!
! EXCEPTIONS:
! Set use_cndv by the compset you use and the CLM_BLDNML_OPTS -dynamic_vegetation setting
! Set use_vichydro by the compset you use and the CLM_BLDNML_OPTS -vichydro setting
! Set use_cn by the compset you use and CLM_BLDNML_OPTS -bgc setting
! Set use_crop by the compset you use and CLM_BLDNML_OPTS -crop setting
! Set spinup_state by the CLM_BLDNML_OPTS -bgc_spinup setting
! Set irrigate by the CLM_BLDNML_OPTS -irrig setting
! Set co2_ppmv with CCSM_CO2_PPMV option
! Set dtime with L_NCPL option
! Set fatmlndfrc with LND_DOMAIN_PATH/LND_DOMAIN_FILE options
! Set finidat with RUN_REFCASE/RUN_REFDATE/RUN_REFTOD options for hybrid or branch cases
! (includes $inst_string for multi-ensemble cases)
! Set glc_grid with CISM_GRID option
! Set glc_smb with GLC_SMB option
! Set maxpatch_glcmec with GLC_NEC option
! Set glc_do_dynglacier with GLC_TWO_WAY_COUPLING env variable
!----------------------------------------------------------------------------------
user_nl_elm_???? files after tsc.py is run on the compute node:
finidat = '/lcrc/group/e3sm/data/inputdata/lnd/clm2/initdata/ne4_oQU240_v2_init/20210915.v2.ne4_oQU240.F2010.elm.r.0002-12-01-00000.nc'
dtime = 2
he form of
! namelist_var = new_namelist_value
!
! Include namelist variables for drv_flds_in ONLY if -megan and/or -drydep options
! are set in the CLM_NAMELIST_OPTS env variable.
!
! EXCEPTIONS:
! Set use_cndv by the compset you use and the CLM_BLDNML_OPTS -dynamic_vegetation setting
! Set use_vichydro by the compset you use and the CLM_BLDNML_OPTS -vichydro setting
! Set use_cn by the compset you use and CLM_BLDNML_OPTS -bgc setting
! Set use_crop by the compset you use and CLM_BLDNML_OPTS -crop setting
! Set spinup_state by the CLM_BLDNML_OPTS -bgc_spinup setting
! Set irrigate by the CLM_BLDNML_OPTS -irrig setting
! Set co2_ppmv with CCSM_CO2_PPMV option
! Set dtime with L_NCPL option
! Set fatmlndfrc with LND_DOMAIN_PATH/LND_DOMAIN_FILE options
! Set finidat with RUN_REFCASE/RUN_REFDATE/RUN_REFTOD options for hybrid or branch cases
! (includes $inst_string for multi-ensemble cases)
! Set glc_grid with CISM_GRID option
! Set glc_smb with GLC_SMB option
! Set maxpatch_glcmec with GLC_NEC option
! Set glc_do_dynglacier with GLC_TWO_WAY_COUPLING env variable
!----------------------------------------------------------------------------------
With the above fix (explicit file truncate after opening in "w" mode), the code will run, but then abort in MPAS_SI. MPAS_SI is detecting some unphysical variables when running with the very small timesteps used by TSC. As this is an atmospheric physics test, a simple workaround could be to switch the compset to F2010-CICE.
@mt5555 Thanks for sorting out what was going on with the elm namelist, the above fix in CIME should work (it has in testing!)
As for the MPAS-SI issue, in my testing it behaves differently depending on the init data file (month). With dtime = 1, the stall can happen at a different time step, or not at all, depending on the init file. For a single-instance run on the F2010 compset, among the year 2 files only the runs initialized with months 0002-03 and 0002-05 fail (/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/ne4_v2_init/20210915.v2.ne4_oQU240.F2010.eam.i.0002-MM-01-00000.nc).
As you say, the simple fix is to use F2010-CICE, which works at dtime = 1, but I wonder if there's a need to fix the underlying issue?
@mkstratos : please switch to F2010-CICE so we get this test passing in the dashboard again. Then make a TSC issue that MPAS-SI crashes when running in thermodynamic mode (F cases) with small timesteps and assign to @eclare108213 . Getting that fixed in MPAS-SI might be difficult to prioritize and may take some time to be addressed.
@mkstratos - could you please tell me how far I need to try to run a test case to see this problem? I'd like to see if I can replicate it in a single run, starting from one of the inic files that seems to give mpassi trouble and with dtime = 1
Based on the March inic files
/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/ne4_v2_init/20210915.v2.ne4_oQU240.F2010.eam.i.0002-03-01-00000.nc
/lcrc/group/e3sm/data/inputdata/lnd/clm2/initdata/ne4_oQU240_v2_init/20210915.v2.ne4_oQU240.F2010.elm.r.0002-03-01-00000.nc
it stalls after 73 time steps (/lcrc/group/e3sm/ac.mkelleher/scratch/chrys/F2010.ne4_oQU240.03_test/run).
Thanks @mkstratos -- that's exactly what I needed
I've tested a fix to get mpas-seaice happy with the 1-sec dtime, and passed it along to our main mpas-seaice developer.
The test has been failing for months, possibly because of #4759, but this is what is currently in the dashboard:
2022-07-09 03:16:20: ERROR: Command: '/gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_atmnbfb/ACME/components/elm/bld/build-namelist -infile /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elmconf/namelist -csmdata /lcrc/group/e3sm/data/inputdata -inputdata /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elm.input_data_list -ignore_ic_year -namelist " &elm_inparm start_ymd=00010101 /" -use_case 2010_CMIP6_control -res ne4np4 -clm_start_type default -envxml_dir /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128 -l_ncpl 43200 -lnd_frac /lcrc/group/e3sm/data/inputdata/share/domains/domain.lnd.ne4np4_oQU240.160614.nc -glc_nec 0 -co2_ppmv 388.717 -co2_type diagnostic -ncpl_base_period day -config /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elmconf/config_cache.xml -bgc sp -mask oQU240' failed with error 'ERROR(Build::Namelist::_parse_next): expect a equal '=' sign or '+=', instead got: form of' from dir '/lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elmconf'