E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

TSC test failing in elm build-namelist #5072

Open rljacob opened 2 years ago

rljacob commented 2 years ago

The test has been failing for months, possibly because of #4759, but this is what is currently in the dashboard.

2022-07-09 03:16:20: ERROR: Command: '/gpfs/fs1/home/e3smtest/jenkins/workspace/ACME_chrysalis_atmnbfb/ACME/components/elm/bld/build-namelist -infile /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elmconf/namelist -csmdata /lcrc/group/e3sm/data/inputdata -inputdata /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elm.input_data_list -ignore_ic_year -namelist " &elm_inparm start_ymd=00010101 /" -use_case 2010_CMIP6_control -res ne4np4 -clm_start_type default -envxml_dir /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128 -l_ncpl 43200 -lnd_frac /lcrc/group/e3sm/data/inputdata/share/domains/domain.lnd.ne4np4_oQU240.160614.nc -glc_nec 0 -co2_ppmv 388.717 -co2_type diagnostic -ncpl_base_period day -config /lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elmconf/config_cache.xml -bgc sp -mask oQU240' failed with error 'ERROR(Build::Namelist::_parse_next): expect a equal '=' sign or '+=', instead got: form of' from dir '/lcrc/group/e3sm/e3smtest/scratch/chrys/J/TSC.ne4_oQU240.F2010.chrysalis_intel.C.JNextAtm_nbfb20220709_003128/Buildconf/elmconf'

huiwanpnnl commented 2 years ago

Hmm, I'm not familiar with the land model and I don't have access to Chrysalis. Looks like we might have gotten a text format issue when specifying initial conditions for ELM?

mt5555 commented 2 years ago

I've been running the TSC test on Anvil (but TSC.ne4_ne4.F2010-CICE) without any trouble. I just ran this exact test (TSC.ne4_oQU240.F2010) on Chrysalis, via "create_test" at the command line and reproduced this error.

For this test, the tsc.py script is creating user_nl_elm_* files (one for each instance). For some reason, with this particular grid/compset, the files look like this (extraneous characters, "he form of", are left in the file):

finidat = '/lcrc/group/e3sm/data/inputdata/lnd/clm2/initdata/ne4_oQU240_v2_init/20210915.v2.ne4_oQU240.F2010.elm.r.0002-12-01-00000.nc'
dtime = 2
he form of
! namelist_var = new_namelist_value
!
! Include namelist variables for drv_flds_in ONLY if -megan and/or -drydep options
! are set in the CLM_NAMELIST_OPTS env variable.
!

rljacob commented 2 years ago

The run last night has a different error:

2022-07-20 05:11:07: Exception during BASELINE:
ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
multiprocessing.pool.RemoteTraceback: 

mt5555 commented 2 years ago

This seems to be a strange file writing glitch on Chrysalis. The user_nl_elm_???? files (multi-instance user_nl_elm files) get corrupted during the run stage by the TSC Python code, tsc.py.

During ./create_test, before the build phase, CIME creates a set of user_nl_elm_???? files, each containing only comments (see below).

The job is then submitted to the queue. When case.run starts running (on a compute node), it runs the tsc.py code. This code opens each user_nl_elm_???? file in "w" mode and writes two lines to it. This should erase all the old content and leave a file containing just those two lines. Instead, it seems to overwrite only the first part of the file with those two lines, leaving a corrupted file as the final result.
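For reference, the expected behavior of Python's "w" mode can be checked in isolation. This is a minimal sketch, not the tsc.py code: the file name, the header text, and the helper names are made up for illustration, and it only shows what a well-behaved filesystem should do.

```python
import os
import tempfile

# Hypothetical stand-in for the comment header CIME writes into user_nl_elm_????.
OLD_CONTENT = "! Users should add all user specific namelist changes below\n" * 20
# Hypothetical two-line payload, shaped like what tsc.py writes.
NEW_CONTENT = "finidat = 'some_init_file.nc'\ndtime = 2\n"


def rewrite_with_w_mode(path, text):
    """Open an existing file in "w" mode, write text, and return the file's content."""
    with open(path, "w") as handle:
        handle.write(text)
    with open(path) as handle:
        return handle.read()


def demo():
    """Create a long file, rewrite it in "w" mode, and return what is left on disk."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "user_nl_elm_0001")
        with open(path, "w") as handle:
            handle.write(OLD_CONTENT)
        return rewrite_with_w_mode(path, NEW_CONTENT)


if __name__ == "__main__":
    # "w" mode truncates on open, so none of the old header should survive.
    print(demo() == NEW_CONTENT)
```

If a check like this passed on a login node but the files written on a compute node still kept their old tail, that would point at the filesystem or the run environment rather than at the Python logic.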

Then, the elm buildnml script chokes and errors out when processing these bad files.
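A quick way to spot this kind of corruption before build-namelist runs is to scan each file for lines that are neither comments nor assignments. The sketch below is an illustration only: `find_bad_namelist_lines` is a hypothetical helper, not part of CIME or ELM, and it deliberately ignores real namelist features such as continuation lines and `&group ... /` markers.

```python
def find_bad_namelist_lines(text):
    """Return (line_number, line) pairs that are not blank, not comments, and have no '='.

    Simplifying assumption: every non-comment line in a user_nl file is a
    single `name = value` assignment.
    """
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("!"):
            continue  # blank lines and Fortran-style comments are fine
        if "=" not in stripped:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    # Shortened copy of the corrupted file shown in this issue (path replaced).
    corrupted = (
        "finidat = 'some_init_file.nc'\n"
        "dtime = 2\n"
        "he form of\n"
        "! namelist_var = new_namelist_value\n"
        "!\n"
    )
    print(find_bad_namelist_lines(corrupted))
```

On the shortened example, the only flagged line is the stray `he form of` fragment, which is the same token the build-namelist parser complains about.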

Here's a workaround (it has worked consistently in 5 tests):

diff --git a/CIME/SystemTests/tsc.py b/CIME/SystemTests/tsc.py
index 330677383..2032a33cb 100644
--- a/CIME/SystemTests/tsc.py
+++ b/CIME/SystemTests/tsc.py
@@ -111,6 +111,7 @@ class TSC(SystemTestsCommon):
                 f"user_nl_{self.lndmod}_" + str(iinst).zfill(4), "w"
             ) as lndnlfile:

+                lndnlfile.truncate(0)
                 fatm_in = os.path.join(
                     csmdata_atm,
                     INIT_COND_FILE_TEMPLATE.format(self.atmmodIC, "i", iinst),
@@ -139,6 +140,8 @@ class TSC(SystemTestsCommon):
                         "".join(["'{}',".format(s) for s in VAR_LIST])[:-1]
                     )
                 )
+                atmnlfile.close()
+                lndnlfile.close()

         # Force rebuild namelists
         self._skip_pnl = False
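A different way to avoid depending on in-place truncation is to write the new content to a temporary file and then swap it into place. The sketch below is an alternative under stated assumptions, not the fix applied in CIME: `write_namelist_atomically` is a hypothetical helper, and it has not been tried on Chrysalis, so whether it sidesteps the glitch there is unknown.

```python
import os
import tempfile


def write_namelist_atomically(path, lines):
    """Replace the file at path with the given lines via a temporary file.

    The new content is written to a temporary file in the same directory,
    flushed to disk, and then renamed over the target, so the old content
    is never partially overwritten.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".user_nl_tmp_")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call such as `write_namelist_atomically("user_nl_elm_0001", ["finidat = '...'", "dtime = 2"])` would leave a file holding exactly those two lines, regardless of how long the old file was.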

user_nl_elm_???? files before the build phase (and after the build phase):

!----------------------------------------------------------------------------------
! Users should add all user specific namelist changes below in the form of
! namelist_var = new_namelist_value
!
! Include namelist variables for drv_flds_in ONLY if -megan and/or -drydep options
! are set in the CLM_NAMELIST_OPTS env variable.
!
! EXCEPTIONS:
! Set use_cndv           by the compset you use and the CLM_BLDNML_OPTS -dynamic_vegetation setting
! Set use_vichydro       by the compset you use and the CLM_BLDNML_OPTS -vichydro           setting
! Set use_cn             by the compset you use and CLM_BLDNML_OPTS -bgc  setting
! Set use_crop           by the compset you use and CLM_BLDNML_OPTS -crop setting
! Set spinup_state       by the CLM_BLDNML_OPTS -bgc_spinup      setting
! Set irrigate           by the CLM_BLDNML_OPTS -irrig           setting
! Set co2_ppmv           with CCSM_CO2_PPMV                      option
! Set dtime              with L_NCPL                             option
! Set fatmlndfrc         with LND_DOMAIN_PATH/LND_DOMAIN_FILE    options
! Set finidat            with RUN_REFCASE/RUN_REFDATE/RUN_REFTOD options for hybrid or branch cases
!                        (includes $inst_string for multi-ensemble cases)
! Set glc_grid           with CISM_GRID                          option
! Set glc_smb            with GLC_SMB                            option
! Set maxpatch_glcmec    with GLC_NEC                            option
! Set glc_do_dynglacier  with GLC_TWO_WAY_COUPLING               env variable
!----------------------------------------------------------------------------------

user_nl_elm_???? files after tsc.py is run on the compute node:

finidat = '/lcrc/group/e3sm/data/inputdata/lnd/clm2/initdata/ne4_oQU240_v2_init/20210915.v2.ne4_oQU240.F2010.elm.r.0002-12-01-00000.nc'
dtime = 2
he form of
! namelist_var = new_namelist_value
!
! Include namelist variables for drv_flds_in ONLY if -megan and/or -drydep options
! are set in the CLM_NAMELIST_OPTS env variable.
!
! EXCEPTIONS:
! Set use_cndv           by the compset you use and the CLM_BLDNML_OPTS -dynamic_vegetation setting
! Set use_vichydro       by the compset you use and the CLM_BLDNML_OPTS -vichydro           setting
! Set use_cn             by the compset you use and CLM_BLDNML_OPTS -bgc  setting
! Set use_crop           by the compset you use and CLM_BLDNML_OPTS -crop setting
! Set spinup_state       by the CLM_BLDNML_OPTS -bgc_spinup      setting
! Set irrigate           by the CLM_BLDNML_OPTS -irrig           setting
! Set co2_ppmv           with CCSM_CO2_PPMV                      option
! Set dtime              with L_NCPL                             option
! Set fatmlndfrc         with LND_DOMAIN_PATH/LND_DOMAIN_FILE    options
! Set finidat            with RUN_REFCASE/RUN_REFDATE/RUN_REFTOD options for hybrid or branch cases
!                        (includes $inst_string for multi-ensemble cases)
! Set glc_grid           with CISM_GRID                          option
! Set glc_smb            with GLC_SMB                            option
! Set maxpatch_glcmec    with GLC_NEC                            option
! Set glc_do_dynglacier  with GLC_TWO_WAY_COUPLING               env variable
!----------------------------------------------------------------------------------

mt5555 commented 2 years ago

With the above fix (explicit file truncate after opening in "w" mode), the code will run, but then abort in MPAS_SI. MPAS_SI is detecting some unphysical variables when running with the very small timesteps used by TSC. As this is an atmospheric physics test, a simple workaround could be to switch the compset to F2010-CICE.

mkstratos commented 2 years ago

@mt5555 Thanks for sorting out what was going on with the elm namelist. The above fix in CIME should work (it has in testing!).

As for the MPAS-SI issue, in my testing the behavior depends on the init data file (month). With dtime = 1, the stall can happen at a different time step, or not at all, depending on the init file. For a single-instance run on the F2010 compset, only the runs initialized with months 0002-03 and 0002-05 fail among the year 2 init files (/lcrc/group/e3sm/data/inputdata/atm/cam/inic/homme/ne4_v2_init/20210915.v2.ne4_oQU240.F2010.eam.i.0002-MM-01-00000.nc).

As you say, the simple fix is to use F2010-CICE, which works at dtime = 1, but I wonder whether there's a need to fix the underlying issue?

mt5555 commented 2 years ago

@mkstratos: please switch to F2010-CICE so we get this test passing in the dashboard again. Then make a TSC issue that MPAS-SI crashes when running in thermodynamic mode (F cases) with small timesteps and assign it to @eclare108213. Getting that fixed in MPAS-SI might be difficult to prioritize and may take some time to be addressed.

jonbob commented 2 years ago

@mkstratos - could you please tell me how far I need to try to run a test case to see this problem? I'd like to see if I can replicate it in a single run, starting from one of the inic files that seems to give mpassi trouble and with dtime = 1

mkstratos commented 2 years ago

Based on the March inic files, it stalls after 73 time steps (/lcrc/group/e3sm/ac.mkelleher/scratch/chrys/F2010.ne4_oQU240.03_test/run).

jonbob commented 2 years ago

Thanks @mkstratos -- that's exactly what I needed

jonbob commented 2 years ago

I've tested a fix to get mpas-seaice happy with the 1-sec dtime, and passed it along to our main mpas-seaice developer.