ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

NEON tests sometimes fail because of network issues #2310

Open samsrabin opened 6 months ago

samsrabin commented 6 months ago

Brief summary of bug

The NEON tests in aux_clm sometimes fail because the network is unreachable.

General bug information

CTSM version you are using: ctsm5.1.dev162

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: NEON tests.

Details of bug

The tests:

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.derecho_gnu.clm-default--clm-NEON-NIWO
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.derecho_gnu.clm-NEON-MOAB--clm-PRISM
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_gnu.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_gnu.clm-FatesPRISM--clm-NEON-FATES-YELL
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51SpRs.derecho_gnu.clm-default--clm-NEON-TOOL

@adrifoster and @ekluzek suggest that this may be a result of problems with the NEON server and/or Derecho compute nodes' (in)ability to connect to the outside world.

Important output or errors that show the problem

2024-01-08 11:17:01: Test 'SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Bgc.derecho_gnu.clm-default--clm-NEON-NIWO' failed in phase 'SETUP' with exception 'ERROR: Fatal error in case.cmpgen_namelists: 2024-01-08 11:16:56 atm
Create namelist for component datm
   Calling /glade/u/home/samrabin/ctsm_tillage-and-residues4/components/cdeps/datm/cime_config/buildnml
WARNING: No .input_data_list files found in dir 'Buildconf'
Using protocol wget with user None and passwd None
wget failed with output:  and errput --2024-01-08 11:16:58--  https://storage.neonscience.org/neon-ncar/listing.csv
Resolving storage.neonscience.org (storage.neonscience.org)... 34.110.164.243
Connecting to storage.neonscience.org (storage.neonscience.org)|34.110.164.243|:443... failed: Network is unreachable.

ERROR: Could not download NEON data listing file from server'
  File "/glade/u/home/samrabin/ctsm_tillage-and-residues4/cime/CIME/test_scheduler.py", line 1125, in _run_catch_exceptions
    return run(test)
  File "/glade/u/home/samrabin/ctsm_tillage-and-residues4/cime/CIME/test_scheduler.py", line 1016, in _setup_phase
    "Fatal error in case.cmpgen_namelists: {}".format(output),
  File "/glade/u/home/samrabin/ctsm_tillage-and-residues4/cime/CIME/utils.py", line 175, in expect
    raise exc_type(msg)

The failure seems to happen during SHAREDLIB_BUILD or RUN, although sometimes the former stays marked as PEND in TestStatus even after the job has ended (maybe a timeout?).
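To tell a network problem apart from other setup failures, one option is to probe the server before launching the tests. The following is a minimal sketch, not part of CTSM or CIME: the function name `can_reach` and its parameters are hypothetical, and the host name is the one shown in the wget output above.

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within timeout.

    `connect` is injectable (an assumption of this sketch) so the check can
    be exercised without touching the network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Covers "Network is unreachable", DNS failures, refusals, and timeouts.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    # Host taken from the wget output in this issue.
    print(can_reach("storage.neonscience.org"))
```

A result of False from a compute node but True from a login node would point toward the node's connectivity rather than the NEON server itself.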

ekluzek commented 3 months ago

I've seen a different problem from the one above with these tests. With certain testmods and test names, the filenames for the NEON datm forcing can exceed 256 characters, which is the current limit in datm. For example, for:

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO

one of the filenames is:

/derecho/scratch/erik/tests_alpha-ctsm52mksrf25_ctsm51d174acl/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm51Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO.GC.alpha-ctsm52mksrf25_ctsm51d174acl_int/run/inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc

which is 269 characters. I got the case to run by increasing the allowed filename length in CDEPS from shr_kind_cl to shr_kind_cx, which is 512. Another option for NEON would be to use a relative path for the files, for example:

inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc

which is shorter and more readable.
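A quick way to spot this before a run is to check path lengths against the limit. This is a rough sketch only: the helper names are hypothetical, the 256-character limit is the one quoted in this comment, and the paths in the usage example are made up for illustration.

```python
import os

# Limit quoted in this thread for datm filenames (shr_kind_cl).
DATM_MAX_FILENAME_LEN = 256


def overlong_paths(paths, limit=DATM_MAX_FILENAME_LEN):
    """Return the paths whose length exceeds `limit` characters."""
    return [p for p in paths if len(p) > limit]


def relative_to_run_dir(path, run_dir):
    """Express an absolute `path` relative to `run_dir` (no file access)."""
    return os.path.relpath(path, start=run_dir)


if __name__ == "__main__":
    # Hypothetical case directory, for illustration only.
    run_dir = "/scratch/user/some_long_case_name/run"
    full = run_dir + "/inputdata/atm/cdeps/v2/NIWO/NIWO_atm_2018-01.nc"
    print(len(full), overlong_paths([full]))
    print(relative_to_run_dir(full, run_dir))
```

The relative form does not depend on the length of the test name, which is why it sidesteps the limit for long testmod combinations.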

This doesn't address the server issue described at the top of this issue. For that, my guess is that you just need to run

./check_input_data --download

in your test directory (possibly a few times) at a time when the server is up and the data can be transferred.
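Running the command "a few times" can be automated with a small retry loop. The sketch below is an illustration under assumptions: the function name, the attempt count, and the wait time are hypothetical, and it assumes `./check_input_data --download` exits with a nonzero status when the download fails, which has not been verified here.

```python
import subprocess
import time


def download_with_retries(cmd=("./check_input_data", "--download"), attempts=3,
                          wait_seconds=300, run=subprocess.run, sleep=time.sleep):
    """Run `cmd` until it exits with status 0 or `attempts` is exhausted.

    Returns the number of the attempt that succeeded, or 0 if none did.
    `run` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        result = run(list(cmd))
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give the server some time to come back before trying again.
            sleep(wait_seconds)
    return 0


if __name__ == "__main__":
    # Run from the case (test) directory.
    print(download_with_retries())
```

The wait between attempts is a guess; it should be tuned to however long the outages tend to last.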

samsrabin commented 3 months ago

@ekluzek That seems more related to #2322 than this issue. I'll change the title of this issue to be more specific.

Also, for posterity: In some SE meeting, we decided that the fix for this issue would be to stop relying on the NEON servers in these tests. Instead, we'll download the necessary data somewhere and just point to that.

ekluzek commented 3 months ago

I also just ran into trouble with this in the python system tests. I hadn't seen this before, so I'm documenting it here:

When this comes up, the tests hang for a long time (relative to their normal speed) before failing, so a slow test is an indicator of this problem. The server issue also tends to persist for several minutes, can clear up on its own 5-10 minutes later, and then recur after a similar interval.

(ctsm_pylib) ctsm5.1.dev175/python> ./run_ctsm_py_tests --sys
................
Inactive Modules:
  1) hdf5/1.12.2     2) intel/2023.0.0     3) ncarcompilers/1.0.0     4) netcdf/4.9.2

Due to MODULEPATH changes, the following have been reloaded:
  1) conda/latest     2) craype/2.7.20

The following have been reloaded with a version change:
  1) cdo/2.1.1 => cdo/2.3.0     2) ncarenv/23.06 => ncarenv/23.09     3) nco/5.1.4 => nco/5.1.9     4) ncview/2.1.8 => ncview/2.1.9

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) cesmdev/1.0   2) ncarenv/23.09
Done converting /glade/derecho/scratch/erik/tmp/tmpug76mhug/scrip.nc
...E
Stdout:
in neonsite adding usermodsdirs
usermodsdirs: ['/glade/derecho/scratch/erik/ctsm5.1.dev175/cime_config/usermods_dirs/NEON/BART']
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---
Time required to building the base case: 397.0775320529938 s.
using this version: latest
---- cloning the base case in /glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient
Model datm missing file file1 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-01.nc'
Model datm missing file file2 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-02.nc'
Model datm missing file file3 = '
.
.
.
Model datm missing file file68 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2023-08.nc'
Model datm missing file file69 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2023-09.nc'
Model ctsm missing file finidat = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/lnd/ctsm/initdata/BART.2022-11-11.clm2.r.0418-01-01-00000.nc'

======================================================================
ERROR: test_one_site (test.test_sys_run_neon.TestSysRunNeon)
This test specifies a site to run
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/test/test_sys_run_neon.py", line 57, in test_one_site
    main("")
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/run_neon.py", line 241, in main
    experiment,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/neon_site.py", line 103, in run_case
    base_case_root, run_type, prism, run_length, user_version, tower_type, user_mods_dirs
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/python/ctsm/site_and_regional/tower_site.py", line 416, in run_case
    case.submit(no_batch=no_batch)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 277, in submit
    is_batch=is_batch,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 2480, in run_and_log_case_status
    rv = func()
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 270, in <lambda>
    dryrun=dryrun,
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 163, in _submit
    case.check_case(skip_pnl=skip_pnl, chksum=chksum)
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/case/case_submit.py", line 358, in check_case
    "Build complete is not True please rebuild the model by calling case.build",
  File "/glade/derecho/scratch/erik/ctsm5.1.dev175/cime/CIME/utils.py", line 176, in expect
    raise exc_type(msg)
CIME.utils.CIMEError: ERROR: Build complete is not True please rebuild the model by calling case.build

Stdout:
in neonsite adding usermodsdirs
usermodsdirs: ['/glade/derecho/scratch/erik/ctsm5.1.dev175/cime_config/usermods_dirs/NEON/BART']
---- building a base case -------
---- creating a base case -------
---- base case created ------
---- base case setup ------
---- base case build ------
--- This may take a while and you may see WARNING messages ---
Time required to building the base case: 397.0775320529938 s.
using this version: latest
---- cloning the base case in /glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient
Model datm missing file file1 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-01.nc'
Model datm missing file file2 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-02.nc'
Model datm missing file file3 = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/atm/cdeps/v3/BART/BART_atm_2018-03.nc'
.
.
.
Model ctsm missing file finidat = '/glade/derecho/scratch/erik/tmp/tmpz7cs37xq/BART.transient/run/inputdata/lnd/ctsm/initdata/BART.2022-11-11.clm2.r.0418-01-01-00000.nc'

----------------------------------------------------------------------
Ran 20 tests in 457.158s

FAILED (errors=1)
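When output like the above is long, it can help to pull out just the files that were reported missing. The following is a sketch only: the regular expression is inferred from the "Model ... missing file ... = '...'" lines in this log rather than from any documented CIME format, and the function name is hypothetical.

```python
import re

# Matches lines such as:
#   Model datm missing file file1 = '/path/to/file.nc'
# Lines whose quoted path is cut off are deliberately not matched.
_MISSING = re.compile(r"^Model (\S+) missing file (\S+) = '([^']+)'\s*$")


def missing_inputs(log_text):
    """Return (component, variable, path) tuples for each complete missing-file line."""
    found = []
    for line in log_text.splitlines():
        match = _MISSING.match(line.strip())
        if match:
            found.append(match.groups())
    return found


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < test_output.log
    for component, variable, path in missing_inputs(sys.stdin.read()):
        print(component, variable, path)
```

The resulting list can then be compared against what is actually present in the local inputdata directory.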