E3SM-Project / zppy

E3SM post-processing toolchain
BSD 3-Clause "New" or "Revised" License

[Bug]: e3sm_to_cmip exception running bundles on Unified 1.9.2 #543

Status: Closed (forsyth2 closed this 6 months ago)

forsyth2 commented 7 months ago

What happened?

I was running the "c. test final Unified" steps of https://e3sm-project.github.io/zppy/_build/html/main/dev_guide/release_testing.html, for Unified 1.9.2 (that is, testing what was actually released).

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test_unified_1.9.2/v2.LR.historical_0201/post/scripts/
$ grep -v "OK" *status
# Nothing shows up. Good, complete_run ran successfully.
$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
bundle1.status:ERROR
ts_land_monthly_1850-1851-0002.status:ERROR (5)
$ grep -n ts_land_monthly_1850-1851-0002 bundle1.o463341 
1327:=== ts_land_monthly_1850-1851-0002.bash ===
1377:2024-01-30 03:14:58,326 [INFO]: __main__.py(__init__:147) >>     * output_path='/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002'
1378:2024-01-30 03:14:58,326 [INFO]: __main__.py(__init__:147) >>     * output_path='/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002'
1379:2024-01-30 03:14:58,326_326:INFO:__init__:    * output_path='/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002'
1445:mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory
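
The status check above can also be scripted. This is a minimal Python sketch, not part of zppy: the function name `non_ok_statuses` is hypothetical, and it only assumes what the session shows, namely that each job writes a `*status` file and that a line containing "OK" means success.

```python
from pathlib import Path


def non_ok_statuses(scripts_dir):
    """Return (file name, line) pairs for status lines that lack "OK".

    Mirrors `grep -v "OK" *status` run inside `scripts_dir`.
    """
    results = []
    for status_file in sorted(Path(scripts_dir).glob("*status")):
        for line in status_file.read_text().splitlines():
            if "OK" not in line:
                results.append((status_file.name, line))
    return results
```

Calling `non_ok_statuses(".")` from the `post/scripts` directory should list the same entries as the `grep` command, one tuple per failing status line.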

I see the following in the output file:

2024-01-30 03:15:06,173_173:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 912, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception

However, this appears to happen elsewhere without causing complete failures:

$ grep -n "concurrent/futures/_base.py" bundle1.o463341 
1021:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
1023:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
1430:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
1432:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
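
To see which bundled job each traceback hit belongs to, the hits can be grouped by the nearest preceding section header. This is a rough Python sketch under one assumption: every job in the bundle output starts with a header shaped like the `=== ts_land_monthly_1850-1851-0002.bash ===` line seen earlier. The name `hits_by_section` is hypothetical.

```python
import re

# Header format assumed from the single example in the log above.
SECTION_RE = re.compile(r"^=== (?P<name>\S+) ===$")


def hits_by_section(lines, needle):
    """Return (line number, section name) for every line containing `needle`.

    The section name is the most recent `=== name ===` header, or None if the
    hit appears before any header. Line numbers are 1-based, like `grep -n`.
    """
    current = None
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = SECTION_RE.match(line.strip())
        if match:
            current = match.group("name")
        elif needle in line:
            hits.append((lineno, current))
    return hits
```

For example, `hits_by_section(open("bundle1.o463341").read().splitlines(), "concurrent/futures/_base.py")` would pair each hit with the job script whose section contains it, which helps separate the failing job's traceback from the ones in jobs that still finished.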

What machine were you running on?

Chrysalis

Environment

E3SM Unified 1.9.2 (zppy v2.3.0)

What command did you run?

zppy -c tests/integration/generated/test_bundles_chrysalis.cfg

Copy your cfg file

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_bundles.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201"
partition = "compute"
qos = "regular"
walltime = "07:00:00"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_bundles_www/test_unified_1.9.2"

[bundle]

  [[ bundle2 ]]
  nodes = 2
  walltime = "00:59:00"

[climo]
active = True
bundle = "bundle1"
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave ]]
  frequency = "monthly"

  [[ atm_monthly_diurnal_8xdaily_180x360_aave ]]
  frequency = "diurnal_8xdaily"
  input_files = "eam.h4"
  input_subdir = "archive/atm/hist"
  vars = "PRECT"

[ts]
active = True
bundle = "bundle1"
years = "1850:1854:2",

  [[ atm_monthly_180x360_aave ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  ts_fmt = "cmip"

  [[ atm_daily_180x360_aave ]]
  frequency = "daily"
  input_files = "eam.h1"
  input_subdir = "archive/atm/hist"
  vars = "PRECT"

  [[ atm_monthly_glb ]]
  bundle = "bundle2" # Override bundle1
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"
  years = "1850:1860:5",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "FSH,LAISHA,LAISUN,RH2M"
  ts_fmt = "cmip"

  [[ rof_monthly ]]
  bundle = "bundle3" # Override bundle1, let bundle1 finish first because "e3sm_diags: atm_monthly_180x360_aave_mvm" requires "ts: atm_monthly_180x360_aave"
  extra_vars = 'areatotal2'
  frequency = "monthly"
  input_files = "mosart.h0"
  input_subdir = "archive/rof/hist"
  mapping_file = ""
  vars = "RIVER_DISCHARGE_OVER_LAND_LIQ"

[tc_analysis]
active = True
bundle = "bundle3" # Let bundle1 finish first because "e3sm_diags: atm_monthly_180x360_aave_mvm" requires "ts: atm_monthly_180x360_aave"
scratch = "/lcrc/globalscratch/ac.forsyth2/"
years = "1850:1852:2",

[e3sm_diags]
active = True
grid = '180x360_aave'
ref_final_yr = 2014
ref_start_yr = 1985
sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere", "tc_analysis",
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave ]]
  bundle = "bundle1"
  climo_diurnal_frequency = "diurnal_8xdaily"
  climo_diurnal_subsection = "atm_monthly_diurnal_8xdaily_180x360_aave"
  sets = "polar","enso_diags","diurnal_cycle",

  [[ atm_monthly_180x360_aave_mvm ]]
  # Test model-vs-model using the same files as the reference
  bundle = "bundle3"
  climo_subsection = "atm_monthly_180x360_aave"
  diff_title = "Difference"
  ref_final_yr = 1851
  ref_name = "v2.LR.historical_0201"
  ref_start_yr = 1850
  ref_years = "1850-1851",
  reference_data_path = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/atm/180x360_aave/clim"
  run_type = "model_vs_model"
  sets = "polar","enso_diags","streamflow","tc_analysis",
  short_ref_name = "v2.LR.historical_0201"
  swap_test_ref = False
  tag = "model_vs_model"
  ts_num_years_ref = 2
  ts_subsection = "atm_monthly_180x360_aave"

[mpas_analysis]
active = False

[global_time_series]
active = True
atmosphere_only = True
bundle = "bundle2"
experiment_name = "v2.LR.historical_0201"
figstr = "v2_historical_0201"
ts_num_years = 5
walltime = "00:30:00" # bundle2 should take walltime from "ts: atm_monthly_glb", i.e., "02:00:00"
years = "1850-1860",

[ilamb]
active = True
# No bundle, let bundle1 finish first because "ilamb" requires "ts: atm_monthly_180x360_aave"
grid = '180x360_aave'
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1852:2",

What jobs are failing?

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
bundle1.status:ERROR
ts_land_monthly_1850-1851-0002.status:ERROR (5)

What stack trace are you encountering?

No response

forsyth2 commented 7 months ago

Strangely, this error also occurs on Chrysalis with Unified 1.9.2rc3, which I know I tested successfully. Therefore, something must have changed that affects past versions too.

I will try on Perlmutter too.

forsyth2 commented 7 months ago

@chengzhuzhang Interestingly, I don't actually see this error on Perlmutter. Since Perlmutter is the primary machine people use bundles on, I suppose we can mark this lower priority.

xylar commented 7 months ago

@forsyth2 and @chengzhuzhang, I've been keeping an eye on this. Is this something you expect to have diagnosed and fixed soon? @wlin7 found another bug in MPAS-Analysis, https://github.com/MPAS-Dev/MPAS-Analysis/issues/981, that will require another bug-fix release of E3SM-Unified. I could include a fix for this if need be.

forsyth2 commented 7 months ago

@xylar I think we decided it's lower priority. I'm not quite sure what would cause this issue.

xylar commented 7 months ago

Okay, just wanted to check.

chengzhuzhang commented 7 months ago

@forsyth2 I think there are two issues (https://github.com/E3SM-Project/zppy/discussions/544 and https://github.com/E3SM-Project/zppy/issues/546) that we should look into, to figure out whether they are user errors or need a fix in zppy.

forsyth2 commented 6 months ago

> that will require another bug-fix release of E3SM-Unified

@xylar Do you have a timeline/expected deadline for this?

For reference, our prioritized list for zppy:

xylar commented 6 months ago

I have already tested 1.9.3rc1. It fixed the MPAS-Analysis issue it was meant to fix.

I would be willing to wait until early next week and then make a second and hopefully final rc but I don't want a process that snowballs and takes 2 months like 1.9.2 did (which was partly because of the holidays).

forsyth2 commented 6 months ago

@chengzhuzhang How urgent are we deeming the above issues? I think it's unlikely they could all be fixed by next week.

chengzhuzhang commented 6 months ago

@forsyth2 https://github.com/E3SM-Project/zppy/pull/424 (I need to run the test suite and make fixes) and https://github.com/E3SM-Project/zppy/pull/548 are ready to review. Please help review and integrate.

Have you had a chance to look at the other two (#544 and #546)? If not I will try to look into both and see if quick fixes are possible.

And I don't think https://github.com/E3SM-Project/zppy/issues/543 is a priority.

chengzhuzhang commented 6 months ago

@xylar thanks for the heads-up. e3sm_diags will have a new release as well. I will work with @tomvothecoder to have the release candidate ready by this week.

forsyth2 commented 6 months ago

> #424 (I need to run the test suite and make fixes) and #548 are ready to review. Please help review and integrate.

I will test and code-review those tomorrow morning; I'm out of office this afternoon.

> Have you had a chance to look at the other two (#544 and #546)? If not I will try to look into both and see if quick fixes are possible.

Not yet. I will try to take a look at those tomorrow too.

> And I don't think #543 is a priority.

Sounds good.

xylar commented 6 months ago

Okay, I'll expect zppy and e3sm_diags RCs by sometime next week and I can make an E3SM-Unified rc2 after that.

forsyth2 commented 6 months ago

In testing #424, I got the bundles test passing. I think the failure was a combination of two issues: 1) the "cannot stat" error happens on two land variables, so I removed those; 2) at some point I accidentally updated the expected bundles files to a non-merged PR's output, so I updated the expected files.
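
For anyone hitting the same "cannot stat" message: `mv` fails that way when its glob matches nothing, that is, when no CMORized `.nc` file was written under the temporary directory. The Python sketch below reproduces only that condition. It is not zppy's or e3sm_to_cmip's code, the name `cmorized_files` is hypothetical, and the glob is copied from the failing `mv` line in the log.

```python
from pathlib import Path

# Relative pattern copied from the failing `mv` command in the bundle log.
CMIP_GLOB = "CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc"


def cmorized_files(tmp_dir):
    """Return the sorted .nc files the `mv` step would try to move.

    An empty list is the situation in which `mv` reports "cannot stat".
    """
    return sorted(Path(tmp_dir).glob(CMIP_GLOB))
```

Running it on the `tmp_ts_land_monthly_1850-1851-0002` directory before the `mv` step would show whether any output exists at all. It does not say which variables failed; that still has to be read from the e3sm_to_cmip log.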

In any case, since bundles is passing, I'm closing this.