E3SM-Project / zppy

E3SM post-processing toolchain
BSD 3-Clause "New" or "Revised" License

zppy errors in E3SM Unified 1.9.2rc2 #538

Closed · forsyth2 closed this 7 months ago

forsyth2 commented 8 months ago

Request criteria

Issue description

Testing zppy on Chrysalis, using E3SM Unified 1.9.2rc2, I run into the following errors on the complete_run run. The errors appear to be similar on Perlmutter. Please note that there is not a new zppy release for E3SM Unified 1.9.2. That is, these errors are occurring on a zppy version that was previously tested (for E3SM Unified 1.9.1).

Error 1

e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_*:

This is the job that makes sure the environment_commands parameter is working properly.

1a

tests/integration/utils.py had "diags_environment_commands": "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20231221", meaning it ran using a conda dev environment built off the latest main of E3SM Diags (https://github.com/E3SM-Project/e3sm_diags/commit/9e14ff81a658208631cda54516b4af67ae7541f3)

===== RUN E3SM DIAGS =====

2023-12-21 14:25:12,848 [ERROR]: run.py(run_diags:90) >> Error traceback:
Traceback (most recent call last):
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\
iags/run.py", line 88, in run_diags
    params_results = main(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\
iags/e3sm_diags_driver.py", line 363, in main
    os.makedirs(parameters[0].results_dir, 0o755)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/os.py", line 225, in\
 makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.histori\
cal_0201/post/scripts/tmp.447022.wCBq/e3sm.py", line 53, in <module>
    runner.run_diags(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\
iags/run.py", line 92, in run_diags
    move_log_to_prov_dir(params_results[0].results_dir)
UnboundLocalError: local variable 'params_results' referenced before assignment
[WARNING] yaksa: 10 leaked handle pool objects
srun: error: chr-0496: task 0: Exited with exit code 1

Since this case technically tests an unreleased version of E3SM Diags, I suppose this is fine to ignore for now.

1b

I then changed tests/integration/utils.py to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is, "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh", which is the same environment_commands all the other jobs used. Looking at https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.2, it looks like that would be E3SM Diags v2.10.0 (https://github.com/E3SM-Project/e3sm_diags/commit/0b7f9c7f7581ceaed260c3baa5104f8e448ffd3f).

===== RUN E3SM DIAGS =====

2023-12-21 20:06:05,915 [ERROR]: run.py(run_diags:37) >> Error traceback:
Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/run.py", line 35, in run_diags
    main(final_params)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/e3sm_diags_driver.py", line 362, in main
    os.makedirs(parameters[0].results_dir, 0o755)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts/tmp.447067.MErK/e3sm.py", line 53, in <module>
    runner.run_diags(params)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/run.py", line 38, in run_diags
    move_log_to_prov_dir(final_params[0].results_dir)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/e3sm_diags/logger.py", line 104, in move_log_to_prov_dir
    shutil.copy(LOG_FILENAME, provenance_dir)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/shutil.py", line 417, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/shutil.py", line 256, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/prov/e3sm_diags_run.log'
srun: error: chr-0493: task 0: Exited with exit code 1

Since this case tests the E3SM Diags version that is included in the upcoming Unified release, we should address this error.

1c

I then changed tests/integration/utils.py to use the E3SM Diags version from the latest official release of E3SM Unified. That is, "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/load_latest_e3sm_unified_chrysalis.sh", which loads the latest official E3SM Unified release (1.9.1). Looking at https://acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.1, it looks like that would be E3SM Diags v2.9.0 (https://github.com/E3SM-Project/e3sm_diags/commit/a2d00eb114cf505a4ecf4dd429bb8710778cc087).

This works fine. That is expected since zppy had previously tested this version of E3SM Diags when we did the release for E3SM Unified 1.9.1.

Potential sources of the bugs

1c -> 1b bug:

1b bug -> 1a bug:

Error 2

ilamb_*:

Traceback (most recent call last):
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/bin/ilamb-run", line 993, in <module>
    S = Scoreboard(
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 508, in __init__
    TraversePreorder(self.tree, _initConfrontation)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 124, in TraversePreorder
    TraversePreorder(child, visit)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 124, in TraversePreorder
    TraversePreorder(child, visit)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 124, in TraversePreorder
    TraversePreorder(child, visit)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 122, in TraversePreorder
    visit(node)
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Scoreboard.py", line 470, in _initConfrontation
    node.confrontation = Constructor(**(node.__dict__))
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/ConfTWSA.py", line 37, in __init__
    self.basins = r.addRegionNetCDF4(
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2rc2_chrysalis/lib/python3.10/site-packages/ILAMB/Regions.py", line 139, in addRegionNetCDF4
    dset = Dataset(filename)
  File "src/netCDF4/_netCDF4.pyx", line 2464, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 2027, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: '/lcrc/group/e3sm/diagnostics/ilamb_data/DATA/mrro/Dai/basins_0.5x0.5.nc'
srun: error: chr-0496: task 0: Exited with exit code 1

I'm not sure if this is a bug with ILAMB itself or if there is simply a missing dataset (e.g., a dataset was deleted, or the new version of ILAMB requires a dataset that I'm not pointing to).

tomvothecoder commented 8 months ago

Thanks for the clear description. I've crossed out the commits that I know wouldn't affect e3sm_diags between v2.9.0 and v2.10.0.

Can you provide me the standalone command for the e3sm_diags task? I will try stepping through the code on v2.10.0 up to where it breaks.

forsyth2 commented 8 months ago

Can you provide me the standalone command for the e3sm_diags task?

Hmm I'm trying to figure out exactly how to condense that down. There's a lot of auto-generation (and NCO dependencies) that come first.

The relevant parts of the cfg (excluding any climo/ts dependencies) would be:

Excerpt of tests/integration/generated/test_bundles_chrysalis.cfg

```
[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh"
input = "/lcrc/group/e3sm/ac.forsyth2//E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_complete_run.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201"
partition = "debug"
qos = "regular"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/unified_1.9.2rc2"

[e3sm_diags]
active = True
grid = '180x360_aave'
ref_final_yr = 2014
ref_start_yr = 1985
# TODO: this directory is missing OMI-MLS
sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere", "tc_analysis",
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
walltime = "00:30:00"
years = "1850:1854:2", "1850:1854:4",

[[ atm_monthly_180x360_aave_environment_commands ]]
environment_commands = "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh"
sets = "qbo",
ts_subsection = "atm_monthly_180x360_aave"
```

So, that ends up generating a bash file, like this:

e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.bash

```
#!/bin/bash
# Running on chrysalis

#SBATCH --job-name=e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851
#SBATCH --account=e3sm
#SBATCH --nodes=1
#SBATCH --output=/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts/e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.o%j
#SBATCH --exclusive
#SBATCH --time=00:30:00
#SBATCH --partition=debug

source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh

# Turn on debug output if needed
debug=False
if [[ "${debug,,}" == "true" ]]; then
  set -x
fi

# Make sure UVCDAT doesn't prompt us about anonymous logging
export UVCDAT_ANONYMOUS_LOG=False

# Script dir
cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts

# Get jobid
id=${SLURM_JOBID}

# Update status file
STARTTIME=$(date +%s)
echo "RUNNING ${id}" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status

# Basic definitions
case="v2.LR.historical_0201"
short="v2.LR.historical_0201"
www="/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_complete_run_www/unified_1.9.2rc2"
y1=1850
y2=1851
Y1="1850"
Y2="1851"
run_type="model_vs_obs"
tag="model_vs_obs"

results_dir=${tag}_${Y1}-${Y2}

# Create temporary workdir
workdir=`mktemp -d tmp.${id}.XXXX`
cd ${workdir}

create_links_climo()
{
  climo_dir_source=$1
  climo_dir_destination=$2
  nc_prefix=$3
  begin_year=$4
  end_year=$5
  error_num=$6
  mkdir -p ${climo_dir_destination}
  cd ${climo_dir_destination}
  cp -s ${climo_dir_source}/${nc_prefix}_*_${begin_year}??_${end_year}??_climo.nc .
  if [ $? != 0 ]; then
    cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
    echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
    exit ${error_num}
  fi
  cd ..
}

create_links_climo_diurnal()
{
  climo_diurnal_dir_source=$1
  climo_diurnal_dir_destination=$2
  nc_prefix=$3
  begin_year=$4
  end_year=$5
  error_num=$6
  mkdir -p ${climo_diurnal_dir_destination}
  cd ${climo_diurnal_dir_destination}
  cp -s ${climo_diurnal_dir_source}/${nc_prefix}._*_${begin_year}??_${end_year}??_climo.nc .
  if [ $? != 0 ]; then
    cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
    echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
    exit ${error_num}
  fi
  cd ..
}

create_links_ts()
{
  ts_dir_source=$1
  ts_dir_destination=$2
  begin_year=$3
  end_year=$4
  error_num=$5
  # Create xml files for time series variables
  mkdir -p ${ts_dir_destination}
  cd ${ts_dir_destination}
  # https://stackoverflow.com/questions/27702452/loop-through-a-comma-separated-shell-variable
  variables="FSNTOA,FLUT,FSNT,FLNT,FSNS,FLNS,SHFLX,QFLX,TAUX,TAUY,PRECC,PRECL,PRECSC,PRECSL,TS,TREFHT,CLDTOT,CLDHGH,CLDMED,CLDLOW,U"
  for v in ${variables//,/ }
  do
    # Go through the time series files for between year1 and year2, using a step size equal to the number of years per time series file
    for year in `seq ${begin_year} 2 ${end_year}`; do
      YYYY=`printf "%04d" ${year}`
      for file in ${ts_dir_source}/${v}_${YYYY}*.nc
      do
        # Add this time series file to the list of files for cdscan to use
        echo ${file} >> ${v}_files.txt
      done
    done
    # xml file will cover the whole period from year1 to year2
    xml_name=${v}_${begin_year}01_${end_year}12.xml
    export CDMS_NO_MPI=true
    cdscan -x ${xml_name} -f ${v}_files.txt
    if [ $? != 0 ]; then
      cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
      echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
      exit ${error_num}
    fi
  done
  cd ..
}

create_links_ts_rof()
{
  ts_rof_dir_source=$1
  ts_rof_dir_destination=$2
  begin_year=$3
  end_year=$4
  error_num=$5
  mkdir -p ${ts_rof_dir_destination}
  cd ${ts_rof_dir_destination}
  v="RIVER_DISCHARGE_OVER_LAND_LIQ"
  xml_name=${v}_${begin_year}01_${end_year}12.xml
  cdscan -x ${xml_name} ${ts_rof_dir_source}/${v}_*.nc
  if [ $? != 0 ]; then
    cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
    echo "ERROR (${error_num})" > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
    exit ${error_num}
  fi
  cd ..
}

ts_dir_primary=ts

# Create xml files for time series variables
ts_dir_source=/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/atm/180x360_aave/ts/monthly/2yr
create_links_ts ${ts_dir_source} ${ts_dir_primary} ${Y1} ${Y2} 5

ref_name=

# Run E3SM Diags
echo
echo ===== RUN E3SM DIAGS =====
echo

# Prepare configuration file
cat > e3sm.py << EOF
import os
import numpy

from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.parameter.qbo_parameter import QboParameter
from e3sm_diags.run import runner

short_name = '${short}'
test_ts = '${ts_dir_primary}'
start_yr = int('${Y1}')
end_yr = int('${Y2}')
num_years = end_yr - start_yr + 1
ref_start_yr = 1985

param = CoreParameter()

# Model
param.test_name = '${case}'
param.short_test_name = short_name

# Output dir
param.results_dir = '${results_dir}'

# Additional settings
param.run_type = 'model_vs_obs'
param.diff_title = 'Model - Observations'
param.output_format = ['png']
param.output_format_subplot = []
param.multiprocessing = True
param.num_workers = 24
#param.fail_on_incomplete = True
params = [param]

qbo_param = QboParameter()
qbo_param.test_data_path = test_ts
qbo_param.test_name = short_name
qbo_param.test_start_yr = start_yr
qbo_param.test_end_yr = end_yr
qbo_param.ref_start_yr = ref_start_yr
ref_end_yr = ref_start_yr + num_years - 1
if (ref_end_yr <= 2014):
  qbo_param.ref_end_yr = ref_end_yr
else:
  qbo_param.ref_end_yr = 2014

# Obs
qbo_param.reference_data_path = '/lcrc/group/e3sm/diagnostics/observations/Atm/time-series/'
params.append(qbo_param)

# Run
runner.sets_to_run = ['qbo']
runner.run_diags(params)
EOF

# Handle cases when cfg file is explicitly provided
command="srun -n 1 python -u e3sm.py"

# Run diagnostics
time ${command}
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
  echo 'ERROR (9)' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
  exit 9
fi

# Copy output to web server
echo
echo ===== COPY FILES TO WEB SERVER =====
echo

# Create top-level directory
web_dir=${www}/${case}/e3sm_diags/atm_monthly_180x360_aave_environment_commands
mkdir -p ${web_dir}
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
  echo 'ERROR (10)' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
  exit 10
fi

# Copy files
rsync -a --delete ${results_dir} ${web_dir}/
if [ $? != 0 ]; then
  cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.historical_0201/post/scripts
  echo 'ERROR (11)' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
  exit 11
fi

# For LCRC, change permissions of new files
pushd ${web_dir}/
chmod -R go+rX,go-w ${results_dir}
popd

# Delete temporary workdir
cd ..
if [[ "${debug,,}" != "true" ]]; then
  rm -rf ${workdir}
fi

# Update status file and exit
ENDTIME=$(date +%s)
ELAPSEDTIME=$(($ENDTIME - $STARTTIME))
echo ==============================================
echo "Elapsed time: $ELAPSEDTIME seconds"
echo ==============================================
rm -f e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
echo 'OK' > e3sm_diags_atm_monthly_180x360_aave_environment_commands_model_vs_obs_1850-1851.status
```

So, the command is really srun -n 1 python -u e3sm.py, but there's a whole auto-generated e3sm.py.

forsyth2 commented 8 months ago

Ah, ok, I think this would be the e3sm.py to try out. You still might need to run NCO to pre-process the data (i.e., what the ts task dependency does) though.

e3sm.py

```
import os
import numpy

from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.parameter.qbo_parameter import QboParameter
from e3sm_diags.run import runner

short_name = 'v2.LR.historical_0201'
test_ts = 'ts'
start_yr = int('1850')
end_yr = int('1851')
num_years = end_yr - start_yr + 1
ref_start_yr = 1985

param = CoreParameter()

# Model
param.test_name = 'v2.LR.historical_0201'
param.short_test_name = short_name

# Output dir
param.results_dir = 'model_vs_obs_1850-1851'

# Additional settings
param.run_type = 'model_vs_obs'
param.diff_title = 'Model - Observations'
param.output_format = ['png']
param.output_format_subplot = []
param.multiprocessing = True
param.num_workers = 24
#param.fail_on_incomplete = True
params = [param]

qbo_param = QboParameter()
qbo_param.test_data_path = test_ts
qbo_param.test_name = short_name
qbo_param.test_start_yr = start_yr
qbo_param.test_end_yr = end_yr
qbo_param.ref_start_yr = ref_start_yr
ref_end_yr = ref_start_yr + num_years - 1
if (ref_end_yr <= 2014):
  qbo_param.ref_end_yr = ref_end_yr
else:
  qbo_param.ref_end_yr = 2014

# Obs
qbo_param.reference_data_path = '/lcrc/group/e3sm/diagnostics/observations/Atm/time-series/'
params.append(qbo_param)

# Run
runner.sets_to_run = ['qbo']
runner.run_diags(params)
```

forsyth2 commented 8 months ago

The other tests for zppy are fine. (Except for the bundles run, which also runs into the ILAMB error. That test doesn't check the environment_commands setting, so it wouldn't catch the other error).

forsyth2 commented 8 months ago

Re: the ILAMB error, I ran with environment_commands set to use the E3SM Unified 1.9.1 version of ILAMB, and it worked fine. So, something changed in ILAMB between versions 2.6 and 2.7.

(Note that my comment on https://github.com/E3SM-Project/zppy/issues/523#issuecomment-1847672126 is about checking whether ILAMB 2.7 expands the zppy output and/or allows a simpler cfg. I can't really check that until I have 2.7 working in the first place.)

tomvothecoder commented 8 months ago

1b

I then changed tests/integration/utils.py to use the version of E3SM Diags that was used in the other E3SM Diags jobs for this run. That is, "diags_environment_commands": "source /lcrc/soft/climate/e3sm-unified/test_e3sm_unified_1.9.2rc2_chrysalis.sh", which is the same environment_commands all the other jobs used. Looking at acme-climate.atlassian.net/wiki/spaces/DOC/pages/129732419/Packages+in+the+E3SM+Unified+conda+environment#e3sm-unified-1.9.2, that looks like that would be E3SM Diags v2.10.0 (E3SM-Project/e3sm_diags@0b7f9c7).

Okay I figured out the root cause of the e3sm_diags results_dir issue.

Root Cause

This commit E3SM-Project/e3sm_diags@48426aa (#755) changes `if type(p) == cls_type` to `if isinstance(p, cls_type)`. Apparently, these two conditionals are not equivalent (source) and can return different boolean values. I had changed `if type(p) == cls_type` because it is considered bad practice (flake8 E721).

As a result, in v2.10.0, the `results_dir` config is not being copied from the first parameter (`param`) to the second parameter (`qbo_param`). This causes `results_dir` to be blank, which cascades to the FileNotFoundError: [Errno 2] No such file or directory: '' and FileNotFoundError: [Errno 2] No such file or directory: '/prov/e3sm_diags_run.log'.
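
For reference, the first error above is exactly what `os.makedirs` raises when handed an empty path, so a blank `results_dir` reproduces it directly (a minimal sketch, independent of e3sm_diags):

```
import os

# With results_dir blank, the driver effectively calls:
os.makedirs("", 0o755)
# FileNotFoundError: [Errno 2] No such file or directory: ''
```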

The Fix

The fix is to change `if isinstance(p, cls_type)` to `if type(p) is cls_type`. I will open a separate PR for this and get a new e3sm_diags RC release out.
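
For illustration, a toy sketch (stand-in classes mirroring the parent/child relationship, not the actual e3sm_diags code) of how the three checks differ for a subclass instance like `qbo_param`:

```
class CoreParameter:
    pass

class QboParameter(CoreParameter):
    pass

p = QboParameter()

# isinstance() also matches subclasses:
print(isinstance(p, CoreParameter))  # True
# type() comparisons only match the exact class:
print(type(p) == CoreParameter)      # False (flagged by flake8 E721)
print(type(p) is CoreParameter)      # False (same result, E721-compliant)
```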

Other Thoughts

This brings up my idea again about testing these tools together, outside of E3SM Unified. We should consider more frequent releases and periodic testing before E3SM Unified releases. That would really cut down on bugs appearing at the last second in E3SM Unified releases. It's not a good idea to try to rush out new package releases for "emergency" E3SM Unified releases, especially if the packages have a lot of changes.

Also, the way e3sm_diags copies parameter attributes around from different sources (parameter objects, parsers, cfg, core_parameter, etc.) is really convoluted and seems unnecessarily complex. I wish it weren't so fragile and hard to work with.

tomvothecoder commented 8 months ago

1a

tests/integration/utils.py had "diags_environment_commands": "source /home/ac.forsyth2/miniconda3/etc/profile.d/conda.sh; conda activate e3sm_diags_20231221", meaning it ran using a conda dev environment built off the latest main of E3SM Diags (E3SM-Project/e3sm_diags@9e14ff8)

===== RUN E3SM DIAGS =====

2023-12-21 14:25:12,848 [ERROR]: run.py(run_diags:90) >> Error traceback:
Traceback (most recent call last):
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\
iags/run.py", line 88, in run_diags
    params_results = main(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\
iags/e3sm_diags_driver.py", line 363, in main
    os.makedirs(parameters[0].results_dir, 0o755)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/os.py", line 225, in\
 makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''
Traceback (most recent call last):
  File "/lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/unified_1.9.2rc2/v2.LR.histori\
cal_0201/post/scripts/tmp.447022.wCBq/e3sm.py", line 53, in <module>
    runner.run_diags(params)
  File "/home/ac.forsyth2/miniconda3/envs/e3sm_diags_20231221/lib/python3.10/site-packages/e3sm_d\
iags/run.py", line 92, in run_diags
    move_log_to_prov_dir(params_results[0].results_dir)
UnboundLocalError: local variable 'params_results' referenced before assignment
[WARNING] yaksa: 10 leaked handle pool objects
srun: error: chr-0496: task 0: Exited with exit code 1

Since this case technically tests an unreleased version of E3SM Diags, I suppose this is fine to ignore for now.

Just an FYI that this run fails in v2.9.0 too, but the log file saves, which makes it seem like it was working. UPDATE: Actually, I didn't use NCO, so it might be failing when I use the test e3sm.py script as a result.

On main, it fails but the log file does not save (UnboundLocalError: local variable 'params_results' referenced before assignment)

v2.9.0 output -- log file saves


2023-12-22 17:10:18,864 [INFO]: run.py(_add_parent_attrs_to_children:152) >> ['diff_title', 'short_test_name', 'num_workers', 'results_dir']
2023-12-22 17:10:25,355 [INFO]: e3sm_diags_driver.py(_save_env_yml:59) >> Saved environment yml file to: model_vs_obs_1850-1851/prov/environment.yml
2023-12-22 17:10:25,356 [INFO]: e3sm_diags_driver.py(_save_parameter_files:70) >> Saved command used to: model_vs_obs_1850-1851/prov/cmd_used.txt
2023-12-22 17:10:25,358 [INFO]: e3sm_diags_driver.py(_save_python_script:134) >> Saved Python script to: model_vs_obs_1850-1851/prov/ipykernel_launcher.py
2023-12-22 17:10:26,954 [ERROR]: e3sm_diags_driver.py(run_diag:296) >> Error in e3sm_diags.driver.qbo_driver
Traceback (most recent call last):
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/e3sm_diags_driver.py", line 293, in run_diag
    single_result = module.run_diag(parameter)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/qbo_driver.py", line 173, in run_diag
    test_var = test_data.get_timeseries_variable(variable)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 98, in get_timeseries_variable
    variables = self._get_timeseries_var(data_path, *args, **kwargs)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 469, in _get_timeseries_var
    vars_to_func_dict = self._get_first_valid_vars_timeseries(
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/driver/utils/dataset.py", line 560, in _get_first_valid_vars_timeseries
    raise RuntimeError(msg)
RuntimeError: Neither does U nor the variables in [('ua',), ('U',)] have valid files in ts.
2023-12-22 17:10:26,959 [WARNING]: e3sm_diags_driver.py(main:426) >> There was not a single valid diagnostics run, no viewer created.
2023-12-22 17:10:26,960 [ERROR]: run.py(run_diags:37) >> Error traceback:
Traceback (most recent call last):
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/run.py", line 35, in run_diags
    main(final_params)
  File "/gpfs/fs1/home/ac.tvo/E3SM-Project/e3sm_diags_29/e3sm_diags/e3sm_diags_driver.py", line 445, in main
    if parameters_results[0].fail_on_incomplete and (
IndexError: list index out of range
2023-12-22 17:10:26,965 [INFO]: logger.py(move_log_to_prov_dir:106) >> Log file saved in model_vs_obs_1850-1851/prov/e3sm_diags_run.log
forsyth2 commented 8 months ago

As a result, in v2.10.0, the results_dir config is not being copied from the first parameter (param) to the second parameter (qbo_param).

Wow, that is a very insidious bug. So, it's just returning p right away because qbo_param is-a param? But we really want it to return p when we get to the exact type?

I will open a separate PR for this and get a new e3sm_diags RC release out.

Awesome, thanks!!

testing these tools together, outside of E3SM Unified.

I absolutely agree. I should update the testing process as follows:

It's not a good idea to try to rush out new package releases for "emergency" E3SM Unified releases, especially if the packages have a lot of changes.

That's true. https://nvie.com/posts/a-successful-git-branching-model/ suggests merging patches into both the user-facing release branch AND the latest development branch. That is, only bug fixes should be merged into user-facing code between non-patch releases; under this particular workflow ideal, we shouldn't be cutting whole new releases of packages.

the way e3sm_diags copies parameter attributes around from different sources (parameter objects, parsers, cfg, core_parameter, etc.) is really convoluted and seems unnecessarily complex.

I agree it's very convoluted. I haven't studied it enough to know if there's a simpler way to accomplish the same thing. zppy has similar issues of parameter convolution (i.e., parameters are defined in different hierarchical sections of the cfg, and some parameters are introduced internally via the <task>.py and <task>.bash templates; the .settings file at least shows the final values for each parameter a job uses).

forsyth2 commented 8 months ago

I want to make sure I'm clear on how param_results is used. It sounds similar to param.fail_on_incomplete.

The latter, when set to True, will fail E3SM Diags if any set didn't complete. Otherwise, it will succeed as long as something makes it into the viewer.

It sounds like the former does something similar, but is not a parameter passed in by a user. It will have a value if all sets completed and otherwise will not. Is that right?

tomvothecoder commented 8 months ago

Wow, that is a very insidious bug. So, it's just returning p right away because qbo_param is-a param? But we really want it to return p when we get to the exact type?

Yeah, we only want to return p if it is the exact same type and not a sub-class/sub-type.

https://github.com/conda-forge/e3sm_diags-feedstock/commit/79944cdcbe340cfe4af74cb7c44d82dc35d0f16d

I want to make sure I'm clear on how param_results is used. It sounds similar to param.fail_on_incomplete.

The latter, when set to True, will fail E3SM Diags if any set didn't complete. Otherwise, it will succeed as long as something makes it into the viewer.

It sounds like the former does something similar, but is not a parameter passed in by a user. It will have a value if all sets completed and otherwise will not. Is that right?

params_results is different. It's just a variable containing params but AFTER successful diagnostic runs. If diagnostic runs fail, params_results is never set which causes UnboundLocalError: local variable 'params_results' referenced before assignment. This causes e3sm_diags to crash. param.fail_on_incomplete sounds like it will stop e3sm_diags on the first instance of a failure.

Notice in the code below that params_results has no default value and does not get assigned if the try: statement failed.

        params = self.get_run_parameters(parameters, use_cfg)

        if params is None or len(params) == 0:
            raise RuntimeError(
                "No parameters we able to be extracted. Please "
                "check the parameters you defined."
            )

        try:
            # params_results is only assigned if main(params) succeeds.
            params_results = main(params)
        except Exception:
            # The exception is logged but not re-raised, so execution continues.
            logger.exception("Error traceback:", exc_info=True)

        # If main(params) raised, params_results was never assigned, so this line
        # raises UnboundLocalError.
        move_log_to_prov_dir(params_results[0].results_dir)

        return params_results

I updated this logic in https://github.com/E3SM-Project/e3sm_diags/pull/770/files so that params_results has a default value (None) and move_log_to_prov_dir() uses params (not params_results because it can be None with failed runs).
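
Roughly, the updated logic looks like this (a sketch paraphrasing the description above, not the exact diff from PR #770):

```
params_results = None  # default value, so the name is always bound

try:
    params_results = main(params)
except Exception:
    logger.exception("Error traceback:", exc_info=True)

# Use params (always defined) rather than params_results, which stays None
# when the diagnostic runs fail.
move_log_to_prov_dir(params[0].results_dir)

return params_results
```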

tomvothecoder commented 8 months ago

e3sm_diags v2.10.1rc1 is now released with these fixes: https://github.com/conda-forge/e3sm_diags-feedstock/commit/79944cdcbe340cfe4af74cb7c44d82dc35d0f16d

forsyth2 commented 8 months ago

Great, thanks @tomvothecoder!

forsyth2 commented 8 months ago

Re: Error 2, I made https://github.com/rubisco-sfa/ILAMB/issues/85.

xylar commented 8 months ago

@forsyth2 and @chengzhuzhang, it seems like we probably need to run ilamb-fetch on Chrysalis (or Anvil) in the appropriate directory, probably:

/lcrc/group/e3sm/diagnostics/ilamb_data

See: https://www.ilamb.org/doc/ilamb_fetch.html

forsyth2 commented 8 months ago

Re: Error 2, the issue does appear to be from not running ilamb-fetch. See https://github.com/E3SM-Project/zppy/discussions/541