NOAA-EMC / GDASApp

Global Data Assimilation System Application
GNU Lesser General Public License v2.1

updates to build and run some ctests on WCOSS2 #1122

RussTreadon-NOAA closed 4 months ago

RussTreadon-NOAA commented 4 months ago

This PR includes changes which

Resolves #1111

RussTreadon-NOAA commented 4 months ago

Run ctests on Cactus with the following results

Test project /lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/build
      Start 1337: test_gdasapp_util_coding_norms
 1/47 Test #1337: test_gdasapp_util_coding_norms ........................   Passed    3.49 sec
      Start 1338: test_gdasapp_util_ioda_example
 2/47 Test #1338: test_gdasapp_util_ioda_example ........................   Passed    0.25 sec
      Start 1339: test_gdasapp_util_prepdata
 3/47 Test #1339: test_gdasapp_util_prepdata ............................   Passed    0.81 sec
      Start 1340: test_gdasapp_util_rads2ioda
 4/47 Test #1340: test_gdasapp_util_rads2ioda ...........................   Passed    0.14 sec
      Start 1341: test_gdasapp_util_ghrsst2ioda
 5/47 Test #1341: test_gdasapp_util_ghrsst2ioda .........................   Passed    0.13 sec
      Start 1342: test_gdasapp_util_smap2ioda
 6/47 Test #1342: test_gdasapp_util_smap2ioda ...........................   Passed    0.12 sec
      Start 1343: test_gdasapp_util_smos2ioda
 7/47 Test #1343: test_gdasapp_util_smos2ioda ...........................   Passed    0.15 sec
      Start 1344: test_gdasapp_util_viirsaod2ioda
 8/47 Test #1344: test_gdasapp_util_viirsaod2ioda .......................   Passed    0.13 sec
      Start 1345: test_gdasapp_util_icecamsr2ioda
 9/47 Test #1345: test_gdasapp_util_icecamsr2ioda .......................   Passed    0.12 sec
      Start 1682: test_gdasapp_check_python_norms
10/47 Test #1682: test_gdasapp_check_python_norms .......................   Passed    5.83 sec
      Start 1683: test_gdasapp_check_yaml_keys
11/47 Test #1683: test_gdasapp_check_yaml_keys ..........................   Passed    0.24 sec
      Start 1684: test_gdasapp_jedi_increment_to_fv3
12/47 Test #1684: test_gdasapp_jedi_increment_to_fv3 ....................   Passed    0.68 sec
      Start 1685: test_gdasapp_setup_cycled_exp
13/47 Test #1685: test_gdasapp_setup_cycled_exp .........................   Passed    1.88 sec
      Start 1686: test_gdasapp_fv3jedi_fv3inc
Could not find executable srun
Looked in the following places:
srun
srun
Release/srun
Release/srun
Debug/srun
Debug/srun
MinSizeRel/srun
MinSizeRel/srun
RelWithDebInfo/srun
RelWithDebInfo/srun
Deployment/srun
Deployment/srun
Development/srun
Development/srun
Unable to find executable: srun
14/47 Test #1686: test_gdasapp_fv3jedi_fv3inc ...........................***Not Run   0.00 sec
      Start 1687: test_gdasapp_soca_nsst_increment_to_mom6
15/47 Test #1687: test_gdasapp_soca_nsst_increment_to_mom6 ..............***Failed    1.64 sec
      Start 1688: test_gdasapp_soca_prep
16/47 Test #1688: test_gdasapp_soca_prep ................................   Passed    3.11 sec
      Start 1689: test_gdasapp_soca_run_clean
17/47 Test #1689: test_gdasapp_soca_run_clean ...........................   Passed    0.02 sec
      Start 1690: test_gdasapp_soca_setup_obsprep
18/47 Test #1690: test_gdasapp_soca_setup_obsprep .......................   Passed   13.11 sec
      Start 1691: test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS
19/47 Test #1691: test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS ..............***Failed    1.58 sec
      Start 1692: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP
20/47 Test #1692: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP ....***Failed    0.20 sec
      Start 1693: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT
21/47 Test #1693: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT ....***Failed    0.21 sec
      Start 1694: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN
22/47 Test #1694: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN .....***Failed    0.23 sec
      Start 1695: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN
23/47 Test #1695: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN ....***Failed    0.21 sec
      Start 1696: test_gdasapp_soca_copy_scratch
24/47 Test #1696: test_gdasapp_soca_copy_scratch ........................***Failed    0.03 sec
      Start 1697: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT
25/47 Test #1697: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT ...***Failed    0.20 sec
      Start 1698: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST
26/47 Test #1698: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST ....***Failed    0.20 sec
      Start 1699: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY
27/47 Test #1699: test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY ....***Failed    0.24 sec
      Start 1700: test_gdasapp_soca_socahybridweights
28/47 Test #1700: test_gdasapp_soca_socahybridweights ...................***Failed    0.17 sec
      Start 1701: test_gdasapp_soca_incr_handler
29/47 Test #1701: test_gdasapp_soca_incr_handler ........................***Failed    0.17 sec
      Start 1702: test_gdasapp_soca_ens_handler
30/47 Test #1702: test_gdasapp_soca_ens_handler .........................***Failed    0.17 sec
      Start 1703: test_gdasapp_snow_create_ens
31/47 Test #1703: test_gdasapp_snow_create_ens ..........................   Passed    0.83 sec
      Start 1704: test_gdasapp_snow_imsproc
32/47 Test #1704: test_gdasapp_snow_imsproc .............................   Passed    3.05 sec
      Start 1705: test_gdasapp_snow_apply_jediincr
33/47 Test #1705: test_gdasapp_snow_apply_jediincr ......................***Failed    0.32 sec
      Start 1706: test_gdasapp_snow_letkfoi_snowda
34/47 Test #1706: test_gdasapp_snow_letkfoi_snowda ......................***Failed    0.58 sec
      Start 1707: test_gdasapp_convert_bufr_adpsfc_snow
35/47 Test #1707: test_gdasapp_convert_bufr_adpsfc_snow .................   Passed    2.50 sec
      Start 1711: test_gdasapp_convert_bufr_adpsfc
36/47 Test #1711: test_gdasapp_convert_bufr_adpsfc ......................   Passed    4.04 sec
      Start 1712: test_gdasapp_convert_gsi_satbias
37/47 Test #1712: test_gdasapp_convert_gsi_satbias ......................   Passed    2.71 sec
      Start 1713: test_gdasapp_setup_atm_cycled_exp
38/47 Test #1713: test_gdasapp_setup_atm_cycled_exp .....................   Passed    2.48 sec
      Start 1714: test_gdasapp_atm_jjob_var_init
39/47 Test #1714: test_gdasapp_atm_jjob_var_init ........................   Passed   31.94 sec
      Start 1715: test_gdasapp_atm_jjob_var_run
40/47 Test #1715: test_gdasapp_atm_jjob_var_run .........................***Failed    6.43 sec
      Start 1716: test_gdasapp_atm_jjob_var_inc
41/47 Test #1716: test_gdasapp_atm_jjob_var_inc .........................***Failed    9.48 sec
      Start 1717: test_gdasapp_atm_jjob_var_final
42/47 Test #1717: test_gdasapp_atm_jjob_var_final .......................***Failed    6.11 sec
      Start 1718: test_gdasapp_atm_jjob_ens_init
43/47 Test #1718: test_gdasapp_atm_jjob_ens_init ........................   Passed   27.69 sec
      Start 1719: test_gdasapp_atm_jjob_ens_run
44/47 Test #1719: test_gdasapp_atm_jjob_ens_run .........................***Failed    0.06 sec
      Start 1720: test_gdasapp_atm_jjob_ens_inc
45/47 Test #1720: test_gdasapp_atm_jjob_ens_inc .........................***Failed    0.06 sec
      Start 1721: test_gdasapp_atm_jjob_ens_final
46/47 Test #1721: test_gdasapp_atm_jjob_ens_final .......................***Failed    8.86 sec
      Start 1722: test_gdasapp_aero_gen_3dvar_yaml
47/47 Test #1722: test_gdasapp_aero_gen_3dvar_yaml ......................***Failed    0.15 sec

51% tests passed, 23 tests failed out of 47

Label Time Summary:
gdas-utils    =   5.35 sec*proc (9 tests)
script        =   5.35 sec*proc (9 tests)

Total Test time (real) = 148.51 sec

The following tests FAILED:
        1686 - test_gdasapp_fv3jedi_fv3inc (Not Run)
        1687 - test_gdasapp_soca_nsst_increment_to_mom6 (Failed)
        1691 - test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS (Failed)
        1692 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
        1693 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
        1694 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1695 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN (Failed)
        1696 - test_gdasapp_soca_copy_scratch (Failed)
        1697 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1698 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1699 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
        1700 - test_gdasapp_soca_socahybridweights (Failed)
        1701 - test_gdasapp_soca_incr_handler (Failed)
        1702 - test_gdasapp_soca_ens_handler (Failed)
        1705 - test_gdasapp_snow_apply_jediincr (Failed)
        1706 - test_gdasapp_snow_letkfoi_snowda (Failed)
        1715 - test_gdasapp_atm_jjob_var_run (Failed)
        1716 - test_gdasapp_atm_jjob_var_inc (Failed)
        1717 - test_gdasapp_atm_jjob_var_final (Failed)
        1719 - test_gdasapp_atm_jjob_ens_run (Failed)
        1720 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1721 - test_gdasapp_atm_jjob_ens_final (Failed)
        1722 - test_gdasapp_aero_gen_3dvar_yaml (Failed)
Errors while running CTest

RussTreadon-NOAA commented 4 months ago

test_gdasapp_fv3jedi_fv3inc As indicated by the ctest output, this test fails because srun is hardwired in test/fv3jedi/CMakeLists.txt

add_test(NAME test_gdasapp_fv3jedi_fv3inc
         COMMAND srun -n6 ${CMAKE_BINARY_DIR}/bin/fv3jedi_fv3inc.x ${PROJECT_BINARY_DIR}/test/fv3jedi/testinput/gdasapp_fv3jedi_fv3inc.yaml
         WORKING_DIRECTORY ${PROJECT_BINARY_DIR}/test/fv3jedi)

WCOSS2 uses PBS, not SLURM.

test_gdasapp_soca_nsst_increment_to_mom6 Rerun this test with -VV. The test fails because

1687: Traceback (most recent call last):
1687:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/socaincr2mom6.py", line 8, in <module>
1687:     import ufsda
1687:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/ush/ufsda/__init__.py", line 2, in <module>
1687:     from .ufs_yaml import gen_yaml, parse_config
1687:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/ush/ufsda/ufs_yaml.py", line 3, in <module>
1687:     from wxflow import YAMLFile, TemplateConstants, Template
1687: ModuleNotFoundError: No module named 'wxflow'

Notice that hera.intel.lua includes

-- hack for wxflow
prepend_path("PYTHONPATH", "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src")

On Orion, pip list includes

wxflow             0.1.0

after loading GDAS/orion.intel.lua

The current wcoss2.intel.lua contains the hera.intel.lua wxflow hack. That path is on the Hera filesystem, so it won't work on WCOSS2. Do we need to install wxflow on Cactus or is it already available? If it is available, where is it? What do you think, @CoryMartin-NOAA?
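One way to approach the "where is it?" question is to probe a few candidate source trees before touching PYTHONPATH. The sketch below is hypothetical: the function name `find_wxflow_src` and the candidate directories in the usage comment are illustrative, not known locations on Cactus. It assumes only that a usable tree contains `wxflow/__init__.py`, as the Hera path ending in `src` suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GDASApp): print the first candidate
# directory that contains an importable wxflow package, i.e. a directory
# holding wxflow/__init__.py. Returns non-zero if none is found.
find_wxflow_src() {
    local dir
    for dir in "$@"; do
        if [ -f "${dir}/wxflow/__init__.py" ]; then
            printf '%s\n' "${dir}"
            return 0
        fi
    done
    return 1
}

# Example use. Both candidate paths and GDASAPP_DIR are assumptions.
# if src=$(find_wxflow_src "${HOMEgfs}/sorc/wxflow/src" "${GDASAPP_DIR}/sorc/wxflow/src"); then
#     export PYTHONPATH="${src}${PYTHONPATH:+:${PYTHONPATH}}"
# fi
```

Probing for the package directory rather than hardcoding one path would let the same logic serve both a copy cloned with global-workflow and a copy cloned with GDASApp.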

CoryMartin-NOAA commented 4 months ago

I think it gets cloned as part of the global workflow. Perhaps we can use that somehow?

RussTreadon-NOAA commented 4 months ago

test_gdasapp_snow_apply_jediincr, test_gdasapp_snow_letkfoi_snowda

Rerun these tests with -VV. Output indicates that both jobs fail on Cactus because srun is being executed. Script test/snow/apply_jedi_incr contains

# (n=6) -> this is fixed, at one task per tile (with minor code change, could run on a single proc).                                                        
srun '--export=ALL' -n 6 ${EXECDIR}/apply_incr.exe ${WORKDIR}/apply_incr.log

Script test/snow/letkfoi_snowda.sh contains

srun '--export=ALL' -n 6 ${EXECDIR}/${JEDI_EXEC} letkf_snow.yaml

These scripts need to be generalized to support launchers other than srun.
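A minimal sketch of what such a generalization might look like, assuming the launcher can be chosen from an override variable or by probing for srun. The function name `mpi_launch_prefix` and the variable `MPI_LAUNCHER` are hypothetical, and `mpiexec -n` as the WCOSS2 form is an assumption to verify against the site documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MPI launch prefix for a given task count.
# MPI_LAUNCHER is an assumed override variable, not an existing GDASApp setting.
mpi_launch_prefix() {
    local ntasks="$1"
    local launcher="${MPI_LAUNCHER:-}"
    if [ -z "${launcher}" ]; then
        # Prefer srun where SLURM provides it, otherwise fall back to mpiexec.
        if command -v srun >/dev/null 2>&1; then
            launcher=srun
        else
            launcher=mpiexec
        fi
    fi
    case "${launcher}" in
        srun)    printf 'srun --export=ALL -n %s\n' "${ntasks}" ;;
        mpiexec) printf 'mpiexec -n %s\n' "${ntasks}" ;;
        *)       printf 'unsupported launcher: %s\n' "${launcher}" >&2
                 return 1 ;;
    esac
}
```

With a helper like this, the launch line in test/snow/letkfoi_snowda.sh could be written as `$(mpi_launch_prefix 6) ${EXECDIR}/${JEDI_EXEC} letkf_snow.yaml`, leaving the task count fixed at six as the original comment requires.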

RussTreadon-NOAA commented 4 months ago

I think it gets cloned as part of the global workflow. Perhaps we can use that somehow?

We could try, but then we need to move this test inside the if (WORKFLOW_TESTS) block in test/soca/CMakeLists.txt.

CoryMartin-NOAA commented 4 months ago

Or we can clone wxflow with gdasapp and use relative paths?

RussTreadon-NOAA commented 4 months ago

test_gdasapp_atm_jjob_var & test_gdasapp_atm_jjob_ens The ATM var and ens suites of jobs fail because the submission scripts in test/atm/global-workflow do not submit the jobs to run via PBS. WCOSS2 execution winds up in the else block of each jjob_*.sh script. For example, jjob_var_run.sh contains

# Execute j-job                                                                                                                                             
if [[ $machine = 'HERA' ]]; then
    sbatch --ntasks=6 --account=$ACCOUNT --qos=batch --time=00:10:00 --export=ALL --wait ${HOMEgfs}/jobs/JGLOBAL_ATM_ANALYSIS_VARIATIONAL
elif [[ $machine = 'ORION' || $machine = 'HERCULES' ]]; then
    sbatch --ntasks=6 --account=$ACCOUNT --qos=batch --time=00:10:00 --export=ALL --wait ${HOMEgfs}/jobs/JGLOBAL_ATM_ANALYSIS_VARIATIONAL
else
    ${HOMEgfs}/jobs/JGLOBAL_ATM_ANALYSIS_VARIATIONAL
fi

An elif [[ $machine = 'WCOSS2' ]]; then block needs to be added to each script.
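As a rough sketch of the missing branch, the function below prints the submission command per machine instead of running it. The name `jjob_submit_cmd` and the `QUEUE` variable are hypothetical. The PBS resource request (`select=1:ncpus=6:mpiprocs=6`), the `dev` queue default, and the use of `-W block=true` as the counterpart of `sbatch --wait` are assumptions that would need checking against WCOSS2 policy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command used to submit a j-job on a machine.
# ACCOUNT is taken from the existing scripts; QUEUE and the PBS resource
# request are assumptions, not settings confirmed in this PR.
jjob_submit_cmd() {
    local machine="$1" jjob="$2"
    case "${machine}" in
        HERA|ORION|HERCULES)
            # Same SLURM submission the existing scripts use.
            printf 'sbatch --ntasks=6 --account=%s --qos=batch --time=00:10:00 --export=ALL --wait %s\n' \
                "${ACCOUNT}" "${jjob}" ;;
        WCOSS2)
            # PBS: -V exports the environment, -W block=true waits for completion.
            printf 'qsub -V -W block=true -A %s -q %s -l select=1:ncpus=6:mpiprocs=6 -l walltime=00:10:00 %s\n' \
                "${ACCOUNT}" "${QUEUE:-dev}" "${jjob}" ;;
        *)
            # Unknown machine: run the j-job directly, as the current else block does.
            printf '%s\n' "${jjob}" ;;
    esac
}
```

Collapsing the identical HERA and ORION/HERCULES branches into one case keeps the SLURM options in a single place, so each jjob_*.sh script would only differ in the j-job path it passes.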

RussTreadon-NOAA commented 4 months ago

test_gdasapp_aero_gen_3dvar_yaml Add -VV to ctest. Output shows that this job fails because wxflow cannot be found

1722: Test command: /lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/test/aero/genyaml_3dvar.sh "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/build/gdas" "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas" "WORKING" "DIRECTORY" "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/build/gdas/test/testrun/"
1722: Test timeout computed to be: 1500
1722: Traceback (most recent call last):
1722:   File "<stdin>", line 1, in <module>
1722: ModuleNotFoundError: No module named 'wxflow'

RussTreadon-NOAA commented 4 months ago

test_gdasapp_soca_socahybridweights, test_gdasapp_soca_incr_handler, test_gdasapp_soca_ens_handler Rerun each test with -VV. Each test fails when trying to execute sbatch. For example, test_gdasapp_soca_ens_handler attempts

1702: Test command: /lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py "-y" "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/build/gdas/test/soca/gw/testrun/run_gdas_apps_ens_handler.yaml" "--skip" "--ctest" "True"
1702: Environment variables: 
1702:  PYTHONPATH=/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/build/gdas/../lib/python3.8:/apps/ops/prod/nco/core/prod_util.v2.0.14/ush:/apps/prod/python-modules/3.8.6/intel/19.1.3.304/lib/python3.8/site-packages
1702: Test timeout computed to be: 1500
1702: {'machine': 'wcoss2', 'ctest command': {'executable': '/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/build/gdas/../bin/gdas_ens_handler.x', 'yaml input': '/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/test/soca/testinput/ens_handler.yaml'}, 'job options': {'account': 'da-cpu', 'qos': 'batch', 'output': 'ens_handler.out', 'nodes': 1, 'ntasks': 1, 'partition': None, 'time': '00:05:00'}}
1702: running sbatch --wait run_jjobs.sh ...
1702: Traceback (most recent call last):
1702:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py", line 309, in <module>
1702:     main()
1702:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py", line 305, in main
1702:     run_card.execute(submit=True)
1702:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py", line 260, in execute
1702:     subprocess.check_output(["sbatch", "--wait", self.name])
1702:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 411, in check_output
1702:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
1702:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 489, in run
1702:     with Popen(*popenargs, **kwargs) as process:
1702:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 854, in __init__
1702:     self._execute_child(args, executable, preexec_fn, close_fds,
1702:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 1702, in _execute_child
1702:     raise child_exception_type(errno_num, err_msg, err_filename)
1702: FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

RussTreadon-NOAA commented 4 months ago

test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS Add -VV to rerun. Job failed attempting to execute sbatch

1691: machine is wcoss2
1691: gPDY: 20180415
1691: gcyc: 06
1691: assim_freq: 6
1691: RUN: gdas
1691: running sbatch --wait run_jjobs.sh ...
1691: Traceback (most recent call last):
1691:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py", line 309, in <module>
1691:     main()
1691:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py", line 305, in main
1691:     run_card.execute(submit=True)
1691:   File "/lfs/h2/emc/da/noscrub/emc.da/git/global-workflow/wcoss2/sorc/gdas.cd/bundle/gdas/ush/soca/run_jjobs.py", line 260, in execute
1691:     subprocess.check_output(["sbatch", "--wait", self.name])
1691:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 411, in check_output
1691:     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
1691:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 489, in run
1691:     with Popen(*popenargs, **kwargs) as process:
1691:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 854, in __init__
1691:     self._execute_child(args, executable, preexec_fn, close_fds,
1691:   File "/apps/spack/python/3.8.6/intel/19.1.3.304/pjn2nzkjvqgmjw4hmyz43v5x4jbxjzpk/lib/python3.8/subprocess.py", line 1702, in _execute_child
1691:     raise child_exception_type(errno_num, err_msg, err_filename)
1691: FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

Other test_gdasapp_soca_JGDAS ctests may fail for the same reason. It's also possible that each successive job requires the previous job to have Passed. Thus, if one job fails all the remaining jobs in the chain will fail.

RussTreadon-NOAA commented 4 months ago

I propose to revise the scope of this PR to build and run only some ctests. New issue(s) and PR(s) can be opened to get the Failed tests running on WCOSS2.

RussTreadon-NOAA commented 4 months ago

g-w PR #2620 cannot move forward until this GDASApp PR is approved and merged into develop. Once this is done, we can update the sorc/gdas.cd hash in g-w PR #2620.

RussTreadon-NOAA commented 4 months ago

@CoryMartin-NOAA , modulefiles/EVA/wcoss2.lua has been added. If you see problems, let me know and I'll fix 'em.

RussTreadon-NOAA commented 4 months ago

Thank you @CoryMartin-NOAA . Merging into develop.