NOAA-EMC / GDASApp

Global Data Assimilation System Application
GNU Lesser General Public License v2.1

Updates to GDASapp to account for new JCB policies #1144

Closed. danholdaway closed this 2 weeks ago

danholdaway commented 1 month ago

Once approved, I will merge the JCB PRs, and then we can merge this after updating the hashes.

A G-W PR will follow the approval and merge of this PR.

G-W branch required for testing: https://github.com/danholdaway/global-workflow/tree/feature/rename_atm

RussTreadon-NOAA commented 1 month ago

Orion test

Install feature/rename_atm at 2ec1767 inside g-w develop at c44d0ac8. Initial run of test_gdasapp yielded the following ctest failures:

64% tests passed, 17 tests failed out of 47

Label Time Summary:
gdas-utils    =   4.10 sec*proc (9 tests)
script        =   4.10 sec*proc (9 tests)

Total Test time (real) = 1370.19 sec

The following tests FAILED:
        1756 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
        1757 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
        1758 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1761 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1762 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1763 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
        1764 - test_gdasapp_soca_socahybridweights (Failed)
        1765 - test_gdasapp_soca_incr_handler (Failed)
        1766 - test_gdasapp_soca_ens_handler (Failed)
        1775 - test_gdasapp_atm_jjob_var_init (Failed)
        1776 - test_gdasapp_atm_jjob_var_run (Failed)
        1777 - test_gdasapp_atm_jjob_var_inc (Failed)
        1778 - test_gdasapp_atm_jjob_var_final (Failed)
        1779 - test_gdasapp_atm_jjob_ens_init (Failed)
        1780 - test_gdasapp_atm_jjob_ens_run (Failed)
        1781 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1782 - test_gdasapp_atm_jjob_ens_final (Failed)

Need to update g-w sorc/jcb to be consistent with the jcb-algorithms and jcb-gdas used in sorc/gdas.cd/parm. Update working copy of sorc/jcb to jcb branch feature/rename_atm at d167c6c. Rerun test_gdasapp. More ctests pass; failures are now limited to the soca tests.

81% tests passed, 9 tests failed out of 47

Label Time Summary:
gdas-utils    =   4.66 sec*proc (9 tests)
script        =   4.66 sec*proc (9 tests)

Total Test time (real) = 1787.51 sec

The following tests FAILED:
        1756 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
        1757 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
        1758 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1761 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1762 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1763 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
        1764 - test_gdasapp_soca_socahybridweights (Failed)
        1765 - test_gdasapp_soca_incr_handler (Failed)
        1766 - test_gdasapp_soca_ens_handler (Failed)

Some of these failures may flip to Passed if the initial test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP failure is resolved.
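
To compare runs like the two above, it can help to group the failed tests by component. This is a minimal sketch, not part of GDASApp: the function name `failed_by_component` and the regular expression are assumptions made for illustration, and the sample text is a three-line excerpt of the first ctest summary.

```python
import re
from collections import Counter

# Matches lines such as
#   "1756 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)"
# and captures the component that follows "test_gdasapp_" (soca, atm, ...).
FAILED_LINE = re.compile(r"^\s*\d+\s+-\s+test_gdasapp_([a-z0-9]+)_\S+\s+\(Failed\)")


def failed_by_component(ctest_summary):
    """Count failed tests per component in a ctest 'tests FAILED' listing."""
    counts = Counter()
    for line in ctest_summary.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Excerpt of the first Orion run reported in this thread.
sample = """
        1756 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
        1766 - test_gdasapp_soca_ens_handler (Failed)
        1775 - test_gdasapp_atm_jjob_var_init (Failed)
"""

counts = failed_by_component(sample)
print(dict(counts))
```

Running the same function on the full listings from both runs would show the atm count dropping to zero after the sorc/jcb update while the soca count stays the same.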

Took a peek at /work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build/gdas/test/soca/gw/testrun/JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP.out.

The traceback flagged gdas.t12z.insitu_surface_trkob.2018041512.nc4 as a missing file. This, however, may not be the cause of the failure. The traceback ends with

  File "/work2/noaa/da/rtreadon/git/global-workflow/rename_atm/parm/gdas/jcb-algorithms/3dfgat.yaml.j2", line 45, in top-level template code
    {% include observation_from_jcb + '.yaml.j2' %}
  File "/work2/noaa/da/python/opt/core/miniconda3/4.6.14/envs/gdasapp/lib/python3.7/site-packages/jinja2/loaders.py", line 218, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: insitu_profile_bathy.yaml.j2
+ JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP[1]: postamble JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP 1717527206 1

danholdaway commented 1 month ago

Thanks for testing @RussTreadon-NOAA. Passing tests require https://github.com/danholdaway/global-workflow/tree/feature/rename_atm branch of global-workflow. You should be able to switch branch and run the tests again without a rebuild since it's just a change to jcb and config that matters.

danholdaway commented 1 month ago

Ah I see you did that. Perhaps the failure is because of more observations added to obs_list without those YAMLs going to JCB as well. Let me check.
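
One way to check this hypothesis before rerunning the job is to confirm that every entry in the observation list has a matching `<name>.yaml.j2` file in the template directory. The sketch below is illustrative only: the helper `missing_observation_templates` is not part of JCB, the directory is a temporary stand-in, and `adt_rads_all` is a hypothetical observation name. Only `insitu_profile_bathy` comes from the traceback above.

```python
import tempfile
from pathlib import Path


def missing_observation_templates(obs_list, template_dir):
    """Return observation names that have no '<name>.yaml.j2' in template_dir."""
    template_dir = Path(template_dir)
    return [
        obs for obs in obs_list
        if not (template_dir / f"{obs}.yaml.j2").is_file()
    ]


# Stand-in for the real template directory: it holds one template only.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adt_rads_all.yaml.j2").write_text("obs space: {}\n")

    # 'adt_rads_all' is hypothetical; 'insitu_profile_bathy' is the
    # template named in the TemplateNotFound error.
    obs_list = ["adt_rads_all", "insitu_profile_bathy"]
    missing = missing_observation_templates(obs_list, tmp)

print(missing)
```

Any name this reports would be expected to trigger the same `TemplateNotFound` error at the `{% include %}` line when the template is rendered.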

danholdaway commented 1 month ago

@RussTreadon-NOAA I fixed that failure by adding the insitu YAML files to JCB. There should be zero downstream impact on the other tests, which should all pass again.

emcbot commented 4 weeks ago

Automated Global-Workflow GDASApp Testing Results: Machine: orion

Start: Thu Jun  6 13:20:48 CDT 2024 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Thu Jun  6 14:10:32 CDT 2024
---------------------------------------------------
Tests:                                  *Failed*
Tests: Failed at Thu Jun  6 14:25:32 CDT 2024
Tests: 64% tests passed, 17 tests failed out of 47
    1842 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
    1843 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
    1844 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
    1847 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
    1848 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
    1849 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
    1850 - test_gdasapp_soca_socahybridweights (Failed)
    1851 - test_gdasapp_soca_incr_handler (Failed)
    1852 - test_gdasapp_soca_ens_handler (Failed)
    1861 - test_gdasapp_atm_jjob_var_init (Failed)
    1862 - test_gdasapp_atm_jjob_var_run (Failed)
    1863 - test_gdasapp_atm_jjob_var_inc (Failed)
    1864 - test_gdasapp_atm_jjob_var_final (Failed)
    1865 - test_gdasapp_atm_jjob_ens_init (Failed)
    1866 - test_gdasapp_atm_jjob_ens_run (Failed)
    1867 - test_gdasapp_atm_jjob_ens_inc (Failed)
    1868 - test_gdasapp_atm_jjob_ens_final (Failed)
Tests: see output at /work2/noaa/stmp/cmartin/CI/GDASApp/workflow/PR/1144/global-workflow/sorc/gdas.cd/build/log.ctest

RussTreadon-NOAA commented 4 weeks ago

Updated branches and submodules in /work2/noaa/da/rtreadon/git/global-workflow/rename_atm. (Note: I am working in a locally modified copy of g-w develop.)

45 out of 47 tests pass

96% tests passed, 2 tests failed out of 47

Label Time Summary:
gdas-utils    =   9.85 sec*proc (9 tests)
script        =   9.85 sec*proc (9 tests)

Total Test time (real) = 1953.63 sec

The following tests FAILED:
        1836 - test_gdasapp_fv3jedi_fv3inc (Failed)
        1849 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)

A rerun of test_gdasapp_fv3jedi_fv3inc passed

(gdasapp) Orion-login-2:/work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build$ ctest -R test_gdasapp_fv3jedi_fv3inc
Test project /work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build
    Start 1836: test_gdasapp_fv3jedi_fv3inc
1/1 Test #1836: test_gdasapp_fv3jedi_fv3inc ......   Passed    5.43 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   5.98 sec

I cannot explain why the first run failed. The test includes a reference check. Is it possible that fv3jedi_fv3inc test results are not bitwise identical from one run to the next? What do you think @DavidNew-NOAA?

A check of the log file for test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY shows

2024-06-06 20:49:14:INFO:Loading input YAML from preevayamls/eva_insitu_profile_tesac_salinity_2018041512.yaml
Traceback (most recent call last):
  File "/work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/scripts/exgdas_global_marine_analysis_vrfy.py", line 187, in <module>
    marine_eva_post.marine_eva_post(infile, 'evayamls', diagdir)
  File "/work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/ush/eva/marine_eva_post.py", line 39, in marine_eva_post
    layer['vmin'] = vminmax[variable]['vmin']
KeyError: 'salinity'
+ JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY[1]: postamble JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY 1717706755 1
+ preamble.sh[70]: set +x
End JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY at 20:49:15 with error code 1 (time elapsed: 00:03:20)
+ Unknown[1]: postamble slurm_script 1717706753 1

My working copy of feature/rename_atm may not be consistent with the current state of g-w develop and/or other repositories.
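
The `KeyError: 'salinity'` above suggests that the `vminmax` table used at line 39 of marine_eva_post.py has no entry for that variable. A minimal sketch of a lookup that fails with a more descriptive message follows. The helper `plot_limits`, the table contents, and the name `waterTemperature` are hypothetical and are not taken from marine_eva_post.py.

```python
def plot_limits(vminmax, variable):
    """Return (vmin, vmax) for variable, or raise a KeyError naming the gap."""
    if variable not in vminmax:
        raise KeyError(
            f"no vmin/vmax entry for '{variable}'; "
            f"known variables: {sorted(vminmax)}"
        )
    limits = vminmax[variable]
    return limits["vmin"], limits["vmax"]


# Hypothetical table with an entry for temperature but none for salinity.
vminmax = {"waterTemperature": {"vmin": -2.0, "vmax": 2.0}}

try:
    plot_limits(vminmax, "salinity")
    message = ""
except KeyError as err:
    message = err.args[0]

print(message)
```

The explicit message lists which variables the table does know about, which makes a mismatch between the eva YAML and the plotting limits easier to spot in a job log.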

emcbot commented 4 weeks ago

Automated Global-Workflow GDASApp Testing Results: Machine: hera

Start: Thu Jun  6 18:24:35 UTC 2024 on hfe06
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Thu Jun  6 19:11:22 UTC 2024
---------------------------------------------------
Tests:                                  *Failed*
Tests: Failed at Thu Jun  6 21:10:13 UTC 2024
Tests: 64% tests passed, 17 tests failed out of 47
    1841 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
    1842 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
    1843 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
    1846 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
    1847 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
    1848 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)
    1849 - test_gdasapp_soca_socahybridweights (Failed)
    1850 - test_gdasapp_soca_incr_handler (Failed)
    1851 - test_gdasapp_soca_ens_handler (Failed)
    1860 - test_gdasapp_atm_jjob_var_init (Failed)
    1861 - test_gdasapp_atm_jjob_var_run (Failed)
    1862 - test_gdasapp_atm_jjob_var_inc (Failed)
    1863 - test_gdasapp_atm_jjob_var_final (Failed)
    1864 - test_gdasapp_atm_jjob_ens_init (Failed)
    1867 - test_gdasapp_atm_jjob_ens_final (Failed)
Tests: see output at /scratch1/NCEPDEV/da/Cory.R.Martin/CI/GDASApp/workflow/PR/1144/global-workflow/sorc/gdas.cd/build/log.ctest

DavidNew-NOAA commented 4 weeks ago

@RussTreadon-NOAA There shouldn't be a reference check of the input/output state for the fv3inc tests. However, there were some significant updates to Global Workflow on Wednesday, and I was getting some similar errors yesterday until I updated all my repos.

DavidNew-NOAA commented 4 weeks ago

@RussTreadon-NOAA Sorry, I take that back. There is a reference check for test_gdasapp_fv3jedi_fv3inc. I was thinking of test_gdasapp_atm_jjob_var_inc and test_gdasapp_atm_jjob_ens_inc. Eventually I wish to add reference checks for those two jobs, which will make test_gdasapp_fv3jedi_fv3inc somewhat redundant. Anyway, I'm not sure whether results are bitwise identical from one run to the next. I would hope they are on the same machine.

RussTreadon-NOAA commented 3 weeks ago

Orion tests

Install forked g-w feature/rename_atm at 09ec021 on Orion. This fork clones GDASApp feature/rename_atm into g-w sorc/gdas.cd. Update cloned sorc/gdas.cd to current head, 72beb13, of GDASApp feature/rename_atm.

Run GDASApp ctests. 47 out of 47 tests pass

Test project /work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build
      Start 1489: test_gdasapp_util_coding_norms
 1/47 Test #1489: test_gdasapp_util_coding_norms ........................   Passed    8.84 sec
...
      Start 1869: test_gdasapp_aero_gen_3dvar_yaml
47/47 Test #1869: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    0.26 sec

100% tests passed, 0 tests failed out of 47

Label Time Summary:
gdas-utils    =  16.49 sec*proc (9 tests)
script        =  16.49 sec*proc (9 tests)

Total Test time (real) = 1521.82 sec

Run g-w CI C96C48_ufs_hybatmDA. The following jobs failed

202402240000           gdasatmanlvar                    18266389                DEAD                   1         2         102.0
202402240000            gfsatmanlvar                    18266345                DEAD                   1         2          66.0

due to

 6: GSI grid: number of processor in layout does not match number in communicator
12: GSI grid: number of processor in layout does not match number in communicator
14: GSI grid: number of processor in layout does not match number in communicator
18: GSI grid: number of processor in layout does not match number in communicator

Examination of the input yaml found

        saber central block:
          saber block name: gsi static covariance
          read:
            gsi akbk: ./fv3jedi/akbk.nc4
            gsi error covariance file: /work/noaa/stmp/rtreadon/ORION/RUNDIRS/prename/gfsatmanl_00/berror/gsi-coeffs-gfs-global.nc4
            gsi berror namelist file: /work/noaa/stmp/rtreadon/ORION/RUNDIRS/prename/gfsatmanl_00/berror/gfs_gsi_global.nml
            processor layout x direction: 12
            processor layout y direction: 8
            debugging mode: false
        saber outer blocks:
        - saber block name: gsi interpolation to model grid
          gsi akbk: ./fv3jedi/akbk.nc4
          gsi error covariance file: /work/noaa/stmp/rtreadon/ORION/RUNDIRS/prename/gfsatmanl_00/berror/gsi-coeffs-gfs-global.nc4
          gsi berror namelist file: /work/noaa/stmp/rtreadon/ORION/RUNDIRS/prename/gfsatmanl_00/berror/gfs_gsi_global.nml
          processor layout x direction: 12
          processor layout y direction: 12
          debugging mode: false

The [12,8] layout is correct. The [12,12] layout is wrong. Trace this to a typo in parm/jcb-gdas/model/atmosphere/atmosphere_background_error_hybrid_gsibec_bump.yaml.j2

       gsi berror namelist file: {{atmosphere_gsibec_path}}/gfs_gsi_global.nml
       processor layout x direction: {{atmosphere_layout_gsib_x}}
-      processor layout y direction: {{atmosphere_layout_gsib_x}}
+      processor layout y direction: {{atmosphere_layout_gsib_y}}

Note this in jcb PR #15.
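
The mismatch can be checked with simple arithmetic. The sketch below assumes the communicator for these jobs has 12 × 8 = 96 tasks, which the correct [12,8] layout implies but this thread does not state, and it assumes the layout product must equal the communicator size, as the GSI grid message indicates. The function name is hypothetical.

```python
def layout_matches_communicator(layout_x, layout_y, comm_size):
    """True when the processor layout covers exactly comm_size tasks."""
    return layout_x * layout_y == comm_size


# Assumed communicator size, inferred from the layout reported as correct.
comm_size = 12 * 8

# Layouts as they appear in the rendered YAML, keyed by saber block name.
layouts = {
    "gsi static covariance": (12, 8),
    "gsi interpolation to model grid": (12, 12),
}

mismatched = [
    name for name, (x, y) in layouts.items()
    if not layout_matches_communicator(x, y, comm_size)
]
print(mismatched)
```

Under these assumptions the second block requests 144 tasks against 96, which is consistent with the "number of processor in layout does not match number in communicator" message.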

Correct typo in working copy. GDASApp-based DA jobs now run to completion, and all cycles finish

(gdasapp) Orion-login-4:/work2/noaa/stmp/rtreadon/EXPDIR/prename$ rocotostat -d prename.db -w prename.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Jun 10 2024 14:40:15    Jun 10 2024 15:00:30
202402240000        Done    Jun 10 2024 14:40:15    Jun 10 2024 18:01:42

danholdaway commented 3 weeks ago

Thanks @RussTreadon-NOAA, feel free to push the required changes to the PRs

RussTreadon-NOAA commented 3 weeks ago

Need to update parm/jcb-gdas hash to a5d0277. Should also bring feature/rename_atm up to date with current head of GDASApp develop.

emcbot commented 2 weeks ago

Automated GDASApp Testing Results: Machine: hera

Start: Tue Jun 18 19:46:12 UTC 2024 on hfe03
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Tue Jun 18 20:40:01 UTC 2024
---------------------------------------------------
Tests:                                 *SUCCESS*
Tests: Completed at Tue Jun 18 20:41:58 UTC 2024
Tests: 100% tests passed, 0 tests failed out of 24

emcbot commented 2 weeks ago

Automated Global-Workflow GDASApp Testing Results: Machine: hera

Start: Tue Jun 18 19:52:55 UTC 2024 on hfe03
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Tue Jun 18 20:47:51 UTC 2024
---------------------------------------------------
Tests:                                  *Failed*
Tests: Failed at Tue Jun 18 21:09:36 UTC 2024
Tests: 67% tests passed, 16 tests failed out of 48
    1843 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
    1844 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
    1845 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
    1848 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
    1849 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
    1850 - test_gdasapp_soca_socahybridweights (Failed)
    1851 - test_gdasapp_soca_incr_handler (Failed)
    1852 - test_gdasapp_soca_ens_handler (Failed)
    1861 - test_gdasapp_atm_jjob_var_init (Failed)
    1862 - test_gdasapp_atm_jjob_var_run (Failed)
    1863 - test_gdasapp_atm_jjob_var_inc (Failed)
    1864 - test_gdasapp_atm_jjob_var_final (Failed)
    1865 - test_gdasapp_atm_jjob_ens_init (Failed)
    1866 - test_gdasapp_atm_jjob_ens_run (Failed)
    1867 - test_gdasapp_atm_jjob_ens_inc (Failed)
    1868 - test_gdasapp_atm_jjob_ens_final (Failed)
Tests: see output at /scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/workflow/PR/1144/global-workflow/sorc/gdas.cd/build/log.ctest

emcbot commented 2 weeks ago

Automated Global-Workflow GDASApp Testing Results: Machine: hera

Start: Tue Jun 18 22:05:37 UTC 2024 on hfe04
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Tue Jun 18 22:48:47 UTC 2024
---------------------------------------------------
Tests:                                  *Failed*
Tests: Failed at Tue Jun 18 23:09:55 UTC 2024
Tests: 67% tests passed, 16 tests failed out of 48
    1843 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
    1844 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
    1845 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
    1848 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
    1849 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
    1850 - test_gdasapp_soca_socahybridweights (Failed)
    1851 - test_gdasapp_soca_incr_handler (Failed)
    1852 - test_gdasapp_soca_ens_handler (Failed)
    1861 - test_gdasapp_atm_jjob_var_init (Failed)
    1862 - test_gdasapp_atm_jjob_var_run (Failed)
    1863 - test_gdasapp_atm_jjob_var_inc (Failed)
    1864 - test_gdasapp_atm_jjob_var_final (Failed)
    1865 - test_gdasapp_atm_jjob_ens_init (Failed)
    1866 - test_gdasapp_atm_jjob_ens_run (Failed)
    1867 - test_gdasapp_atm_jjob_ens_inc (Failed)
    1868 - test_gdasapp_atm_jjob_ens_final (Failed)
Tests: see output at /scratch1/NCEPDEV/da/role.jedipara/CI/GDASApp/workflow/PR/1144/global-workflow/sorc/gdas.cd/build/log.ctest

emcbot commented 2 weeks ago

Automated Global-Workflow GDASApp Testing Results: Machine: hera

Start: Wed Jun 19 00:38:56 UTC 2024 on hfe04
---------------------------------------------------
Build:                                 *SUCCESS*
Build: Completed at Wed Jun 19 01:24:03 UTC 2024
---------------------------------------------------
Tests:                                 *SUCCESS*
Tests: Completed at Wed Jun 19 01:49:26 UTC 2024
Tests: 100% tests passed, 0 tests failed out of 48

RussTreadon-NOAA commented 2 weeks ago

@CoryMartin-NOAA, @guillaumevernieres, and @DavidNew-NOAA - the changes in this PR have been tested via g-w PR #2700 with acceptable results. I'd like to merge this PR into GDASApp develop. Any objections?

@danholdaway has three JCB PRs related to GDASApp PR #1144 (this PR):

Each of the jcb PRs has been approved by Cory, David, and Russ. These jcb PRs should also be merged into their respective develop branches. Are we OK with doing so?

DavidNew-NOAA commented 2 weeks ago

@RussTreadon-NOAA No objection here