NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Update for JCB policies and stage DA job files with Jinja2-templates #2700

Open RussTreadon-NOAA opened 1 week ago

RussTreadon-NOAA commented 1 week ago

Description

This PR updates the gdas.cd hash to bring in new JCB conventions. Resolves #2699

From #2654: This PR will move much of the staging code that takes place in the Python initialization subroutines of the variational and ensemble DA jobs into Jinja2-templated YAML files to be passed into the wxflow file handler. Much of the staging has already been done this way; this PR simply expands that strategy.

The old Python routines that were doing this staging are now removed. This is part of a broader refactoring of the pygfs tasking.
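
As a rough illustration of the staging approach described above, the sketch below renders a templated staging spec and then carries out its `mkdir` and `copy` entries. It is not the wxflow implementation: `string.Template` stands in for Jinja2, a plain dict stands in for the parsed YAML, and `render`, `stage_files`, `demo`, and all paths are hypothetical names chosen for the example.

```python
import os
import shutil
import tempfile
from string import Template

# Hypothetical staging spec. In the workflow this would be a Jinja2-templated
# YAML file; here string.Template placeholders stand in for Jinja2 variables.
SPEC_TEMPLATE = {
    "mkdir": ["$DATA/obs"],
    "copy": [["$COM_OBS/demo.nc", "$DATA/obs/demo.nc"]],
}


def render(node, variables):
    """Recursively substitute variables into every string in the spec."""
    if isinstance(node, str):
        return Template(node).substitute(variables)
    if isinstance(node, list):
        return [render(item, variables) for item in node]
    if isinstance(node, dict):
        return {key: render(value, variables) for key, value in node.items()}
    return node


def stage_files(spec):
    """Create the listed directories, then copy each [source, destination] pair."""
    for path in spec.get("mkdir", []):
        os.makedirs(path, exist_ok=True)
    for src, dest in spec.get("copy", []):
        shutil.copy(src, dest)


def demo():
    """Stage one dummy file into a scratch run directory and return its path."""
    root = tempfile.mkdtemp()
    com_obs = os.path.join(root, "com")
    data = os.path.join(root, "run")
    os.makedirs(com_obs)
    with open(os.path.join(com_obs, "demo.nc"), "w") as handle:
        handle.write("placeholder")
    spec = render(SPEC_TEMPLATE, {"DATA": data, "COM_OBS": com_obs})
    stage_files(spec)
    return os.path.join(data, "obs", "demo.nc")
```

The point of the design is that the list of files to stage lives in data (the template) rather than in per-job Python code, so adding a file means editing the spec and not the task class.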

wxflow PR #30 is a companion to this PR.

Type of change

Change characteristics

How has this been tested?

Checklist

RussTreadon-NOAA commented 1 week ago

GDASApp ctests and g-w CI testing

Install RussTreadon-NOAA:feature/rename_atm at 39e719d on Dogwood, Hera, Hercules, and Orion. Run GDASApp ctests and g-w C96C48_ufs_hybatmDA.

Note: local modifications were made to C96C48_ufs_hybatmDA to enable CI on Hera, Hercules, and Orion.

Dogwood (WCOSS2): Not all GDASApp ctests are functional on WCOSS2 due to SLURM assumptions. This is a known issue and will be addressed by future GDASApp issue(s) and PR(s).

54% tests passed, 21 tests failed out of 46

Label Time Summary:
gdas-utils    =   6.73 sec*proc (9 tests)
script        =   6.73 sec*proc (9 tests)

Total Test time (real) = 172.20 sec

The following tests FAILED:
        1751 - test_gdasapp_fv3jedi_fv3inc (Not Run)
        1756 - test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS (Failed)
        1757 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_PREP (Failed)
        1758 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_BMAT (Failed)
        1759 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1760 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN (Failed)
        1761 - test_gdasapp_soca_copy_scratch (Failed)
        1762 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1763 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1764 - test_gdasapp_soca_socahybridweights (Failed)
        1765 - test_gdasapp_soca_incr_handler (Failed)
        1766 - test_gdasapp_soca_ens_handler (Failed)
        1769 - test_gdasapp_snow_apply_jediincr (Failed)
        1770 - test_gdasapp_snow_letkfoi_snowda (Failed)
        1776 - test_gdasapp_atm_jjob_var_run (Failed)
        1777 - test_gdasapp_atm_jjob_var_inc (Failed)
        1778 - test_gdasapp_atm_jjob_var_final (Failed)
        1780 - test_gdasapp_atm_jjob_ens_run (Failed)
        1781 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1782 - test_gdasapp_atm_jjob_ens_final (Failed)
        1783 - test_gdasapp_aero_gen_3dvar_yaml (Failed)

All jobs for g-w C96C48_ufs_hybatmDA CI successfully ran to completion

russ.treadon@dlogin09:/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/pratm> rocotostat -d pratm.db -w pratm.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    Jun 20 2024 10:25:07    Jun 20 2024 10:40:12
202402240000        Done    Jun 20 2024 10:25:07    Jun 20 2024 12:42:23

Hera: 48 out of 48 GDASApp ctests pass

Test project /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/rename_atm/sorc/gdas.cd/build
      Start 1488: test_gdasapp_util_coding_norms
 1/48 Test #1488: test_gdasapp_util_coding_norms ........................   Passed    2.47 sec
...
      Start 1869: test_gdasapp_aero_gen_3dvar_yaml
48/48 Test #1869: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    0.81 sec

100% tests passed, 0 tests failed out of 48

Label Time Summary:
gdas-utils    =   8.08 sec*proc (11 tests)
script        =   8.08 sec*proc (11 tests)

Total Test time (real) = 2298.85 sec

All jobs for g-w C96C48_ufs_hybatmDA CI successfully ran to completion

Hera(hfe08):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pratm$ rocotostat -d pratm.db -w pratm.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Jun 20 2024 09:50:11    Jun 20 2024 10:10:13
202402240000        Done    Jun 20 2024 09:50:11    Jun 20 2024 12:50:12

Hercules: Initially 47 out of 48 tests passed. test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN failed because APRUN_OCNANALECEN was not defined. Examine g-w env/HERCULES.env. Find that the ocnanalecen section found in other machine env files was missing from HERCULES.env. Add an ocnanalecen section to HERCULES.env:

@@ -135,6 +135,16 @@ case ${step} in
     [[ ${NTHREADS_OCNANAL} -gt ${nth_max} ]] && export NTHREADS_OCNANAL=${nth_max}
     export APRUN_OCNANAL="${launcher} -n ${npe_ocnanalrun} --cpus-per-task=${NTHREADS_OCNANAL}"
  ;;
+"ocnanalecen")
+
+    export APRUNCFP="${launcher} -n \$ncmd ${mpmd_opt}"
+
+    nth_max=$((npe_node_max / npe_node_ocnanalecen))
+
+    export NTHREADS_OCNANALECEN=${nth_ocnanalecen:-${nth_max}}
+    [[ ${NTHREADS_OCNANALECEN} -gt ${nth_max} ]] && export NTHREADS_OCNANALECEN=${nth_max}
+    export APRUN_OCNANALECEN="${launcher} -n ${npe_ocnanalecen} --cpus-per-task=${NTHREADS_OCNANALECEN}"
+;;
  "ocnanalchkpt")

     export APRUNCFP="${launcher} -n \$ncmd ${mpmd_opt}"
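
The thread-count logic in the added `ocnanalecen` section can be restated as a small sketch: the per-task thread maximum is the cores per node divided by the tasks per node, the requested thread count defaults to that maximum, and it is clamped so it never exceeds it. The function name `clamp_threads` and the numbers in the usage note are hypothetical; only the arithmetic is taken from the diff above.

```python
def clamp_threads(npe_node_max, npe_node_task, nth_task=None):
    """Mirror the shell logic: default to the per-task maximum, never exceed it."""
    # Integer division, matching shell arithmetic $((npe_node_max / npe_node_task)).
    nth_max = npe_node_max // npe_node_task
    # ${nth_task:-${nth_max}}: fall back to the maximum when nothing was requested.
    nthreads = nth_task if nth_task is not None else nth_max
    # [[ nthreads -gt nth_max ]] && nthreads=nth_max
    return min(nthreads, nth_max)
```

For example, with a hypothetical 80 cores per node and 40 tasks per node, `clamp_threads(80, 40, 4)` is reduced to 2, while `clamp_threads(80, 40, 1)` stays at 1.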

Rerun ctests. This time 48 out of 48 GDASApp ctests pass

Test project /work/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build
      Start 1489: test_gdasapp_util_coding_norms
 1/48 Test #1489: test_gdasapp_util_coding_norms ........................   Passed    1.76 sec
...
      Start 1870: test_gdasapp_aero_gen_3dvar_yaml
48/48 Test #1870: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    0.38 sec

100% tests passed, 0 tests failed out of 48

Label Time Summary:
gdas-utils    =  14.93 sec*proc (11 tests)
script        =  14.93 sec*proc (11 tests)

Total Test time (real) = 1637.73 sec

All jobs for g-w C96C48_ufs_hybatmDA CI successfully ran to completion

hercules-login-3:/work/noaa/stmp/rtreadon/EXPDIR/pratm$ rocotostat -d pratm.db -w pratm.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Jun 20 2024 10:00:03    Jun 20 2024 10:15:02
202402240000        Done    Jun 20 2024 10:00:03    Jun 20 2024 12:05:03

Orion: Add changes needed to compile GDASApp on Orion following Rocky 9 upgrade (see GDASApp PR #1180). Also found it necessary to update g-w workflow/hosts.py and ush/detect_machine.sh (see g-w issue #2695). Updated working copies of these scripts accordingly. After this, 48 out of 48 GDASApp ctests pass

Test project /work2/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build
      Start 1489: test_gdasapp_util_coding_norms
 1/48 Test #1489: test_gdasapp_util_coding_norms ........................   Passed    6.23 sec
...
      Start 1870: test_gdasapp_aero_gen_3dvar_yaml
48/48 Test #1870: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    7.02 sec

100% tests passed, 0 tests failed out of 48

Label Time Summary:
gdas-utils    =  39.30 sec*proc (11 tests)
script        =  39.30 sec*proc (11 tests)

Total Test time (real) = 1715.61 sec

No attempt was made to run g-w C96C48_ufs_hybatmDA CI because g-w has not yet been updated to run on Orion following the Rocky 9 upgrade (see g-w issue #2694)

RussTreadon-NOAA commented 1 week ago

GDASApp and CI testing identified three issues with g-w files

RussTreadon-NOAA:feature/rename_atm contains updates to these three files to address the above stated issues.

RussTreadon-NOAA commented 1 week ago

@CoryMartin-NOAA , @DavidNew-NOAA , @guillaumevernieres : this PR is ready for review. The PR

If any of you have time to review, your review would be appreciated.

guillaumevernieres commented 1 week ago

You don't want this to be merged in dev/gdasapp @RussTreadon-NOAA ?

RussTreadon-NOAA commented 1 week ago

@guillaumevernieres : I thought @danholdaway 's schematic had us

Once the g-w PR is closed, we rebase dev/gdasapp. RussTreadon-NOAA:feature/rename_atm followed this path.

If we want to merge RussTreadon-NOAA:feature/rename_atm into dev/gdasapp, the draft PR for doing so is #2702.

PR #2702 contains 112 modified files. This PR, #2700, contains 5 modified files.

aerorahul commented 1 week ago

@RussTreadon-NOAA Since #2654 also updates GDASApp hashes, would you be willing to work w/ @DavidNew-NOAA and merge the changes from this PR into #2654? This will expedite testing and merge.

@DavidNew-NOAA If @RussTreadon-NOAA agrees, would you be willing to merge #2700 into #2654 and do a test to confirm the changes are compatible?

Thanks!

DavidNew-NOAA commented 1 week ago

@aerorahul Sure, but #2654 doesn't update any GDASApp hashes

RussTreadon-NOAA commented 1 week ago

@aerorahul and @DavidNew-NOAA : given your comments, I am cloning DavidNew-NOAA:feature/stage_from_yaml on Hera and will run C96C48_ufs_hybatmDA CI. It's good to ensure PR #2654 works as intended before combining PR #2654 and #2700.

RussTreadon-NOAA commented 1 week ago

As documented in PR #2654, C96C48_ufs_hybatmDA CI is not working with DavidNew-NOAA:feature/stage_from_yaml. I will work with @DavidNew-NOAA to figure out what's going on. If we can't get CI to work by tomorrow afternoon, I recommend moving forward with the PR, #2700, as is.

RussTreadon-NOAA commented 1 week ago

With the merger of PR #2654 into this PR, PR #2654 may be closed.

As noted in PR #2654, the changes below must be committed to GDASApp in order to fully exercise the capability added by PR #2654.

RussTreadon-NOAA commented 1 week ago

GDASApp PR #1187 has been opened to add the above four modified files to GDASApp develop.

Once PR #1187 is approved and merged, the gdas.cd hash in feature/rename_atm will be updated.

RussTreadon-NOAA commented 1 week ago

@DavidNew-NOAA and @CoryMartin-NOAA : This PR is ready for final (I hope!) review.

This PR now includes the changes in g-w PR #2654. It also updates the gdas.cd hash to the current (as of 6/21/2024) head of GDASApp develop (4c58b1e).

RussTreadon-NOAA commented 1 week ago

Thank you @CoryMartin-NOAA for quickly reviewing GDASApp PRs to move this along.

aerorahul commented 1 week ago

@RussTreadon-NOAA Would you be open to adding description from #2654 into this PR? I am happy to update it.

RussTreadon-NOAA commented 1 week ago

Thank you @aerorahul for your note. Yes, I would appreciate your updating the description of this PR with relevant content from PR #2654.

emcbot commented 1 week ago

CI Update on Wcoss2 at 06/21/24 05:33:15 PM
============================================
Cloning and Building global-workflow PR: 2700
with PID: 250979 on host: dlogin08
emcbot commented 1 week ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Fri Jun 21 17:37:44 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/21/24 06:14:55 PM
Case setup: Completed for experiment C48_ATM_7cc86d95
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_7cc86d95
Case setup: Skipped for experiment C48_S2SWA_gefs_7cc86d95
Case setup: Completed for experiment C48_S2SW_7cc86d95
Case setup: Completed for experiment C96_atm3DVar_extended_7cc86d95
Case setup: Skipped for experiment C96_atm3DVar_7cc86d95
Case setup: Skipped for experiment C96_atmaerosnowDA_7cc86d95
Case setup: Completed for experiment C96C48_hybatmDA_7cc86d95
Case setup: Completed for experiment C96C48_ufs_hybatmDA_7cc86d95
emcbot commented 1 week ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_7cc86d95/logs/2021032418/gdasprepoceanobs.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 week ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C48mx500_3DVarAOWCDA_7cc86d95

RussTreadon-NOAA commented 1 week ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_7cc86d95/logs/2021032418/gdasprepoceanobs.log

Follow link here to view the contents of the above file(s): (link)

GDASApp issue #1192 has been opened to address this. The SOCA jobs are not finding wxflow.

emcbot commented 1 week ago

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_7cc86d95/logs/2021122018/gdasprepsnowobs.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 week ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C96_atmaerosnowDA_7cc86d95

CoryMartin-NOAA commented 1 week ago

"AttributeError: 'SnowAnalysis' object has no attribute 'runtime_config'"

CoryMartin-NOAA commented 1 week ago

This needs updated I think: https://github.com/NOAA-EMC/global-workflow/blob/f43a86276aaef91efa28faadc71a3cf50e749efe/scripts/exglobal_prep_snow_obs.py#L24

emcbot commented 1 week ago

Experiment C48_ATM_7cc86d95 SUCCESS on Wcoss2 at 06/21/24 07:24:12 PM

emcbot commented 1 week ago

Experiment C48_S2SW_7cc86d95 SUCCESS on Wcoss2 at 06/21/24 07:44:14 PM

RussTreadon-NOAA commented 1 week ago

This needs updated I think:

https://github.com/NOAA-EMC/global-workflow/blob/f43a86276aaef91efa28faadc71a3cf50e749efe/scripts/exglobal_prep_snow_obs.py#L24

I do not see where or how runtime_config.cyc gets associated with the SnowAnalysis object

aerorahul commented 1 week ago

This needs updated I think: https://github.com/NOAA-EMC/global-workflow/blob/f43a86276aaef91efa28faadc71a3cf50e749efe/scripts/exglobal_prep_snow_obs.py#L24

I do not see where or how runtime_config.cyc gets associated with the SnowAnalysis object

That should become task_config.cyc.
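
A minimal sketch of why the old attribute access fails and what the updated access looks like. `DemoTask` is a hypothetical stand-in, not the actual `SnowAnalysis` class, and the configuration values are made up; the only thing taken from the thread is that the object carries `task_config` and no longer has `runtime_config`.

```python
from types import SimpleNamespace


class DemoTask:
    """Hypothetical stand-in for a task class after the rename."""

    def __init__(self, config):
        # Only task_config is set; no runtime_config attribute exists.
        self.task_config = SimpleNamespace(**config)


task = DemoTask({"cyc": 18, "RUN": "gdas"})

try:
    task.runtime_config.cyc  # old-style access, raises AttributeError
except AttributeError as err:
    old_access_error = str(err)

new_access = task.task_config.cyc  # updated access
```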

emcbot commented 1 week ago

Experiment C96C48_hybatmDA_7cc86d95 SUCCESS on Wcoss2 at 06/21/24 08:24:25 PM

emcbot commented 1 week ago

Experiment C96C48_ufs_hybatmDA_7cc86d95 SUCCESS on Wcoss2 at 06/21/24 08:28:20 PM

RussTreadon-NOAA commented 1 week ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_7cc86d95/logs/2021032418/gdasprepoceanobs.log

Follow link here to view the contents of the above file(s): (link)

GDASApp issue #1192 has been opened to address this. The SOCA jobs are not finding wxflow.

It's a bit more complicated than just wxflow. Many soca scripts in gdas.cd/ush/soca use runtime_config. This is no longer valid. It should be task_config.


Hera(hfe06):/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/pr2700/sorc/gdas.cd/ush$ grep -r runtime_config .
./soca/prep_ocean_obs.py:        PDY = self.runtime_config['PDY']
./soca/prep_ocean_obs.py:        cyc = self.runtime_config['cyc']
./soca/prep_ocean_obs.py:        self.runtime_config['cdate'] = cdate
./soca/prep_ocean_obs.py:        cdate = self.runtime_config['cdate']
./soca/prep_ocean_obs.py:        RUN = self.runtime_config.RUN
./soca/prep_ocean_obs.py:        cyc = self.runtime_config['cyc']
./soca/prep_ocean_obs.py:        ocean_mask_dest = os.path.join(self.runtime_config.DATA, 'RECCAP2_region_masks_all_v20221025.nc')
./soca/prep_ocean_obs.py:                                                                     self.runtime_config,
./soca/prep_ocean_obs.py:        chdir(self.runtime_config.DATA)
./soca/prep_ocean_obs.py:        RUN = self.runtime_config.RUN
./soca/prep_ocean_obs.py:        cyc = self.runtime_config.cyc
Binary file ./soca/__pycache__/prep_ocean_obs.cpython-310.pyc matches
Binary file ./soca/__pycache__/prep_ocean_obs_utils.cpython-310.pyc matches
Binary file ./soca/__pycache__/marine_recenter.cpython-310.pyc matches
./soca/marine_recenter.py:        PDY = self.runtime_config['PDY']
./soca/marine_recenter.py:        cyc = self.runtime_config['cyc']
./soca/marine_recenter.py:        DATA = self.runtime_config.DATA
./soca/marine_recenter.py:        self.runtime_config['gcyc'] = gdate.strftime("%H")
./soca/marine_recenter.py:        self.runtime_config['gPDY'] = datetime(gdate.year,
./soca/marine_recenter.py:             'dump': self.runtime_config.RUN,
./soca/marine_recenter.py:        RUN = self.runtime_config.RUN
./soca/marine_recenter.py:        gcyc = self.runtime_config.gcyc
./soca/marine_recenter.py:        bkg_utils.stage_ic(self.config.bkg_dir, self.runtime_config.DATA, gcyc)
./soca/marine_recenter.py:        gPDYstr = self.runtime_config.gPDY.strftime("%Y%m%d")
./soca/marine_recenter.py:        chdir(self.runtime_config.DATA)
./soca/marine_recenter.py:        RUN = self.runtime_config.RUN
./soca/marine_recenter.py:        cyc = self.runtime_config.cyc
./soca/marine_recenter.py:        PDYstr = self.runtime_config.PDY.strftime("%Y%m%d")
./soca/prep_ocean_obs_utils.py:def obs_fetch(config, runtime_config, obsprep_space, cycles):
./soca/prep_ocean_obs_utils.py:    RUN = runtime_config.RUN
./soca/prep_ocean_obs_utils.py:    PDY = runtime_config.PDY
./soca/prep_ocean_obs_utils.py:    cyc = runtime_config.cyc
Hera(hfe06):/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/pr2700/sorc/gdas.cd/ush$ cd ..
Hera(hfe06):/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/pr2700/sorc/gdas.cd$ grep -r runtime_config ush/
ush/soca/prep_ocean_obs.py:        PDY = self.runtime_config['PDY']
ush/soca/prep_ocean_obs.py:        cyc = self.runtime_config['cyc']
ush/soca/prep_ocean_obs.py:        self.runtime_config['cdate'] = cdate
ush/soca/prep_ocean_obs.py:        cdate = self.runtime_config['cdate']
ush/soca/prep_ocean_obs.py:        RUN = self.runtime_config.RUN
ush/soca/prep_ocean_obs.py:        cyc = self.runtime_config['cyc']
ush/soca/prep_ocean_obs.py:        ocean_mask_dest = os.path.join(self.runtime_config.DATA, 'RECCAP2_region_masks_all_v20221025.nc')
ush/soca/prep_ocean_obs.py:                                                                     self.runtime_config,
ush/soca/prep_ocean_obs.py:        chdir(self.runtime_config.DATA)
ush/soca/prep_ocean_obs.py:        RUN = self.runtime_config.RUN
ush/soca/prep_ocean_obs.py:        cyc = self.runtime_config.cyc
Binary file ush/soca/__pycache__/prep_ocean_obs.cpython-310.pyc matches
Binary file ush/soca/__pycache__/prep_ocean_obs_utils.cpython-310.pyc matches
Binary file ush/soca/__pycache__/marine_recenter.cpython-310.pyc matches
ush/soca/marine_recenter.py:        PDY = self.runtime_config['PDY']
ush/soca/marine_recenter.py:        cyc = self.runtime_config['cyc']
ush/soca/marine_recenter.py:        DATA = self.runtime_config.DATA
ush/soca/marine_recenter.py:        self.runtime_config['gcyc'] = gdate.strftime("%H")
ush/soca/marine_recenter.py:        self.runtime_config['gPDY'] = datetime(gdate.year,
ush/soca/marine_recenter.py:             'dump': self.runtime_config.RUN,
ush/soca/marine_recenter.py:        RUN = self.runtime_config.RUN
ush/soca/marine_recenter.py:        gcyc = self.runtime_config.gcyc
ush/soca/marine_recenter.py:        bkg_utils.stage_ic(self.config.bkg_dir, self.runtime_config.DATA, gcyc)
ush/soca/marine_recenter.py:        gPDYstr = self.runtime_config.gPDY.strftime("%Y%m%d")
ush/soca/marine_recenter.py:        chdir(self.runtime_config.DATA)
ush/soca/marine_recenter.py:        RUN = self.runtime_config.RUN
ush/soca/marine_recenter.py:        cyc = self.runtime_config.cyc
ush/soca/marine_recenter.py:        PDYstr = self.runtime_config.PDY.strftime("%Y%m%d")
ush/soca/prep_ocean_obs_utils.py:def obs_fetch(config, runtime_config, obsprep_space, cycles):
ush/soca/prep_ocean_obs_utils.py:    RUN = runtime_config.RUN
ush/soca/prep_ocean_obs_utils.py:    PDY = runtime_config.PDY
ush/soca/prep_ocean_obs_utils.py:    cyc = runtime_config.cyc
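
The grep output above also matches compiled `.pyc` files under `__pycache__`. As a sketch only, the search can be limited to Python sources with a small helper; `find_references` is a hypothetical name and is not part of GDASApp or g-w.

```python
import os


def find_references(root, name="runtime_config"):
    """Return (path, line number, text) for each match in .py files under root."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune bytecode caches so compiled .pyc files are never scanned.
        dirnames[:] = [d for d in dirnames if d != "__pycache__"]
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if name in line:
                        hits.append((path, lineno, line.strip()))
    return hits
```

Calling `find_references("ush")` from the gdas.cd directory would list each remaining source line that still needs to be switched to `task_config`.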
emcbot commented 1 week ago

CI Passed Hercules at
Built and ran in directory /work2/noaa/stmp/CI/HERCULES/2700

emcbot commented 1 week ago

Experiment C96_atm3DVar_extended_7cc86d95 SUCCESS on Wcoss2 at 06/22/24 03:56:30 AM

emcbot commented 1 week ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_7cc86d95 *** SUCCESS *** at 06/21/24 07:24:12 PM
Experiment C48_S2SW_7cc86d95 *** SUCCESS *** at 06/21/24 07:44:14 PM
Experiment C96C48_hybatmDA_7cc86d95 *** SUCCESS *** at 06/21/24 08:24:25 PM
Experiment C96C48_ufs_hybatmDA_7cc86d95 *** SUCCESS *** at 06/21/24 08:28:20 PM
Experiment C96_atm3DVar_extended_7cc86d95 *** SUCCESS *** at 06/22/24 03:56:30 AM
RussTreadon-NOAA commented 1 week ago

Updates to two files committed to RussTreadon-NOAA:feature/rename_atm at 86631234

These changes, along with the changes documented in GDASApp PR #1195, restore all GDASApp ctests to a Passed state. These changes also get the failed g-w CI for C96_atmaerosnowDA and C48mx500_3DVarAOWCDA past the previously failed jobs.

NOTE: g-w CI should not be rerun until GDASApp PR #1195 is merged into GDASApp develop and the sorc/gdas.cd hash updated in RussTreadon-NOAA:feature/rename_atm

RussTreadon-NOAA commented 1 week ago

All jobs from g-w C48mx500_3DVarAOWCDA CI successfully ran to completion on Hera.

Hera(hfe06):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_wcda$ rocotostat -d pr2700_wcda.db -w pr2700_wcda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Jun 22 2024 16:38:05    Jun 22 2024 17:50:15
202103241800        Done    Jun 22 2024 16:38:05    Jun 22 2024 19:45:14
RussTreadon-NOAA commented 1 week ago

All jobs from g-w C96_atmaerosnowDA CI successfully ran to completion on Hera

Hera(hfe06):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_aero$ rocotostat -d pr2700_aero.db -w pr2700_aero.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Jun 22 2024 17:47:53    Jun 22 2024 18:15:11
202112201800        Done    Jun 22 2024 17:47:53    Jun 22 2024 19:45:12
202112210000        Done    Jun 22 2024 17:47:53    Jun 22 2024 21:35:10
RussTreadon-NOAA commented 1 week ago

All jobs from C96C48_ufs_hybatmDA CI successfully ran to completion on Hera

Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_ufsda$ rocotostat -d pr2700_ufsda.db -w pr2700_ufsda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Jun 22 2024 21:15:22    Jun 22 2024 21:40:23
202402240000        Done    Jun 22 2024 21:15:22    Jun 23 2024 02:05:12

All jobs from C96C48_hybatmDA CI successfully ran to completion on Hera

Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_gsida$ rocotostat -d pr2700_gsida.db -w pr2700_gsida.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Jun 22 2024 21:15:24    Jun 22 2024 21:40:25
202112210000        Done    Jun 22 2024 21:15:24    Jun 23 2024 00:15:16
202112210600        Done    Jun 22 2024 21:15:24    Jun 23 2024 01:00:23
RussTreadon-NOAA commented 1 week ago

@WalterKolczynski-NOAA , the gdas.cd hash was updated at 19f35e9. As reported above, the wcda, aerosnow, ufsda, and gsida CI cases ran successfully on Hera.

However, you may opt to pause triggering new g-w CI until the team decides whether or not to include additional wxflow clean up in this PR.

RussTreadon-NOAA commented 6 days ago

@aerorahul, @WalterKolczynski-NOAA @CoryMartin-NOAA , & @DavidNew-NOAA , the changes in this PR may be reviewed.

I do not plan on making any more changes to this PR apart from

RussTreadon-NOAA commented 6 days ago

The gdas.cd hash has been updated. Absent change request(s) from reviewers, this PR is ready for final CI testing.

emcbot commented 6 days ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C48mx500_3DVarAOWCDA_d1d88a6d

TerrenceMcGuinness-NOAA commented 6 days ago

/scratch1/NCEPDEV/global/glopara/dump/gdas.20210324/18/atmos/gdas.t18z.updated.status.tm00.bufr_d does not exist

Terry.McGuinness (hfe03) C48mx500_3DVarAOWCDA_d1d88a6d $ rocotocheck -w C48mx500_3DVarAOWCDA_d1d88a6d.xml -d C48mx500_3DVarAOWCDA_d1d88a6d.db -c 202103241800 -t gdasprep

Task: gdasprep
  account: nems
  command: /scratch1/NCEPDEV/global/CI/2700/gfs/jobs/rocoto/prep.sh
  cores: 4
  cycledefs: gdas
  final: false
  jobname: C48mx500_3DVarAOWCDA_d1d88a6d_gdasprep_18
  join: /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_d1d88a6d/logs/2021032418/gdasprep.log
  maxtries: 2
  memory: 40GB
  name: gdasprep
  nodes: 2:ppn=2:tpp=1
  partition: hera
  queue: batch
  throttle: 9999999
  walltime: 00:30:00
  environment
    CDATE ==> 2021032418
    CDUMP ==> gdas
    COMROOT ==> /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT
    DATAROOT ==> /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C48mx500_3DVarAOWCDA_d1d88a6d
    EXPDIR ==> /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/EXPDIR/C48mx500_3DVarAOWCDA_d1d88a6d
    HOMEgfs ==> /scratch1/NCEPDEV/global/CI/2700/gfs
    NET ==> gfs
    PDY ==> 20210324
    RUN ==> gdas
    RUN_ENVIR ==> emc
    cyc ==> 18
  dependencies
    AND is not satisfied
      SOME is satisfied
        gdasatmos_prod_f000 of cycle 202103241200 is SUCCEEDED
        gdasatmos_prod_f003 of cycle 202103241200 is SUCCEEDED
        gdasatmos_prod_f006 of cycle 202103241200 is SUCCEEDED
        gdasatmos_prod_f009 of cycle 202103241200 is SUCCEEDED
      /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_d1d88a6d/gdas.20210324/12//model_data/atmos/history/gdas.t12z.atmf009.nc is available
      /scratch1/NCEPDEV/global/glopara/dump/gdas.20210324/18/atmos/gdas.t18z.updated.status.tm00.bufr_d does not exist

Cycle: 202103241800
  Valid for this task: YES
  State: active
  Activated: 2024-06-24 16:03:06 UTC
  Completed: -
  Expired: -

Job: This task has not been submitted for this cycle

Task can not be submitted because:
  Dependencies are not satisfied
RussTreadon-NOAA commented 6 days ago

/scratch1/NCEPDEV/global/glopara/dump/gdas.20210324 was removed as part of routine GDA disk management.

@KateFriedman-NOAA , can this dump directory be restored on Hera to allow g-w C48mx500_3DVarAOWCDA CI to run?

RussTreadon-NOAA commented 6 days ago

The 20240224 00Z gdas and gfs atmanlupp jobs died on WCOSS2 (Dogwood) with the error message

+ JGLOBAL_ATMOS_UPP[22]: /lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/rename_atm/scripts/exglobal_atmos_upp.py
Traceback (most recent call last):
  File "/lfs/h2/emc/da/noscrub/russ.treadon/git/global-workflow/rename_atm/scripts/exglobal_atmos_upp.py", line 6, in <module>
    from pygfs.task.upp import UPP
ModuleNotFoundError: No module named 'pygfs'

Examination of jobs/rocoto/upp.sh shows that load_fv3gfs_modules.sh is NOT executed on WCOSS2.

# Source FV3GFS workflow modules
#. "${HOMEgfs}/ush/load_fv3gfs_modules.sh"
#status=$?
#if (( status != 0 )); then exit "${status}"; fi
# Temporarily load modules from UPP on WCOSS2
source "${HOMEgfs}/ush/detect_machine.sh"
if [[ "${MACHINE_ID}" = "wcoss2" ]]; then
  set +x
  source "${HOMEgfs}/ush/module-setup.sh"
  module use "${HOMEgfs}/sorc/ufs_model.fd/FV3/upp/modulefiles"
  module load "${MACHINE_ID}"
  module load prod_util
  module load cray-pals
  module load cfp
  module load libjpeg
  module load grib_util/1.2.3
  module load wgrib2/2.0.8
  export WGRIB2=wgrib2
  module load python/3.8.6
  module load crtm/2.4.0  # TODO: This is only needed when UPP_RUN=goes.  Is there a better way to handle this?
  set_trace
else
  . "${HOMEgfs}/ush/load_fv3gfs_modules.sh"
  status=$?
  if (( status != 0 )); then exit "${status}"; fi
fi

Given this, add the following to the WCOSS2 section of upp.sh

@@ -29,6 +29,12 @@ if [[ "${MACHINE_ID}" = "wcoss2" ]]; then
   module load python/3.8.6
   module load crtm/2.4.0  # TODO: This is only needed when UPP_RUN=goes.  Is there a better way to handle this?
   set_trace
+
+  # Add wxflow to PYTHONPATH
+  wxflowPATH="${HOMEgfs}/ush/python"
+  PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}${HOMEgfs}/ush:${wxflowPATH}"
+  export PYTHONPATH
+  
 else
   . "${HOMEgfs}/ush/load_fv3gfs_modules.sh"
   status=$?

With this change in place, the gdas and gfs atmanlupp jobs ran to completion on WCOSS2.

Change committed to RussTreadon-NOAA:feature/rename_atm at 8fc02e2.
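
The `${PYTHONPATH:+${PYTHONPATH}:}` expansion in the diff above prepends the existing value plus a colon only when `PYTHONPATH` is already set and non-empty, which avoids a stray leading colon otherwise. The sketch below restates that behavior; `append_to_path` and the example paths are hypothetical and serve only to show the idiom.

```python
def append_to_path(current, *new_entries):
    """Join entries with ':', skipping an unset (None) or empty current value."""
    # Equivalent of ${VAR:+${VAR}:} : keep the old value only if it is non-empty.
    parts = [current] if current else []
    parts.extend(new_entries)
    return ":".join(parts)
```

With an empty starting value, `append_to_path("", "/home/ush", "/home/ush/python")` yields `/home/ush:/home/ush/python`; with an existing value `/a`, `append_to_path("/a", "/b")` yields `/a:/b`.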

KateFriedman-NOAA commented 6 days ago

/scratch1/NCEPDEV/global/glopara/dump/gdas.20210324 was removed as part of routine GDA disk management.

@KateFriedman-NOAA , can this dump directory be restored on Hera to allow g-w C48mx500_3DVarAOWCDA CI to run?

@RussTreadon-NOAA Sorry about that! The dump data for 20210324 has been filled back in on Hera. I have made a note to not remove this date in future age-offs now. Let me know if you have any issues with this dump data.

emcbot commented 6 days ago

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f054.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f057.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f060.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f063.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 6 days ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C96_atmaerosnowDA_d1d88a6d

RussTreadon-NOAA commented 6 days ago

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f054.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f057.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f060.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f063.log

Follow link here to view the contents of the above file(s): (link)

@WalterKolczynski-NOAA . Each of the cited log files contains a disk quota exceeded message

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f054.log.0:cat: write error: Disk quota exceeded
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f057.log.0:cat: write error: Disk quota exceeded
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f060.log.0:cat: write error: Disk quota exceeded
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_d1d88a6d/logs/2021122100/gfsatmos_prod_f063.log.0:cat: write error: Disk quota exceeded
RussTreadon-NOAA commented 5 days ago

@WalterKolczynski-NOAA : g-w C48mx500_3DVarAOWCDA CI successfully ran to completion on Hera during the morning of 6/25/2024 using the role.jedipara account

Hera(hfe02):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_wcda$ rocotostat -d pr2700_wcda.db -w pr2700_wcda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Jun 25 2024 09:45:12    Jun 25 2024 10:00:34
202103241800        Done    Jun 25 2024 09:45:12    Jun 25 2024 10:46:18