NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Stage variational and ensemble DA job files with Jinja2-templated YAMLs #2654

Closed DavidNew-NOAA closed 1 week ago

DavidNew-NOAA commented 4 weeks ago

Description

This PR moves much of the staging code that takes place in the Python initialization subroutines of the variational and ensemble DA jobs into Jinja2-templated YAML files that are passed to the wxflow file handler. Much of the staging is already done this way; this PR simply expands that strategy.

The old Python routines that were doing this staging are now removed. This is part of a broader refactoring of the pygfs tasking.

wxflow PR #30 is a companion to this PR.
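
As a rough illustration of the strategy described above, the sketch below stages files from a dictionary of the kind a rendered staging YAML might yield. The `mkdir` and `copy` keys and the `stage` function are assumptions made for illustration; this is a stdlib stand-in, not the wxflow file handler itself.

```python
import shutil
import tempfile
from pathlib import Path


def stage(spec: dict) -> list:
    """Create directories, then copy [source, destination] pairs.

    Hypothetical stand-in for a file handler; the key names are assumed.
    """
    for directory in spec.get("mkdir", []):
        Path(directory).mkdir(parents=True, exist_ok=True)
    staged = []
    for source, destination in spec.get("copy", []):
        shutil.copy(source, destination)
        staged.append(str(destination))
    return staged


# Demo in a scratch directory so nothing outside it is touched.
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)
    source_file = root / "bkg.nc"
    source_file.write_text("placeholder background file\n")
    spec = {
        "mkdir": [str(root / "run" / "bkg")],
        "copy": [[str(source_file), str(root / "run" / "bkg" / "bkg.nc")]],
    }
    staged_files = stage(spec)
    staged_ok = all(Path(p).is_file() for p in staged_files)
```

Keeping the list of directories and file pairs in data rather than in Python is what lets a template loop over tiles or ensemble members generate the entries.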

Type of change

Change characteristics

How has this been tested?

Checklist

DavidNew-NOAA commented 3 weeks ago

@DavidHuber-NOAA Thanks for the suggestions. I tried tabbing everything for readability, but it generated errors. Let me retry tabbing things, and if I get the same errors, maybe I can run them by you and get some feedback on debugging.

DavidNew-NOAA commented 3 weeks ago

@DavidHuber-NOAA So here's the error I get when I tab out the for-loops in parm/gdas/staging/atm_var_bkg.yaml.j2. I'm wondering if Jinja2 is expecting a certain number of spaces or something of that nature.

    "expected <block end>, but found %r" % token.id, token.start_mark)
yaml.parser.ParserError: while parsing a block collection
  in "<unicode string>", line 12, column 4:
       - ['/work/noaa/da/dnew/global-wo ...
       ^
expected <block end>, but found '<block sequence start>'
  in "<unicode string>", line 16, column 10:
             - ['/work/noaa/da/dnew/global-wo ...
             ^

DavidHuber-NOAA commented 3 weeks ago

@DavidNew-NOAA Tabbing should only be applied to lines of Jinja code. The yaml-specific lines have to have their tabbing maintained to be in the expected format. Here is an example: https://github.com/NOAA-EMC/global-workflow/blob/e7909af8d9e1f34140388a3f8556d8e582c58fe5/parm/archive/arcdir.yaml.j2#L24-L28
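
The parser error above reports sequence items starting at two different columns, which is what happens when YAML lines get shifted along with the Jinja lines. A minimal stdlib sketch for spotting this in rendered text follows. `item_columns` is a hypothetical helper, it is only meaningful for a single flat list, and the two snippets are made-up stand-ins for rendered output, not the contents of atm_var_bkg.yaml.j2.

```python
def item_columns(rendered: str) -> set:
    """Return the set of 0-based columns at which '- ' items start."""
    columns = set()
    for line in rendered.splitlines():
        stripped = line.lstrip(" ")
        if stripped.startswith("- "):
            columns.add(len(line) - len(stripped))
    return columns


# Only the Jinja lines were indented; the YAML items keep one column.
consistent = "copy:\n    - [a.nc, run/a.nc]\n    - [b.nc, run/b.nc]\n"

# The YAML item inside the inner loop was indented as well.
shifted = "copy:\n    - [a.nc, run/a.nc]\n          - [b.nc, run/b.nc]\n"

print(item_columns(consistent))  # a single column: the items line up
print(item_columns(shifted))     # two columns: likely to fail YAML parsing
```

Running a check like this on the rendered text, before it reaches the YAML parser, points at the template lines whose YAML indentation moved.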

DavidNew-NOAA commented 3 weeks ago

@DavidHuber-NOAA Ah, thank you. I struggled a lot last week trying to figure out tabbing.

DavidNew-NOAA commented 1 week ago

@aerorahul Done

RussTreadon-NOAA commented 1 week ago

Hera test

Install feature/stage_from_yaml at 7aa041e0 on Hera in /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml. Run C96C48_ufs_hybatmDA CI. The 20240224 00Z gdasatmanlinit, enkfgdasatmensanlinit, and gfsatmanlinit jobs abort with the following messages:

gdasatmanlinit

Traceback (most recent call last):
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/scripts/exglobal_atm_analysis_initialize.py", line 23, in <module>
    AtmAnl = AtmAnalysis(config)
  File "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src/wxflow/logger.py", line 266, in wrapper
    retval = func(*args, **kwargs)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/atm_analysis.py", line 29, in __init__
    super().__init__(config)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/analysis.py", line 30, in __init__
    self.gdasapp_j2tmpl_dir = os.path.join(self.task_config.PARMgfs, 'gdas')
AttributeError: 'AtmAnalysis' object has no attribute 'task_config'
+ JGLOBAL_ATM_ANALYSIS_INITIALIZE[1]: postamble JGLOBAL_ATM_ANALYSIS_INITIALIZE 1718909230 1

enkfgdasatmensanlinit

Traceback (most recent call last):
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/scripts/exglobal_atmens_analysis_initialize.py", line 23, in <module>
    AtmEnsAnl = AtmEnsAnalysis(config)
  File "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src/wxflow/logger.py", line 266, in wrapper
    retval = func(*args, **kwargs)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/atmens_analysis.py", line 30, in __init__
    super().__init__(config)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/analysis.py", line 30, in __init__
    self.gdasapp_j2tmpl_dir = os.path.join(self.task_config.PARMgfs, 'gdas')
AttributeError: 'AtmEnsAnalysis' object has no attribute 'task_config'
+ JGLOBAL_ATMENS_ANALYSIS_INITIALIZE[1]: postamble JGLOBAL_ATMENS_ANALYSIS_INITIALIZE 1718909230 1

gfsatmanlinit

Traceback (most recent call last):
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/scripts/exglobal_atm_analysis_initialize.py", line 23, in <module>
    AtmAnl = AtmAnalysis(config)
  File "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src/wxflow/logger.py", line 266, in wrapper
    retval = func(*args, **kwargs)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/atm_analysis.py", line 29, in __init__
    super().__init__(config)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/analysis.py", line 30, in __init__
    self.gdasapp_j2tmpl_dir = os.path.join(self.task_config.PARMgfs, 'gdas')
AttributeError: 'AtmAnalysis' object has no attribute 'task_config'
+ JGLOBAL_ATM_ANALYSIS_INITIALIZE[1]: postamble JGLOBAL_ATM_ANALYSIS_INITIALIZE 1718909230 1

DavidNew-NOAA commented 1 week ago

@RussTreadon-NOAA This PR also updates the wxflow hash. I missed the latest commit and just re-updated it. task_config is now created in the wxflow Task class, not in the Analysis subclasses in G-W.
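
The AttributeError in the tracebacks above is consistent with a subclass reading `self.task_config` right after `super().__init__(config)` while the base class in use does not yet set it. The sketch below reproduces that pattern with hypothetical classes (`OldTask`, `NewTask`, `make_analysis`); it is not the wxflow or pygfs code.

```python
from types import SimpleNamespace


class OldTask:
    """Hypothetical base class that stores only the raw config."""

    def __init__(self, config: dict):
        self.config = config


class NewTask(OldTask):
    """Hypothetical base class that also builds task_config."""

    def __init__(self, config: dict):
        super().__init__(config)
        self.task_config = SimpleNamespace(**config)


def make_analysis(base: type) -> type:
    """Build an Analysis-like subclass on top of the given base."""

    class Analysis(base):
        def __init__(self, config: dict):
            super().__init__(config)
            # Works only if the base class created task_config.
            self.j2tmpl_dir = self.task_config.PARMgfs + "/gdas"

    return Analysis


config = {"PARMgfs": "/path/to/parm"}

try:
    make_analysis(OldTask)(config)
    old_error = None
except AttributeError as err:
    old_error = str(err)

new_dir = make_analysis(NewTask)(config).j2tmpl_dir
print(old_error)
print(new_dir)
```

The subclass code is identical in both cases; only the base class it is paired with decides whether the attribute exists.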

RussTreadon-NOAA commented 1 week ago

Thank you @DavidNew-NOAA. Updated to e175b252. Rewound and rebooted the failed gdasatmanlinit jobs. Same error in the log file:

  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/atm_analysis.py", line 29, in __init__
    super().__init__(config)
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/ush/python/pygfs/task/analysis.py", line 30, in __init__
    self.gdasapp_j2tmpl_dir = os.path.join(self.task_config.PARMgfs, 'gdas')
AttributeError: 'AtmAnalysis' object has no attribute 'task_config'

I confirmed that sorc/wxflow is at the specified hash:

Hera(hfe08):/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/sorc/wxflow$ git branch
* (HEAD detached at 5dad7dd)
  develop

DavidNew-NOAA commented 1 week ago

@RussTreadon-NOAA Ah yes, I forgot. hera.intel.lua loads wxflow as a hack here. GDASApp on Hera is not actually using the wxflow in the G-W.
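
When a modulefile prepends its own path to PYTHONPATH, the copy of a package that Python imports may not be the one in the checked-out tree. A quick way to check is to ask importlib where a module would be loaded from. `module_origin` is a hypothetical helper, and `json` is used only so the example runs anywhere; on Hera one would pass `wxflow`.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


print(module_origin("json"))
print(module_origin("no_such_module_for_this_demo"))
```

Comparing the printed path against the sorc/wxflow checkout would show which copy a job is actually picking up.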

RussTreadon-NOAA commented 1 week ago

@DavidNew-NOAA I commented out the Hera wxflow hack, rewound and rebooted. This worked! The CI test is once again running. I'll check for completion later tonight.

RussTreadon-NOAA commented 1 week ago

Hera C96C48_ufs_hybatmDA CI

All jobs successfully ran to completion with the wxflow hack commented out in sorc/gdas.cd/modulefiles/GDAS/hera.intel.lua.

Hera(hfe07):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/prstage$ rocotostat -d prstage.db -w prstage.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Jun 20 2024 17:18:52    Jun 20 2024 18:30:18
202402240000        Done    Jun 20 2024 17:18:52    Jun 20 2024 23:50:17

Similar wxflow hacks are also in gaea.intel.lua and noaacloud.intel.lua. However, the wxflow hack lines in these modulefiles are commented out. A GDASApp issue and PR should be opened to remove wxflow hacks, both active and commented out.

CI was run under role.jedipara. This account cannot use the fv3-cpu accounting code; it can only use da-cpu. Added ACCOUNT_SERVICE to ci/cases/yamls/ufs_hybatmDA_defaults.ci.yaml to set the service queue accounting code:

@@ -5,6 +5,7 @@ base:
   DO_JEDIATMVAR: "YES"
   DO_JEDIATMENS: "YES"
   ACCOUNT: {{ 'HPC_ACCOUNT' | getenv }}
+  ACCOUNT_SERVICE: {{ 'HPC_ACCOUNT_SERVICE' | getenv }}
 atmanl:
   LAYOUT_X_ATMANL: 4
   LAYOUT_Y_ATMANL: 4
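
The `getenv` filter in the diff above pulls accounting codes from the environment when the YAML is rendered. As a rough Python analogue (not the filter's actual implementation, whose fallback behavior is not shown in this thread), the hypothetical `getenv` below reads a variable and returns an assumed placeholder when it is unset.

```python
import os


def getenv(name: str, default: str = "UNSET") -> str:
    """Look up an environment variable; the default value is an assumption."""
    return os.environ.get(name, default)


# Values set here only for the demo; real runs would export them beforehand.
os.environ["HPC_ACCOUNT"] = "da-cpu"
os.environ.pop("HPC_ACCOUNT_SERVICE", None)

base = {
    "ACCOUNT": getenv("HPC_ACCOUNT"),
    "ACCOUNT_SERVICE": getenv("HPC_ACCOUNT_SERVICE"),
}
print(base)
```

If HPC_ACCOUNT_SERVICE is not exported before the CI case is set up, the rendered value is whatever the fallback is, so it is worth checking the environment first.
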
RussTreadon-NOAA commented 1 week ago

Merge DavidNew-NOAA:feature/stage_from_yaml into RussTreadon-NOAA:feature/rename_atm. No conflicts. Will install local merge in role.jedipara and run GDASApp ctests plus C96C48_ufs_hybatmDA CI. Will push local merge to RussTreadon-NOAA:feature/rename_atm pending successful tests.

RussTreadon-NOAA commented 1 week ago

@DavidNew-NOAA and @CoryMartin-NOAA : ran test_gdasapp from install of DavidNew-NOAA:feature/stage_from_yaml inside g-w. 47 of 48 tests pass. The one failure is test_gdasapp_aero_gen_3dvar_yaml

1869: Test command: /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/sorc/gdas.cd/bundle/gdas/test/aero/genyaml_3dvar.sh "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/sorc/gdas.cd/build/gdas" "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/sorc/gdas.cd/bundle/gdas" "WORKING" "DIRECTORY" "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/stage_from_yaml/sorc/gdas.cd/build/gdas/test/testrun/"
1869: Test timeout computed to be: 1500
1869: Traceback (most recent call last):
1869:   File "<stdin>", line 1, in <module>
1869: ModuleNotFoundError: No module named 'wxflow'
1/1 Test #1869: test_gdasapp_aero_gen_3dvar_yaml ...***Failed    0.24 sec

Script sorc/gdas.cd/test/aero/genyaml_3dvar.sh contains

# run some python code to generate the YAML
python3 - <<EOF
from wxflow import parse_j2yaml
import datetime

valid_time_obj = datetime.datetime.strptime('$CDATE','%Y%m%d%H')

Without the wxflow hack in modulefiles/GDAS/hera.intel.lua, the wxflow module is not on PYTHONPATH, so the import fails.

sorc/gdas.cd/test/atm/global-workflow/jjob_var_run.sh contains

# Set python path for workflow utilities and tasks
wxflowPATH="${HOMEgfs}/ush/python:${HOMEgfs}/ush/python/wxflow"
PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}${wxflowPATH}"
export PYTHONPATH

A similar approach could be considered for genyaml_3dvar.sh. This change goes in GDASApp, not g-w.
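
The `${PYTHONPATH:+${PYTHONPATH}:}` expansion in jjob_var_run.sh prefixes the existing PYTHONPATH and a colon only when PYTHONPATH is set and non-empty, which avoids a stray leading colon. A small Python sketch of the same logic, with a hypothetical `extend_pythonpath` helper and a placeholder HOMEgfs value:

```python
def extend_pythonpath(current: str, home_gfs: str) -> str:
    """Append the workflow Python paths to an existing PYTHONPATH value."""
    wxflow_path = f"{home_gfs}/ush/python:{home_gfs}/ush/python/wxflow"
    # Mirrors ${PYTHONPATH:+${PYTHONPATH}:}: add the prefix only if non-empty.
    prefix = f"{current}:" if current else ""
    return prefix + wxflow_path


print(extend_pythonpath("", "/path/to/g-w"))
print(extend_pythonpath("/opt/lib/python", "/path/to/g-w"))
```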

RussTreadon-NOAA commented 1 week ago

Hera tests

Make the following local change to test/aero/genyaml_3dvar.sh in merged copy of feature/stage_from_yaml and feature/rename_atm

@@ -24,6 +24,15 @@ export YAMLout=$DATA/3dvar_gfs_aero.yaml
 rm -rf $DATA
 mkdir -p $DATA

+# Set g-w HOMEgfs
+topdir=$(cd "$(dirname "$(readlink -f -n "${bindir}" )" )/../../.." && pwd -P)
+export HOMEgfs=$topdir
+
+# Set python path for workflow utilities and tasks
+wxflowPATH="${HOMEgfs}/ush/python:${HOMEgfs}/ush/python/wxflow"
+PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}${wxflowPATH}"
+export PYTHONPATH
+
 # run some python code to generate the YAML
 python3 - <<EOF
 from wxflow import parse_j2yaml

This addition was made in response to removing the wxflow hack

+++ b/modulefiles/GDAS/hera.intel.lua
@@ -74,9 +74,6 @@ load("py-xarray/2023.7.0")
 load("py-f90nml/1.4.3")
 load("py-pip/23.1.2")

--- hack for wxflow
-prepend_path("PYTHONPATH", "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src")
-
 setenv("CC","mpiicc")
 setenv("FC","mpiifort")
 setenv("CXX","mpiicpc")

from modulefiles/GDAS/hera.intel.lua
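
The HOMEgfs derivation added to genyaml_3dvar.sh walks up from the build directory to the top of the g-w clone. Assuming `bindir` is the build path passed as the first argument in the test command shown earlier, the same walk can be sketched in Python. `home_from_bindir` is a hypothetical helper, the path below is a placeholder, and unlike `readlink -f` this pure-path version does not resolve symlinks.

```python
from pathlib import PurePosixPath


def home_from_bindir(bindir: str) -> PurePosixPath:
    """Parent directory of bindir, then three more levels up."""
    # dirname(bindir) is parents[0]; '/../../..' climbs to parents[3].
    return PurePosixPath(bindir).parents[3]


bindir = "/path/to/global-workflow/sorc/gdas.cd/build/gdas"
print(home_from_bindir(bindir))  # /path/to/global-workflow
```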

With these changes in place, rerun test_gdasapp ctests. 48 out of 48 tests pass

Test project /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/merge/sorc/gdas.cd/build
      Start 1488: test_gdasapp_util_coding_norms
 1/48 Test #1488: test_gdasapp_util_coding_norms ........................   Passed    3.20 sec
...
      Start 1869: test_gdasapp_aero_gen_3dvar_yaml
48/48 Test #1869: test_gdasapp_aero_gen_3dvar_yaml ......................   Passed    1.17 sec

100% tests passed, 0 tests failed out of 48

Label Time Summary:
gdas-utils    =  11.66 sec*proc (11 tests)
script        =  11.66 sec*proc (11 tests)

Total Test time (real) = 2082.20 sec

g-w C96C48_ufs_hybatmDA CI successfully ran all jobs.

Hera(hfe06):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/prmerge$ rocotostat -d prmerge.db -w prmerge.xml  -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202402231800        Done    Jun 21 2024 02:24:10    Jun 21 2024 02:45:12
202402240000        Done    Jun 21 2024 02:24:10    Jun 21 2024 09:25:11

Given this, pushed the merge of feature/stage_from_yaml into feature/rename_atm to GitHub. Done at b00d31e5.

This PR, #2654, may be closed since it has been folded into PR #2700 as per EIB's request to merge these two PRs into one.

NOTE: The above changes to two files in gdas.cd are only in the local working copy. These changes need to be committed to GDASApp develop and the gdas.cd hash updated in feature/rename_atm.

CoryMartin-NOAA commented 1 week ago

I suggest we make a note/issue to fix this in GDASApp (which it looks like @RussTreadon-NOAA already did), and not let the aero gen YAML test hold up this g-w PR.

emcbot commented 1 week ago

CI Update on Wcoss2 at 06/25/24 02:51:16 PM
============================================
Cloning and Building global-workflow PR: 2654
with PID: 161823 on host: dlogin08

emcbot commented 1 week ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Tue Jun 25 14:55:52 UTC 2024 on dlogin08
---------------------------------------------------
Build: Completed at 06/25/24 03:31:18 PM
Case setup: Completed for experiment C48_ATM_e175b252
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_e175b252
Case setup: Skipped for experiment C48_S2SWA_gefs_e175b252
Case setup: Completed for experiment C48_S2SW_e175b252
Case setup: Completed for experiment C96_atm3DVar_extended_e175b252
Case setup: Skipped for experiment C96_atm3DVar_e175b252
Case setup: Skipped for experiment C96_atmaerosnowDA_e175b252
Case setup: Completed for experiment C96C48_hybatmDA_e175b252
Case setup: Completed for experiment C96C48_ufs_hybatmDA_e175b252

emcbot commented 1 week ago

Experiment C48_ATM_e175b252 SUCCESS on Wcoss2 at 06/25/24 04:44:07 PM

emcbot commented 1 week ago

Experiment C48_S2SW_e175b252 SUCCESS on Wcoss2 at 06/25/24 04:48:09 PM

DavidNew-NOAA commented 1 week ago

I see CI testing happening on this branch, but wasn't this PR combined with #2700?

aerorahul commented 1 week ago

@DavidNew-NOAA you are right. This PR should be closed.

aerorahul commented 1 week ago

Closing as this PR is combined w/ #2700

emcbot commented 1 week ago

Experiment C96C48_hybatmDA_e175b252 SUCCESS on Wcoss2 at 06/25/24 05:36:19 PM

emcbot commented 1 week ago

Experiment C96C48_ufs_hybatmDA_e175b252 SUCCESS on Wcoss2 at 06/25/24 05:48:13 PM

emcbot commented 1 week ago

Experiment C96_atm3DVar_extended_e175b252 SUCCESS on Wcoss2 at 06/26/24 02:04:39 AM

emcbot commented 1 week ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_e175b252 *** SUCCESS *** at 06/25/24 04:44:07 PM
Experiment C48_S2SW_e175b252 *** SUCCESS *** at 06/25/24 04:48:09 PM
Experiment C96C48_hybatmDA_e175b252 *** SUCCESS *** at 06/25/24 05:36:19 PM
Experiment C96C48_ufs_hybatmDA_e175b252 *** SUCCESS *** at 06/25/24 05:48:13 PM
Experiment C96_atm3DVar_extended_e175b252 *** SUCCESS *** at 06/26/24 02:04:39 AM