NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
GNU Lesser General Public License v3.0

gdasstage_ic and gdasfcst_seg0 disagree on staged filenames for ocean restarts #2865

Open jswhit opened 2 weeks ago

jswhit commented 2 weeks ago

What is wrong?

For a coupled 3dVar cycling experiment with cold starts for 2021032400, gdasstage_ic stages ocean restarts with m_prefix = 20210323.180000, but gdasfcst_seg0 then looks for restarts with m_prefix = 20210324.0000. Here's the error from gdasfcst_seg0:

/bin/cp -p /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart/ /scratch1/NCEPDEV/stmp2/Jeffrey.S.Whitaker/RUNDIRS/C192coupled3dvar_test/gdas.2021032400/gdasfcst.2021032400/fcst.1385029/INPUT/
/bin/cp: cannot stat '/scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart/': No such file or directory

and the relevant output from gdasstage_ic:

2024-08-27 15:16:21,927 - INFO     - file_utils  : Created /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart
2024-08-27 15:16:27,068 - INFO     - file_utils  : Copied /scratch2/BMC/gsienkf/whitaker/replayics/C192mx025//gdas.20210323/18/model/ocean/restart/ to /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart
2024-08-27 15:16:33,344 - INFO     - file_utils  : Copied /scratch2/BMC/gsienkf/whitaker/replayics/C192mx025//gdas.20210323/18/model/ocean/restart/ to /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart
2024-08-27 15:16:40,148 - INFO     - file_utils  : Copied /scratch2/BMC/gsienkf/whitaker/replayics/C192mx025//gdas.20210323/18/model/ocean/restart/ to /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart
2024-08-27 15:16:41,466 - INFO     - file_utils  : Copied /scratch2/BMC/gsienkf/whitaker/replayics/C192mx025//gdas.20210323/18/model/ocean/restart/ to /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C192coupled3dvar_test/gdas.20210323/18//model/ocean/restart

What should have happened?

gdasstage_ic stages ocean restarts with the same filenames expected by gdasfcst_seg0 for cold starts.

What machines are impacted?


Steps to reproduce

Run a cycled 3DVar coupled atm/ocean experiment. I ran

pslot="C192coupled3dvar_test" HPC_ACCOUNT="gsienkf" RUNTESTS="/scratch2/BMC/gsienkf/whitaker/GWTESTS" ICSDIR_ROOT="/scratch2/BMC/gsienkf/whitaker/replayics/C192mx025/"  ./workflow/ --yaml ci/cases/pr/C192mx025_3DVarAOWCDA.yaml

from /scratch2/BMC/gsienkf/whitaker/global-workflow-jswhit, with the patch from issue #2864 applied so that gdasstage_ic does not fail.

Additional information


Do you have a proposed solution?


KateFriedman-NOAA commented 2 weeks ago

Will work on this, thanks for reporting @jswhit !

KateFriedman-NOAA commented 2 weeks ago

Alrighty, so the gdasstage_ic job picks up your ICs "correctly" using m_prefix=20210323.210000, which is based on model_start_date_current_cycle minus 3hrs because DOIAU=YES. The gdasfcst_seg0 job then initially sets model_start_date_current_cycle to the same time (from a log from my reproduction of the issue):

1453 +[94]: model_start_date_current_cycle=2021032321

...but later on it gets set to the cycle that's running because the experiment is cold-starting, which means IAU is off and the model start date would not be 3hrs earlier:

1907 +[27]: model_start_date_current_cycle=2021032400

That happens in here:

Based on the above, either:

  1. the ocean ICs need to have the non-IAU model start date and the staging job needs an adjustment for cold-start for the ocean ICs
  2. the forecast job needs to be updated to handle the ocean restarts differently (treat them like a warm start while treating the atmosphere ICs for cold start)

Pretty sure option 2 is what is needed. Thoughts?
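For illustration, the date arithmetic behind the mismatch described above can be sketched in a few lines of shell. This is not the workflow's own code: `restart_prefix` is a hypothetical helper, the 3-hour offset is taken from the description above, and GNU `date` is assumed.

```shell
# Hypothetical helper, not part of global-workflow.
# Usage: restart_prefix <YYYYMMDDHH cycle> <offset_hours>
# Prints the YYYYMMDD.HHMMSS prefix for the cycle time minus the offset.
# Assumes GNU date (needs -d and the @epoch input form).
restart_prefix() (
  cycle="$1"
  offset_hours="${2:-0}"
  # Turn YYYYMMDDHH into "YYYY-MM-DD HH:00:00" so date can parse it.
  stamp=$(printf '%s\n' "$cycle" | sed 's/^\(....\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:00:00/')
  epoch=$(date -u -d "$stamp" +%s) || exit 1
  date -u -d "@$(( epoch - offset_hours * 3600 ))" +%Y%m%d.%H%M%S
)

# With IAU on, the model start is 3 hours before the cycle (what staging used).
staged=$(restart_prefix 2021032400 3)
# On a cold start IAU is off, so the forecast job uses the cycle time itself.
expected=$(restart_prefix 2021032400 0)
echo "staged=${staged} expected=${expected}"
```

The two values differ by exactly the IAU offset, which is why the copy in the forecast job cannot find the staged file.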

jswhit commented 2 weeks ago

I believe 2 was how things worked before

jswhit2 commented 2 weeks ago

FWIW, this fixes my particular case (cold start for atmosphere, warm starts for ocean/ice)

diff --git a/ush/ b/ush/
index 8af90549..2adf1aa1 100755
--- a/ush/
+++ b/ush/
@@ -415,7 +415,8 @@ MOM6_postdet() {
   else  # "${RERUN}" == "NO"
-    restart_date="${model_start_date_current_cycle}"
+    #restart_date="${model_start_date_current_cycle}"
+    restart_date="${current_cycle_begin}"

   # Copy MOM6 ICs
@@ -565,7 +566,8 @@ CICE_postdet() {
     seconds=$(to_seconds "${restart_date:8:2}0000")  # convert HHMMSS to seconds
   else  # "${RERUN}" == "NO"
-    restart_date="${model_start_date_current_cycle}"
+    #restart_date="${model_start_date_current_cycle}"
+    restart_date="${current_cycle_begin}"
     if [[ "${DO_JEDIOCNVAR:-NO}" == "YES" ]]; then

KateFriedman-NOAA commented 2 weeks ago

Good to know, thanks @jswhit ! Didn't get a chance to look deep into this yesterday, will aim to today.

KateFriedman-NOAA commented 1 week ago

@jswhit I see now that the staging needed adjusting. When I tested it, it worked, but I now see that you had symlinks from the 20210323.210000.MOM.res*.nc files to the correct 20210324.000000.MOM.res*.nc files, so it was a false success for me. I updated the staging yaml files to fix the issue in issue #2890 and it seems to have fixed things for the staging job in this case too.

I just ran the gdasstage_ic and gdasfcst_seg0 job for your case and they worked. Would you mind copying the yaml from my clone on Hera (/scratch1/NCEPDEV/global/Kate.Friedman/git/develop_fork/parm/stage) into your clone's parm/stage folder and try the staging and fcst jobs for your case? Let me know if it works as anticipated. Thanks!
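One way to guard against the kind of false success described above is to check whether a restart directory contains symlinks before trusting a test. The sketch below is only an illustration: `list_symlinks` is a hypothetical helper, and it relies on the `-maxdepth` option that GNU and BSD `find` both provide.

```shell
# Hypothetical helper, not part of global-workflow.
# Prints every symlink found directly inside the given directory.
# Empty output means no entry at that level is a symlink.
list_symlinks() {
  # The trailing slash makes find look inside the directory even if the
  # directory path itself is a symlink.
  find "${1%/}/" -maxdepth 1 -type l
}
```

Pointing it at the IC or staged ocean restart directory shows whether any restart names are aliases for differently dated files.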

jswhit commented 3 days ago

Sorry for the late reply @KateFriedman-NOAA. When I copy your parm/stage directory, the staging job seems to run fine, but I'm getting this error in gdasfcst_seg0.log. I think it's probably unrelated to this issue, but I don't seem to have a sorc/upp.fd/parm/gfs directory (which parm/post/gfs is symlinked to).

+[544]: /bin/cp -p /scratch2/BMC/gsienkf/whitaker/global-workflow-jswhit2/parm/post/gfs/postxconfig-NT-gfs-two.txt /scratch1/NCEPDEV/stmp2/Jeffrey.S.Whitaker/RUNDIRS/C96coupled3dvar_test/gdas.2021032400/gdasfcst.2021032400/fcst.998793/postxconfig-NT.txt
/bin/cp: cannot stat '/scratch2/BMC/gsienkf/whitaker/global-workflow-jswhit2/parm/post/gfs/postxconfig-NT-gfs-two.txt': No such file or directory

KateFriedman-NOAA commented 2 days ago

@jswhit There was an update to the system related to UPP and its parm txt files so you'll either want to make a fresh clone or do a submodule update command (and then link script) in your clone to remedy the issue.

jswhit commented 2 days ago

Okay, got the submodules updated correctly. Now I get this error:

+[440]: /bin/cp -p /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C96coupled3dvar_test/gdas.20210323/18//model/ocean/restart/ /scratch1/NCEPDEV/stmp2/Jeffrey.S.Whitaker/RUNDIRS/C96coupled3dvar_test/gdas.2021032400/gdasfcst.2021032400/fcst.1355321/INPUT/
/bin/cp: cannot stat '/scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C96coupled3dvar_test/gdas.20210323/18//model/ocean/restart/': No such file or directory
+[441]: echo 'FATAL ERROR: Unable to copy MOM6 IC, ABORT!'
FATAL ERROR: Unable to copy MOM6 IC, ABORT!
+[441]: exit 1

in gdasfcst_seg0 (see /scratch2/BMC/gsienkf/whitaker/GWTESTS/COMROOT/C96coupled3dvar_test/logs/2021032400)

The file that was staged is, not
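A quick way to compare what was staged against what the forecast job expects is to test the restart directory for each candidate prefix. This is a rough sketch, not workflow code: `has_mom6_restart` and the `RESTART_DIR` variable are hypothetical, and the `<prefix>.MOM.res*.nc` pattern is taken from the file names mentioned earlier in this thread.

```shell
# Hypothetical helper, not part of global-workflow.
# Usage: has_mom6_restart <dir> <YYYYMMDD.HHMMSS prefix>
# Succeeds if <dir> holds at least one file named <prefix>.MOM.res*.nc.
has_mom6_restart() (
  dir="$1"
  prefix="$2"
  for f in "${dir}/${prefix}".MOM.res*.nc; do
    # An unmatched glob stays literal, so confirm the path really exists.
    [ -e "$f" ] && exit 0
  done
  exit 1
)

# RESTART_DIR is an assumed variable; set it to the staged ocean restart directory.
restart_dir="${RESTART_DIR:-.}"
for p in 20210323.210000 20210324.000000; do
  if has_mom6_restart "$restart_dir" "$p"; then
    echo "$p: found"
  else
    echo "$p: missing"
  fi
done
```

Whichever prefix reports "found" is the one the staging job wrote; the forecast job's failing `cp` line shows the one it wants.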