NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Enable wcoss2 ufsda build and module load #2620

Closed RussTreadon-NOAA closed 1 month ago

RussTreadon-NOAA commented 1 month ago

Description

This PR enables ufsda (sorc/gdas.cd) to be built and run on WCOSS2.

Resolves #2602 Resolves #2579

How has this been tested?

Clone, build, and run C96C48_ufs_hybatmDA CI on WCOSS2 (Cactus)

emcbot commented 1 month ago

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2620

TerrenceMcGuinness-NOAA commented 1 month ago

@RussTreadon-NOAA ./ci/cases/yamls/build.yaml is used for "dual builds" using Jenkins on the RDHPCS machines. WCOSS CI uses the Bash system and only builds with -g. This is managed by using the skip_ci_on_hosts: option in the CI yaml case file.
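
To illustrate how a per-case host skip list can gate which cases run on a machine, here is a minimal sketch. Only the `skip_ci_on_hosts` key name comes from the comment above; the case names, the dictionary layout, and the `cases_for_host` helper are hypothetical, and the actual CI scripts may handle this differently.

```python
def cases_for_host(cases, host):
    """Split case names into (run, skipped) lists for the given host.

    `cases` maps a case name to its settings; a case is skipped when the
    host appears in its optional `skip_ci_on_hosts` list (case-insensitive).
    """
    run, skipped = [], []
    for name, settings in cases.items():
        skip_hosts = [h.lower() for h in settings.get("skip_ci_on_hosts", [])]
        if host.lower() in skip_hosts:
            skipped.append(name)
        else:
            run.append(name)
    return run, skipped


# Hypothetical case definitions, for illustration only.
example_cases = {
    "case_a": {},
    "case_b": {"skip_ci_on_hosts": ["wcoss2"]},
}
run, skipped = cases_for_host(example_cases, "WCOSS2")
print("run:", run, "skipped:", skipped)
```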

RussTreadon-NOAA commented 1 month ago

Thank you @TerrenceMcGuinness-NOAA for your reply. I did not know that bash builds used ci/scripts/clone-build_ci.sh.

@WalterKolczynski-NOAA, the -u option for the UFS-DA (GDASApp) build has been added. Done at ad49991. This alone may not be sufficient for WCOSS2 CI to pass.

@TerrenceMcGuinness-NOAA, it seems we use your account to run g-w CI on WCOSS2. Is your WCOSS2 account configured to use git-lfs? Some of the JEDI repos built by build_gdas.sh use git-lfs.

WalterKolczynski-NOAA commented 1 month ago

Nothing in the workflow should assume anything is in a user's profile (and we do a module reset anyway). If git-lfs is needed, it needs to be added to the modulefile the build script is running.
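
One way to avoid depending on a user's profile is to check, after the modules are loaded, that the required tools are actually on the search path. The sketch below is only an illustration of that idea: `missing_tools` is a hypothetical helper, not part of global-workflow, and it relies solely on the standard-library `shutil.which`.

```python
import shutil


def missing_tools(tools, path=None):
    """Return the entries of `tools` that cannot be found as executables.

    `path` is an optional search path string; when None, the current
    PATH environment variable is used.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


# Example: report which of the tools discussed above are not available.
print("missing:", missing_tools(["git", "git-lfs"]))
```

An empty list would mean both tools resolve in the current environment; anything listed would need to come from the modulefile rather than from a personal profile.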

RussTreadon-NOAA commented 1 month ago

@WalterKolczynski-NOAA: The following test has been completed on Cactus

  1. cd /lfs/h2/emc/ptmp/russ.treadon
  2. rsync -av --progress /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc/gdas.cd .
  3. cd gdas.cd
  4. ./build.sh -v -f > build.log 2>&1

The build finished. It seems git-lfs is not necessary during the GDASApp build step.

Please kick off CI on WCOSS2 using the current head of RussTreadon-NOAA:feature/wcoss2_ufsda. The current head (ad49991) added the -u option to the build_all.sh line in ci/scripts/clone-build_ci.sh.

RussTreadon-NOAA commented 1 month ago

Cactus test

  1. rsync /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow to /lfs/h2/emc/ptmp/russ.treadon/global-workflow
  2. execute build_all.sh -guk in my copy of global-workflow
  3. execute link_workflow.sh
  4. set up and run C96C48_ufs_hybatmDA in /lfs/h2/emc/ptmp/russ.treadon/[EXPDIR, COMROOT]
  5. gdas, gfs, and enkfgdas DA cycles ran to completion. Only gfsfcst and downstream jobs remain to finish.

RussTreadon-NOAA commented 1 month ago

@WalterKolczynski-NOAA , are developers allowed to change labels or is this restricted to the g-w team? I'd like to see if WCOSS CI passes so we can move onto the next set of GDASApp g-w PRs.

WalterKolczynski-NOAA commented 1 month ago

We ask that only CMs handle CI labels.

WalterKolczynski-NOAA commented 1 month ago

Will rerun WCOSS as soon as @TerrenceMcGuinness-NOAA updates the CI driver script to build with -u.

RussTreadon-NOAA commented 1 month ago

OK, @WalterKolczynski-NOAA . I won't touch g-w CI labels. Is there anything I can do to help move this PR along?

RussTreadon-NOAA:feature/wcoss2_ufsda has been updated to the current head of g-w develop. C96C48_ufs_hybatmDA is running on Cactus. So far, so good. As we have seen, successfully running C96C48_ufs_hybatmDA under russ.treadon does not guarantee that automated g-w CI can successfully run C96C48_ufs_hybatmDA.

RussTreadon-NOAA commented 1 month ago

@WalterKolczynski-NOAA , which CI driver script does @TerrenceMcGuinness-NOAA need to update to build with -u?

ad49991 added the -u option to ci/scripts/clone-build_ci.sh. Is this sufficient for automated g-w CI?

WalterKolczynski-NOAA commented 1 month ago

I think that's the one. Terry just needs to put it into his version that runs the CI (I believe that, unlike the machines using Jenkins, it won't pull that specific thing from the PR).

RussTreadon-NOAA commented 1 month ago

Thank you @WalterKolczynski-NOAA for the confirmation and update. WCOSS2 is unique in many regards.

emcbot commented 1 month ago

CI Update on Wcoss2 at 06/04/24 05:40:41 PM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 24603 on host: clogin05

emcbot commented 1 month ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Tue Jun  4 17:47:21 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/04/24 06:21:53 PM
Case setup: Completed for experiment C48_ATM_ca024035
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_ca024035
Case setup: Skipped for experiment C48_S2SWA_gefs_ca024035
Case setup: Completed for experiment C48_S2SW_ca024035
Case setup: Completed for experiment C96_atm3DVar_extended_ca024035
Case setup: Skipped for experiment C96_atm3DVar_ca024035
Case setup: Skipped for experiment C96_atmaerosnowDA_ca024035
Case setup: Completed for experiment C96C48_hybatmDA_ca024035
Case setup: Completed for experiment C96C48_ufs_hybatmDA_ca024035

emcbot commented 1 month ago

Experiment C48_ATM_ca024035 **** on Wcoss2 at 06/04/24 09:42:14 PM

Error logs:

Follow link here to view the contents of the above file(s): [(link)]()

emcbot commented 1 month ago

CI Update on Wcoss2 at 06/04/24 09:48:54 PM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 72270 on host: clogin05

emcbot commented 1 month ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Tue Jun  4 21:57:42 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/04/24 10:33:39 PM
Case setup: Completed for experiment C48_ATM_ca024035
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_ca024035
Case setup: Skipped for experiment C48_S2SWA_gefs_ca024035
Case setup: Completed for experiment C48_S2SW_ca024035
Case setup: Completed for experiment C96_atm3DVar_extended_ca024035
Case setup: Skipped for experiment C96_atm3DVar_ca024035
Case setup: Skipped for experiment C96_atmaerosnowDA_ca024035
Case setup: Completed for experiment C96C48_hybatmDA_ca024035
Case setup: Completed for experiment C96C48_ufs_hybatmDA_ca024035

emcbot commented 1 month ago

Experiment C48_ATM_ca024035 **** on Wcoss2 at 06/04/24 10:36:11 PM

Error logs:

Follow link here to view the contents of the above file(s): [(link)]()

emcbot commented 1 month ago

Experiment C48_S2SW_ca024035 **** on Wcoss2 at 06/04/24 11:42:17 PM

Error logs:

Follow link here to view the contents of the above file(s): [(link)]()

RussTreadon-NOAA commented 1 month ago

The (link) referenced above just takes us back to this PR. A check of the logs for C48_S2SW suggests that the stage and init jobs wound up in a strange state on Cactus:

russ.treadon@clogin09:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/EXPDIR/C48_S2SW_ca024035/logs> cat 2021032312.log 
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfsstage_ic
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfswaveinit
2024-06-04 22:35:09 +0000 :: clogin09 :: Submission status of gfsstage_ic is pending at druby://clogin09.cactus.wcoss2.ncep.noaa.gov:44675
2024-06-04 22:35:09 +0000 :: clogin09 :: Submission status of gfswaveinit is pending at druby://clogin09.cactus.wcoss2.ncep.noaa.gov:44675
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfsstage_ic is success, jobid=134280588
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfswaveinit is success, jobid=134280589
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfsstage_ic, jobid=134280588, in state UNKNOWN (F)
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfswaveinit, jobid=134280589, in state UNKNOWN (F)

I'm not sure what's going on.
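
The log above shows roughly 65 minutes between the "Submitting" entries and the confirmed submissions. A minimal sketch for pulling that delay out of a log in this format follows. The regular expressions are assumptions based only on the lines quoted here, and `submission_delays` is a hypothetical helper, not a Rocoto or global-workflow function.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfsstage_ic
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 :: \S+ :: (.*)$"
)
DONE_RE = re.compile(
    r"Submission status of previously pending (\S+) is success"
)


def submission_delays(lines):
    """Return {task: seconds from 'Submitting' to confirmed submission}."""
    submitted, delays = {}, {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Submitting "):
            submitted[message.split()[1]] = stamp
            continue
        done = DONE_RE.match(message)
        if done and done.group(1) in submitted:
            task = done.group(1)
            delays[task] = (stamp - submitted[task]).total_seconds()
    return delays


# Two lines copied from the log above.
log = [
    "2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfsstage_ic",
    "2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of "
    "previously pending gfsstage_ic is success, jobid=134280588",
]
delays = submission_delays(log)
print(delays)  # {'gfsstage_ic': 3902.0}
```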

WalterKolczynski-NOAA commented 1 month ago

Something internal to the CI system. Terry is monitoring it closely and manually adjusted some things.

RussTreadon-NOAA commented 1 month ago

OK, @WalterKolczynski-NOAA. I'll stand down.

TerrenceMcGuinness-NOAA commented 1 month ago

@RussTreadon-NOAA as far as I can tell, it looks more like WCOSS2 is having PBS issues and is randomly showing UNKNOWN rocoto status in the CI experiments, which is what those logs reflect. This time it's with C48_S2SW.

terry.mcguinness (clogin02) C48_S2SW_ca024035 $ rocotostat   -w C48_S2SW_ca024035.xml -d C48_S2SW_ca024035.db | head -5
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202103231200             gfsstage_ic                   134280588             UNKNOWN                   -         0           0.0
202103231200             gfswaveinit                   134280589             UNKNOWN                   -         0           0.0
202103231200                 gfsfcst                           -                   -                   -         -             -
terry.mcguinness (clogin02) C48_S2SW_ca024035 $

TerrenceMcGuinness-NOAA commented 1 month ago

@WalterKolczynski-NOAA we should add long waits in the rocotostat python code for when we get UNKNOWN. I'm going to restart it one more time and then review it in the morning.
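
A rough sketch of what such a wait could look like is below. It is not the CI's rocotostat code: `wait_for_known_state` and its parameters are hypothetical, and the caller is assumed to supply a `get_state` function that returns the task state as a string.

```python
import time


def wait_for_known_state(get_state, retries=5, wait_seconds=60.0,
                         sleep=time.sleep):
    """Call get_state() until it returns something other than 'UNKNOWN'.

    Waits `wait_seconds` between attempts and retries at most `retries`
    times. Returns the last state seen, which is still 'UNKNOWN' if every
    attempt reported it.
    """
    state = get_state()
    for _ in range(retries):
        if state != "UNKNOWN":
            break
        sleep(wait_seconds)
        state = get_state()
    return state


# Example with canned states and a no-op sleep so it finishes instantly.
states = iter(["UNKNOWN", "UNKNOWN", "RUNNING"])
print(wait_for_known_state(lambda: next(states), sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the wait testable; in real use the default `time.sleep` would apply the delay.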

emcbot commented 1 month ago

Experiment C48_ATM_ca024035 SUCCESS on Wcoss2 at 06/05/24 02:12:11 AM

emcbot commented 1 month ago

Experiment C48_S2SW_ca024035 SUCCESS on Wcoss2 at 06/05/24 02:42:13 AM

emcbot commented 1 month ago

Experiment C96C48_hybatmDA_ca024035 SUCCESS on Wcoss2 at 06/05/24 03:21:21 AM

emcbot commented 1 month ago

Experiment C96C48_ufs_hybatmDA_ca024035 SUCCESS on Wcoss2 at 06/05/24 03:36:11 AM

emcbot commented 1 month ago

Experiment C96_atm3DVar_extended_ca024035 SUCCESS on Wcoss2 at 06/05/24 09:21:30 AM

emcbot commented 1 month ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_ca024035 *** SUCCESS *** at 06/05/24 02:12:11 AM
Experiment C48_S2SW_ca024035 *** SUCCESS *** at 06/05/24 02:42:13 AM
Experiment C96C48_hybatmDA_ca024035 *** SUCCESS *** at 06/05/24 03:21:21 AM
Experiment C96C48_ufs_hybatmDA_ca024035 *** SUCCESS *** at 06/05/24 03:36:11 AM
Experiment C96_atm3DVar_extended_ca024035 *** SUCCESS *** at 06/05/24 09:21:30 AM

RussTreadon-NOAA commented 1 month ago

Yeah, WCOSS2-CI Passed!

@WalterKolczynski-NOAA , RussTreadon-NOAA:feature/wcoss2_ufsda is one commit behind the current head of develop. feature/wcoss2_ufsda does not include the two files committed at 67b833e.

Shall I update RussTreadon-NOAA:feature/wcoss2_ufsda? I have no problem doing so, but I am concerned that doing so might trigger another round of CI testing across supported platforms.

WalterKolczynski-NOAA commented 1 month ago

Nope, we're good.

RussTreadon-NOAA commented 1 month ago

Thank you @WalterKolczynski-NOAA for working with me on this PR and merging it into develop. Thank you @TerrenceMcGuinness-NOAA for helping us resolve WCOSS2 CI issues.