Closed RussTreadon-NOAA closed 1 month ago
CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2620
@RussTreadon-NOAA ./ci/cases/yamls/build.yaml is used for "dual builds" using Jenkins on the RDHPCS machines. WCOSS CI uses the Bash system and only builds with -g. This is managed with the skip_ci_on_hosts: option in the CI yaml case file.
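To make the host-skip idea concrete, here is a minimal Python sketch of how a per-case skip list could be honoured. Only the key name skip_ci_on_hosts comes from the comment above. The dictionary layout, the function name should_skip_case, and the host names are assumptions for illustration and are not taken from the global-workflow CI code.

```python
def should_skip_case(case_config: dict, host: str) -> bool:
    """Return True if the case lists this host under skip_ci_on_hosts."""
    # A missing or empty key means the case runs everywhere.
    skip_hosts = case_config.get("skip_ci_on_hosts") or []
    return host.lower() in {h.lower() for h in skip_hosts}


# Hypothetical parsed case file (layout assumed, not copied from the repo).
case = {"skip_ci_on_hosts": ["wcoss2"]}
print(should_skip_case(case, "WCOSS2"))  # True
print(should_skip_case(case, "hera"))    # False
```

The comparison is case-insensitive because the CI reports in this thread spell the machine as "Wcoss2" while the discussion uses "WCOSS2".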
Thank you @TerrenceMcGuinness-NOAA for your reply. I did not know that bash builds used ci/scripts/clone-build_ci.sh.
@WalterKolczynski-NOAA, the -u option for the UFS-DA (GDASApp) build has been added. Done at ad49991. This alone may not be sufficient for WCOSS2 CI to pass.
@TerrenceMcGuinness-NOAA, it seems we use your account to run g-w CI on WCOSS2. Is your WCOSS2 account configured to use git-lfs? Some of the JEDI repos built by build_gdas.sh use git-lfs.
Nothing in the workflow should assume anything is in a user's profile (and we do a module reset anyway). If git-lfs is needed, it needs to be added to the modulefile the build script is running.
@WalterKolczynski-NOAA: The following test has been completed on Cactus
cd /lfs/h2/emc/ptmp/russ.treadon
rsync -av --progress /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc/gdas.cd .
cd gdas.cd
./build.sh -v -f > build.log 2>&1
The build finished. It seems git-lfs is not necessary during the GDASApp build step.
Please kick off CI on WCOSS2 using the current head of RussTreadon-NOAA:feature/wcoss2_ufsda. The current head (ad49991) added the -u option to the build_all.sh line in ci/scripts/clone-build_ci.sh.
Cactus test
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow to /lfs/h2/emc/ptmp/russ.treadon/global-workflow
build_all.sh -guk in my copy of global-workflow
link_workflow.sh
/lfs/h2/emc/ptmp/russ.treadon/[EXPDIR, COMROOT]
@WalterKolczynski-NOAA , are developers allowed to change labels or is this restricted to the g-w team? I'd like to see if WCOSS CI passes so we can move onto the next set of GDASApp g-w PRs.
We ask that only CMs handle CI labels.
Will rerun WCOSS as soon as @TerrenceMcGuinness-NOAA updates the CI driver script to build with -u.
OK, @WalterKolczynski-NOAA . I won't touch g-w CI labels. Is there anything I can do to help move this PR along?
RussTreadon-NOAA:feature/wcoss2_ufsda has been updated to the current head of g-w develop. C96C48_ufs_hybatmDA is running on Cactus. So far, so good. As we have seen, successfully running C96C48_ufs_hybatmDA under russ.treadon does not guarantee that automated g-w CI can successfully run C96C48_ufs_hybatmDA.
@WalterKolczynski-NOAA, which CI driver script does @TerrenceMcGuinness-NOAA need to update to build with -u?
ad49991 added the -u option to ci/scripts/clone-build_ci.sh. Is this sufficient for automated g-w CI?
I think that's the one. Terry just needs to put it into his version that runs the CI (I believe that, unlike the machines using Jenkins, it won't pull that specific thing from the PR).
Thank you @WalterKolczynski-NOAA for the confirmation and update. WCOSS2 is unique in many regards.
CI Update on Wcoss2 at 06/04/24 05:40:41 PM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 24603 on host: clogin05
Automated global-workflow Testing Results:
Machine: Wcoss2
Start: Tue Jun 4 17:47:21 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/04/24 06:21:53 PM
Case setup: Completed for experiment C48_ATM_ca024035
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_ca024035
Case setup: Skipped for experiment C48_S2SWA_gefs_ca024035
Case setup: Completed for experiment C48_S2SW_ca024035
Case setup: Completed for experiment C96_atm3DVar_extended_ca024035
Case setup: Skipped for experiment C96_atm3DVar_ca024035
Case setup: Skipped for experiment C96_atmaerosnowDA_ca024035
Case setup: Completed for experiment C96C48_hybatmDA_ca024035
Case setup: Completed for experiment C96C48_ufs_hybatmDA_ca024035
Experiment C48_ATM_ca024035 **** on Wcoss2 at 06/04/24 09:42:14 PM
Error logs:
Follow link here to view the contents of the above file(s): [(link)]()
CI Update on Wcoss2 at 06/04/24 09:48:54 PM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 72270 on host: clogin05
Automated global-workflow Testing Results:
Machine: Wcoss2
Start: Tue Jun 4 21:57:42 UTC 2024 on clogin05
---------------------------------------------------
Build: Completed at 06/04/24 10:33:39 PM
Case setup: Completed for experiment C48_ATM_ca024035
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_ca024035
Case setup: Skipped for experiment C48_S2SWA_gefs_ca024035
Case setup: Completed for experiment C48_S2SW_ca024035
Case setup: Completed for experiment C96_atm3DVar_extended_ca024035
Case setup: Skipped for experiment C96_atm3DVar_ca024035
Case setup: Skipped for experiment C96_atmaerosnowDA_ca024035
Case setup: Completed for experiment C96C48_hybatmDA_ca024035
Case setup: Completed for experiment C96C48_ufs_hybatmDA_ca024035
Experiment C48_ATM_ca024035 **** on Wcoss2 at 06/04/24 10:36:11 PM
Error logs:
Follow link here to view the contents of the above file(s): [(link)]()
Experiment C48_S2SW_ca024035 **** on Wcoss2 at 06/04/24 11:42:17 PM
Error logs:
Follow link here to view the contents of the above file(s): [(link)]()
The (link) referenced above just takes us to this PR. A check of the logs for C48_S2SW suggests that the stage and init jobs wound up in a strange state on Cactus
russ.treadon@clogin09:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/EXPDIR/C48_S2SW_ca024035/logs> cat 2021032312.log
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfsstage_ic
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfswaveinit
2024-06-04 22:35:09 +0000 :: clogin09 :: Submission status of gfsstage_ic is pending at druby://clogin09.cactus.wcoss2.ncep.noaa.gov:44675
2024-06-04 22:35:09 +0000 :: clogin09 :: Submission status of gfswaveinit is pending at druby://clogin09.cactus.wcoss2.ncep.noaa.gov:44675
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfsstage_ic is success, jobid=134280588
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfswaveinit is success, jobid=134280589
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfsstage_ic, jobid=134280588, in state UNKNOWN (F)
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfswaveinit, jobid=134280589, in state UNKNOWN (F)
I'm not sure what's going on.
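One thing the log above does show is how long the submissions sat in the pending state. Here is a small Python sketch that measures that gap from the timestamps. The line layout ("timestamp :: host :: message") is inferred from this excerpt only, and the function names parse_line and pending_seconds are hypothetical helpers, not part of rocoto or the g-w CI scripts.

```python
from datetime import datetime

# Three lines copied from the log excerpt above.
LOG = """\
2024-06-04 22:35:09 +0000 :: clogin09 :: Submitting gfsstage_ic
2024-06-04 23:40:11 +0000 :: clogin09 :: Submission status of previously pending gfsstage_ic is success, jobid=134280588
2024-06-04 23:40:12 +0000 :: clogin09 :: Task gfsstage_ic, jobid=134280588, in state UNKNOWN (F)
"""


def parse_line(line):
    """Split one log line into (timestamp, host, message)."""
    stamp, host, message = line.split(" :: ", 2)
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z"), host, message


def pending_seconds(log_text, task):
    """Seconds between 'Submitting <task>' and the confirmed submission."""
    submitted = confirmed = None
    confirmed_prefix = f"Submission status of previously pending {task} is success"
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, _host, message = parse_line(line)
        if message == f"Submitting {task}":
            submitted = when
        elif message.startswith(confirmed_prefix):
            confirmed = when
    if submitted is None or confirmed is None:
        return None
    return (confirmed - submitted).total_seconds()


print(pending_seconds(LOG, "gfsstage_ic"))  # 3902.0
```

That is roughly 65 minutes between the submit request and the confirmed job ID, after which the task is immediately reported as UNKNOWN.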
Something internal to the CI system. Terry is monitoring it closely and manually adjusted some things.
OK, @WalterKolczynski-NOAA. I'll stand down.
@RussTreadon-NOAA as far as I can tell, it looks more like WCOSS2 is having PBS issues and is randomly showing UNKNOWN rocoto status in the CI experiments, which is reflected in those logs. This time it's with C48_S2SW.
terry.mcguinness (clogin02) C48_S2SW_ca024035 $ rocotostat -w C48_S2SW_ca024035.xml -d C48_S2SW_ca024035.db | head -5
CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
================================================================================================================================
202103231200 gfsstage_ic 134280588 UNKNOWN - 0 0.0
202103231200 gfswaveinit 134280589 UNKNOWN - 0 0.0
202103231200 gfsfcst - - - - -
terry.mcguinness (clogin02) C48_S2SW_ca024035 $
@WalterKolczynski-NOAA we should add long waits in the rocotostat Python code for when we get UNKNOWN. I'm going to restart it one more time and then review it in the morning.
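A rough sketch of what such a wait could look like is below. This is not the existing g-w rocotostat code. The function name wait_out_unknown, its parameters, and the default wait of 300 seconds are all assumptions for illustration. The status query is passed in as a callable so the sketch does not depend on how the real code invokes rocotostat.

```python
import time
from typing import Callable


def wait_out_unknown(
    get_state: Callable[[], str],
    max_checks: int = 5,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-query the state while it is UNKNOWN, pausing between checks.

    Returns the first state that is not UNKNOWN, or UNKNOWN if it
    persists for max_checks queries.
    """
    state = get_state()
    checks = 1
    while state == "UNKNOWN" and checks < max_checks:
        sleep(wait_seconds)
        state = get_state()
        checks += 1
    return state


# Stand-in for a real status query: UNKNOWN twice, then RUNNING.
states = iter(["UNKNOWN", "UNKNOWN", "RUNNING"])
print(wait_out_unknown(lambda: next(states), wait_seconds=0))  # RUNNING
```

Capping the number of checks keeps a genuinely stuck job from holding the CI run open indefinitely; the caller can still treat a final UNKNOWN as a failure.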
Experiment C48_ATM_ca024035 SUCCESS on Wcoss2 at 06/05/24 02:12:11 AM
Experiment C48_S2SW_ca024035 SUCCESS on Wcoss2 at 06/05/24 02:42:13 AM
Experiment C96C48_hybatmDA_ca024035 SUCCESS on Wcoss2 at 06/05/24 03:21:21 AM
Experiment C96C48_ufs_hybatmDA_ca024035 SUCCESS on Wcoss2 at 06/05/24 03:36:11 AM
Experiment C96_atm3DVar_extended_ca024035 SUCCESS on Wcoss2 at 06/05/24 09:21:30 AM
All CI Test Cases Passed on Wcoss2:
Experiment C48_ATM_ca024035 *** SUCCESS *** at 06/05/24 02:12:11 AM
Experiment C48_S2SW_ca024035 *** SUCCESS *** at 06/05/24 02:42:13 AM
Experiment C96C48_hybatmDA_ca024035 *** SUCCESS *** at 06/05/24 03:21:21 AM
Experiment C96C48_ufs_hybatmDA_ca024035 *** SUCCESS *** at 06/05/24 03:36:11 AM
Experiment C96_atm3DVar_extended_ca024035 *** SUCCESS *** at 06/05/24 09:21:30 AM
Yeah, WCOSS2-CI Passed!
@WalterKolczynski-NOAA, RussTreadon-NOAA:feature/wcoss2_ufsda is one commit behind the current head of develop. feature/wcoss2_ufsda does not include the two files committed at 67b833e.
Shall I update RussTreadon-NOAA:feature/wcoss2_ufsda? I have no problem doing so, but I am concerned that doing so might trigger another round of CI testing across supported platforms.
Nope, we're good.
Thank you @WalterKolczynski-NOAA for working with me on this PR and merging it into develop. Thank you @TerrenceMcGuinness-NOAA for helping us resolve WCOSS2 CI issues.
Description
This PR enables ufsda (sorc/gdas.cd) to be built and run on WCOSS2.
Resolves #2602
Resolves #2579
Type of change
Change characteristics
How has this been tested?
Clone, build, and run C96C48_ufs_hybatmDA CI on WCOSS2 (Cactus)
Checklist