NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Enable wcoss2 ufsda build and module load #2620

Closed RussTreadon-NOAA closed 4 weeks ago

RussTreadon-NOAA commented 1 month ago

Description

This PR enables ufsda (sorc/gdas.cd) to be built and run on WCOSS2.

Resolves #2602 Resolves #2579

Type of change

Change characteristics

How has this been tested?

Clone, build, and run C96C48_ufs_hybatmDA CI on WCOSS2 (Cactus)

Checklist

RussTreadon-NOAA commented 1 month ago

NOTE: This PR requires

This PR will be marked Ready for review once these tasks are completed.

WalterKolczynski-NOAA commented 1 month ago

@RussTreadon-NOAA As part of this PR, please also enable the CI tests that use the new JEDI-based GDAS that are currently disabled on wcoss. You can do this by removing wcoss2 from the skip_ci_on_hosts section in the ci/cases/*/*.yaml case files.

RussTreadon-NOAA commented 1 month ago

@WalterKolczynski-NOAA , I can remove wcoss2 from ci/cases/pr/C96C48_ufs_hybatmDA.yaml since I tested this on Cactus. It works.

Are you asking that this PR remove the following occurrences of wcoss2 in the CI yamls?

ci/cases/pr/C48mx500_3DVarAOWCDA.yaml:  - wcoss2
ci/cases/pr/C96C48_ufs_hybatmDA.yaml:  - wcoss2
ci/cases/pr/C96_atmaerosnowDA.yaml:  - wcoss2
ci/cases/pr/C96_atm3DVar.yaml:  - wcoss2
ci/cases/pr/C48_S2SWA_gefs.yaml:  - wcoss2
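For reference, a listing like the one above can be reproduced with a small helper. This is only a sketch: the function name list_cases_skipping_host is hypothetical, and it assumes each case file lists skipped hosts as "- <host>" entries (as the grep output suggests). It does not check which YAML key the entry sits under.

```shell
# Hypothetical helper: print the case files in a directory that still
# contain a "- <host>" list entry (for example under skip_ci_on_hosts).
# Assumption: hosts appear one per line as YAML list items.
list_cases_skipping_host() {
  local host="$1" case_dir="$2"
  # -l prints only file names; the pattern matches a whole "- host" line.
  grep -l -E "^[[:space:]]*-[[:space:]]*${host}[[:space:]]*\$" \
    "${case_dir}"/*.yaml 2>/dev/null | sort
}

# Example (run from the top of a global-workflow clone):
#   list_cases_skipping_host wcoss2 ci/cases/pr
```

Removing a host from a case file then amounts to deleting the matching list entry in each file the helper prints.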

I do not plan on testing anything other than C96C48_ufs_hybatmDA.yaml

Added note: This PR will remain in draft mode until NCO installs bufr/12.0.1 in production. Once this is done, wcoss2.intel.lua in GDASApp PR #1122 will be updated to use the official production installation of bufr/12.0.1. After GDASApp PR #1122 is closed, the sorc/gdas.cd hash in this PR will be updated.

WalterKolczynski-NOAA commented 1 month ago

Not all of them. The GEFS test has to remain off until the bash CI system supports dual build (GFS and GEFS use different UFS executables because of the wave grid option). I'm also not sure why the C96_atm3DVar test isn't on already, will check.

The other three should work as soon as gdas.cd can be built, AFAIK. If they don't work out of the box, we can get you help or defer those.

RussTreadon-NOAA commented 1 month ago

I'll keep it simple at first and only activate C96C48_ufs_hybatmDA on wcoss2.

WalterKolczynski-NOAA commented 1 month ago

Oh, the C96_atm3DVar test is disabled because we run the extended version instead.

RussTreadon-NOAA commented 1 month ago

Build RussTreadon-NOAA:feature/wcoss2_ufsda at 10a2bc5d on Cactus. Run JEDI ATM CI. 20240224/00 gfs and gdas cycles run to completion. 20240224/00 enkf cycle fails in the final job because member analysis increment files are not found in the expected format.

gdas.cd @ 95218e7 includes changes related to g-w PR #2592. This g-w PR adds a new enkf analysis increment job. gdas.cd @ 95218e7 assumes member increments are created by this new g-w job.

g-w PR #2592 must be merged into develop and RussTreadon-NOAA:feature/wcoss2_ufsda updated in order for the enkf cycle to successfully complete.

RussTreadon-NOAA commented 1 month ago

Install RussTreadon-NOAA:feature/wcoss2_ufsda at 7f7093f on Cactus. Run JEDI ATM CI (C96C48_ufs_hybatmDA). All jobs from gfs, gdas, and enkfgdas cycles successfully ran to completion
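The completion check can also be scripted. The sketch below assumes the rocotostat -s summary layout shown in this comment (a 12-digit cycle in the first column, the state in the second). all_cycles_done is a hypothetical helper name, not part of Rocoto.

```shell
# Hypothetical helper: succeed only if every cycle row read from stdin
# reports the state "Done".
# Assumption: cycle rows start with a 12-digit YYYYMMDDHHMM field followed
# by the state; header lines and anything else are ignored.
all_cycles_done() {
  awk '
    length($1) == 12 && $1 ~ /^[0-9]+$/ {
      seen++
      if ($2 != "Done") bad++
    }
    END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
  '
}

# Example:
#   rocotostat -d prtest.db -w prtest.xml -c all -s | all_cycles_done \
#     && echo "all cycles done"
```

The helper deliberately fails when no cycle rows are seen, so an empty or unreadable summary is not mistaken for success.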

russ.treadon@clogin04:/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prtest> rocotostat -d prtest.db -w prtest.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202402231800        Done    May 29 2024 00:25:41    May 29 2024 00:40:16
202402240000        Done    May 29 2024 00:25:41    May 29 2024 02:45:11

RussTreadon-NOAA commented 1 month ago

@DavidHuber-NOAA and @CatherineThomas-NOAA , if either of you have time would you review the changes in this PR?

This PR allows GDASApp to be built and run on WCOSS2. This capability is required for GFS v17.

RussTreadon-NOAA commented 1 month ago

Although it is not impacted by this PR, I also ran the GSI-based ATM CI (C96C48_hybatmDA) on Cactus. All jobs ran to completion successfully.

russ.treadon@clogin04:/lfs/h2/emc/ptmp/russ.treadon/EXPDIR/prtest_gsi> rocotostat -d prtest_gsi.db -w prtest_gsi.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED     
202112201800        Done    May 29 2024 09:35:34    May 29 2024 09:50:13
202112210000        Done    May 29 2024 09:35:34    May 29 2024 11:30:16
202112210600        Done    May 29 2024 09:35:34    May 29 2024 11:30:16

WalterKolczynski-NOAA commented 1 month ago

Does this need to be updated with that GSI utils hash before testing?

RussTreadon-NOAA commented 1 month ago

I did not plan on updating the hash for sorc/gsi_utils.fd, in order to keep this PR focused on its stated purpose: enabling UFSDA to build and run on WCOSS2. The current sorc/gsi_utils.fd hash works for both GSI-based and JEDI-based DA.

GSI-utils PR #44 removed the use of /apps/ops/para/libs and updated the version for a few modules. JEDI and GSI based CI tests demonstrated that this PR did not alter cycled results. It does, however, bring the package into better compliance with NCO implementation standards (e.g., do not build apps with non-production modules).

Given this plus your question, @WalterKolczynski-NOAA, I'll go ahead and update the sorc/gsi_utils.fd hash in this PR to d940406

RussTreadon-NOAA commented 1 month ago

Thank you @CatherineThomas-NOAA

emcbot commented 1 month ago

CI Update on Wcoss2 at 05/30/24 02:52:08 PM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 81085 on host: clogin01

emcbot commented 1 month ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Thu May 30 15:00:16 UTC 2024 on clogin01
---------------------------------------------------
Build: Completed at 05/30/24 03:12:05 PM
Case setup: Completed for experiment C48_ATM_d6f6ae0c
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_d6f6ae0c
Case setup: Skipped for experiment C48_S2SWA_gefs_d6f6ae0c
Case setup: Completed for experiment C48_S2SW_d6f6ae0c
Case setup: Completed for experiment C96_atm3DVar_extended_d6f6ae0c
Case setup: Skipped for experiment C96_atm3DVar_d6f6ae0c
Case setup: Skipped for experiment C96_atmaerosnowDA_d6f6ae0c
Case setup: Completed for experiment C96C48_hybatmDA_d6f6ae0c
Case setup: Completed for experiment C96C48_ufs_hybatmDA_d6f6ae0c

emcbot commented 1 month ago

Experiment C96C48_ufs_hybatmDA_d6f6ae0c FAIL on Wcoss2 at 05/30/24 03:48:40 PM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_d6f6ae0c/logs/2024022400/gdasprepatmiodaobs.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_d6f6ae0c/logs/2024022400/gfsprepatmiodaobs.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 month ago

Experiment C48_S2SW FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2620/RUNTESTS/COMROOT/C48_S2SW_d6f6ae0c/logs/2021032312/gfswaveinit.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 month ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C96_atmaerosnowDA_d6f6ae0c

emcbot commented 1 month ago

Experiment C96_atm3DVar FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C96_atm3DVar_d6f6ae0c

emcbot commented 1 month ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C48mx500_3DVarAOWCDA_d6f6ae0c

emcbot commented 1 month ago

Experiment C96C48_hybatmDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C96C48_hybatmDA_d6f6ae0c

emcbot commented 1 month ago

Experiment C48_S2SW FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C48_S2SW_d6f6ae0c

emcbot commented 1 month ago

Experiment C48_ATM FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2620/RUNTESTS/COMROOT/C48_ATM_d6f6ae0c/logs/2021032312/gfsfcst.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 month ago

Experiment C48_S2SWA_gefs FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C48_S2SWA_gefs_d6f6ae0c

emcbot commented 1 month ago

Experiment C48_ATM FAILED on Hera in /scratch1/NCEPDEV/global/CI/2620/RUNTESTS/C48_ATM_d6f6ae0c

emcbot commented 1 month ago

Experiment C48_S2SWA_gefs FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/logs/2021032312/atmos_prod_mem002_f066.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 month ago

Experiment C48_S2SWA_gefs FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/C48_S2SWA_gefs_d6f6ae0c

RussTreadon-NOAA commented 1 month ago

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_d6f6ae0c/logs/2024022400/gdasprepatmiodaobs.log contains the following error message

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/jobs/JGLOBAL_ATM_PREP_IODA_OBS: line 21: /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/ush/run_bufr2ioda.py: No such file or directory

run_bufr2ioda.py exists in sorc/gdas.cd/ush/ioda/bufr2ioda/

russ.treadon@clogin03:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc/gdas.cd/ush/ioda/bufr2ioda> ls -l run_bufr2ioda.py
-rwxr-xr-x 1 terry.mcguinness global 4682 May 30 14:54 run_bufr2ioda.py

Script sorc/link_workflow.sh should link this script to ush/ via

#------------------------------
#--add GDASApp files
#------------------------------
if [[ -d "${HOMEgfs}/sorc/gdas.cd/build" ]]; then
  cd "${HOMEgfs}/ush" || exit 1
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/soca"                              .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ufsda"                              .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/jediinc2fv3.py"                     .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ioda/bufr2ioda/gen_bufr2ioda_json.py"    .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ioda/bufr2ioda/gen_bufr2ioda_yaml.py"    .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ioda/bufr2ioda/run_bufr2ioda.py"    .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/build/bin/imsfv3_scf2ioda.py"           .
fi
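A quick way to confirm which of these links are actually in place is sketched below. The entry names are copied from the block above. The helper name missing_gdas_links and its single argument (the top of the global-workflow clone) are hypothetical.

```shell
# Hypothetical helper: report which GDASApp entries named in the
# link_workflow.sh block are absent from <HOMEgfs>/ush.
missing_gdas_links() {
  local homegfs="$1" name missing=0
  for name in soca ufsda jediinc2fv3.py gen_bufr2ioda_json.py \
              gen_bufr2ioda_yaml.py run_bufr2ioda.py imsfv3_scf2ioda.py; do
    # -e follows symlinks, so a dangling link is also reported as missing.
    if [[ ! -e "${homegfs}/ush/${name}" ]]; then
      echo "missing: ush/${name}"
      missing=1
    fi
  done
  return "${missing}"
}

# Example:
#   missing_gdas_links /path/to/global-workflow
```

A non-zero return with every entry listed would be consistent with the whole if-block having been skipped.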

Note that the creation of links depends on the existence of gdas.cd/build. A check of /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc/gdas.cd shows that build is not present, and a check of sorc/logs shows that build_gdas.log is not present either.

Was sorc/build_all.sh executed with the -u option to build GDASApp?

Note: Some of the JEDI components in GDASApp execute git-lfs during the build process. While I think g-w build_gdas.sh will run to completion without git-lfs, the log file may contain errors.

RussTreadon-NOAA commented 1 month ago

According to account_params all Hera stmp filesets are over quota

Hera(hfe03):/scratch1/NCEPDEV/da/role.jedipara$ date
Thu May 30 16:33:31 UTC 2024
Hera(hfe03):/scratch1/NCEPDEV/da/role.jedipara$ account_params |grep stmp
        Project: stmp
                Directory: /scratch1/NCEPDEV/stmp DiskInUse=726315 GB, Quota=700000 GB, Files=36307512, FileQUota=140000000
                Directory: /scratch2/NCEPDEV/stmp DiskInUse=710260 GB, Quota=700000 GB, Files=34504518, FileQUota=140000000
                Directory: /scratch2/NCEPDEV/stmp1 DiskInUse=710260 GB, Quota=700000 GB, Files=34504518, FileQUota=140000000
                Directory: /scratch1/NCEPDEV/stmp2 DiskInUse=726315 GB, Quota=700000 GB, Files=36307514, FileQUota=140000000
                Directory: /scratch2/NCEPDEV/stmp3 DiskInUse=710260 GB, Quota=700000 GB, Files=34504518, FileQUota=140000000
                Directory: /scratch1/NCEPDEV/stmp4 DiskInUse=726315 GB, Quota=700000 GB, Files=36307514, FileQUota=140000000

Do the Hera CI tests use any stmp directories?
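The over-quota condition can be pulled out of this output mechanically. The sketch below assumes the account_params line layout shown above ("Directory: <path> DiskInUse=<n> GB, Quota=<n> GB, ..."). over_quota_dirs is a hypothetical helper name.

```shell
# Hypothetical helper: read account_params output on stdin and print
# "<directory> <DiskInUse> <Quota>" for each directory whose disk usage
# exceeds its quota (both in GB, as reported).
over_quota_dirs() {
  awk '
    $1 == "Directory:" {
      use = ""; quota = ""
      for (i = 3; i <= NF; i++) {
        if ($i ~ /^DiskInUse=/)  use   = substr($i, 11)  # skip "DiskInUse="
        else if ($i ~ /^Quota=/) quota = substr($i, 7)   # skip "Quota="
      }
      if (use != "" && quota != "" && use + 0 > quota + 0)
        print $2, use, quota
    }
  '
}

# Example:
#   account_params | over_quota_dirs
```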

TerrenceMcGuinness-NOAA commented 1 month ago

@RussTreadon-NOAA CI uses stmp only in that it is where RUNDIRS is defined for experiments, and I just made sure ours was clean. I have STMP set to /scratch1/NCEPDEV/stmp2 for CI on Hera.

emcbot commented 1 month ago

CI Passed Orion at
Built and ran in directory /work2/noaa/stmp/CI/ORION/2620

WalterKolczynski-NOAA commented 1 month ago

ci/scripts/clone-build_ci.sh needs to be updated to build with UFSDA on WCOSS2. Then I think @TerrenceMcGuinness-NOAA will need to update the copy we use on Cactus to drive CI.

WalterKolczynski-NOAA commented 1 month ago

I have no idea what this Hercules failure is.

RussTreadon-NOAA commented 1 month ago

Look at /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/logs/2021032312/atmos_prod_mem002_f066.log. The error message is

+ exglobal_atmos_products.sh[113]: ((  iproc == nproc  ))
+ exglobal_atmos_products.sh[118]: wgrib2 tmpfile_f066 -for 1:0 -grib tmpfile_f066_1

*** FATAL ERROR: parse_loop: end < start 1:0 ***

+ exglobal_atmos_products.sh[1]: postamble exglobal_atmos_products.sh 1717085522 8

Can someone from the GEFS team help troubleshoot?

Spot check GSI-based DA job log files in /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C96C48_hybatmDA_d6f6ae0c/logs. All the log files I checked finished with error code 0. Seems this test was successful.

DavidHuber-NOAA commented 1 month ago

@WalterKolczynski-NOAA @RussTreadon-NOAA It looks like the 66-hour master GRIB2 file from member 2 was truncated. Below are the sizes of the master GRIB2 files:

9.2M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f000
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f006
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f012
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f018
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f024
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f030
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f036
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f042
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f048
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f054
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f060
3.2M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f066
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f072
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f078
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f084
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f090
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f096
9.4M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f102
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f108
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f114
9.5M /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/gefs.20210323/12/mem002/model_data/atmos/master/gefs.t12z.master.grb2f120

Note that the 66-hour forecast is 3.2MB while all others are ~9.5MB in size. Looking through the forecast log /work2/noaa/stmp/CI/HERCULES/2620/RUNTESTS/COMROOT/C48_S2SWA_gefs_d6f6ae0c/logs/2021032312/fcst_mem002.log, I do not see any suspicious messages surrounding the creation of this file.
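A size comparison like this one can be automated. The following is a minimal sketch, not something taken from the workflow: flag_small_files is a hypothetical helper that flags any file smaller than a chosen percentage of the largest file in the set, on the assumption that the master files for one member should all be of similar size.

```shell
# Hypothetical helper: flag_small_files <percent> <file>...
# Prints "<file> <bytes>" for every file smaller than <percent> percent
# of the largest file given.
flag_small_files() {
  local pct="$1"; shift
  local f size max=0
  for f in "$@"; do
    size=$(( $(wc -c < "${f}") ))      # arithmetic strips any padding
    if (( size > max )); then max=${size}; fi
  done
  for f in "$@"; do
    size=$(( $(wc -c < "${f}") ))
    if (( size * 100 < max * pct )); then
      printf '%s %s\n' "${f}" "${size}"
    fi
  done
}

# Example (threshold of 50 percent is an arbitrary choice):
#   flag_small_files 50 gefs.t12z.master.grb2f*
```

With the sizes listed above, a 50 percent threshold would single out the 3.2M f066 file and none of the others.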

The node that the forecast ran on (hercules-02-18) reported more failed jobs yesterday than surrounding nodes (16 failed on 02-18 vs. an average of 4.5 on 4 randomly selected nodes on 02-), suggesting it may be a node issue. I'll report this to RDHPCS and see if they notice/noticed any issues with that node.

WalterKolczynski-NOAA commented 1 month ago

Just want to note that I believe the Hercules failure here happened before stmp filled, so I think David is on the right track with a failing node.

RussTreadon-NOAA commented 1 month ago

Thank you @DavidHuber-NOAA for digging into the cause of the Hercules failure. Just curious: How did you know that more jobs than average failed on hercules-02-18? Is this information in a log file or obtained from a command we run?

RussTreadon-NOAA commented 1 month ago

@DavidHuber-NOAA . OK. I see in your RDHPCS email

sacct -a --start 053024 --end 053124 -o "JobID,JobName%60,State,NodeList%60" -N "hercules-02-18" | grep FAIL 

That's quite a mouthful. I didn't know about this combination of options with this command.
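For anyone else unfamiliar with the options, the command can be wrapped and annotated as follows. This is a sketch: failed_jobs_on_node and its argument order are hypothetical, while the sacct options are the ones quoted above.

```shell
# Hypothetical wrapper: failed_jobs_on_node <node> <start> <end>
# Prints the number of job records on <node> whose line contains FAIL.
failed_jobs_on_node() {
  local node="$1" start="$2" end="$3"
  # -a              include jobs from all users
  # --start/--end   time window (MMDDYY in the quoted command)
  # -o              output columns; %60 widens a column to 60 characters
  # -N              restrict to jobs that ran on the named node(s)
  sacct -a --start "${start}" --end "${end}" \
        -o "JobID,JobName%60,State,NodeList%60" -N "${node}" \
    | grep -c FAIL || true   # grep exits non-zero when the count is 0
}

# Example:
#   failed_jobs_on_node hercules-02-18 053024 053124
```

Running the wrapper for a few neighbouring nodes gives the kind of per-node comparison described earlier in the thread.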

guillaumevernieres commented 1 month ago

@WalterKolczynski-NOAA , @RussTreadon-NOAA , I believe this

ci/cases/pr/C48mx500_3DVarAOWCDA.yaml:  - wcoss2

will not work until the bufr library is updated

RussTreadon-NOAA commented 1 month ago

@guillaumevernieres , which version of bufr does ci/cases/pr/C48mx500_3DVarAOWCDA.yaml require?

The gdas.cd hash used in this PR loads bufr/12.0.1 on WCOSS2 - see modulefiles/GDAS/wcoss2.intel.lua

guillaumevernieres commented 1 month ago

Ha! never mind, I missed the memo again @RussTreadon-NOAA .

RussTreadon-NOAA commented 1 month ago

@WalterKolczynski-NOAA : Are we waiting for other g-w PRs to pass CI and be merged into develop or is there something I need to do with this PR to move it forward?

DavidHuber-NOAA commented 1 month ago

@RussTreadon-NOAA @WalterKolczynski-NOAA Since Renn asked that we try again, I'd suggest that we wait to see how the archiving CI test goes. If it passes on Hercules, then I would suggest opening this one up again.

emcbot commented 1 month ago

CI Update on Wcoss2 at 06/01/24 05:12:41 AM
============================================
Cloning and Building global-workflow PR: 2620
with PID: 34445 on host: clogin01

emcbot commented 1 month ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Sat Jun  1 05:21:08 UTC 2024 on clogin01
---------------------------------------------------
Build: Completed at 06/01/24 05:32:43 AM
Case setup: Completed for experiment C48_ATM_5bc05547
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_5bc05547
Case setup: Skipped for experiment C48_S2SWA_gefs_5bc05547
Case setup: Completed for experiment C48_S2SW_5bc05547
Case setup: Completed for experiment C96_atm3DVar_extended_5bc05547
Case setup: Skipped for experiment C96_atm3DVar_5bc05547
Case setup: Skipped for experiment C96_atmaerosnowDA_5bc05547
Case setup: Completed for experiment C96C48_hybatmDA_5bc05547
Case setup: Completed for experiment C96C48_ufs_hybatmDA_5bc05547

emcbot commented 1 month ago

Experiment C96C48_ufs_hybatmDA_5bc05547 FAIL on Wcoss2 at 06/01/24 06:18:28 AM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/RUNTESTS/COMROOT/C96C48_ufs_hybatmDA_5bc05547/logs/2024022400/gfsprepatmiodaobs.log

Follow link here to view the contents of the above file(s): (link)

RussTreadon-NOAA commented 1 month ago

@TerrenceMcGuinness-NOAA and @WalterKolczynski-NOAA

A check of gfsprepatmiodaobs.log shows the same error as before: script ush/run_bufr2ioda.py cannot be found. This file resides in sorc/gdas.cd, and link_workflow.sh links it to ush/ when gdas.cd/build is present.

#------------------------------
#--add GDASApp files
#------------------------------
if [[ -d "${HOMEgfs}/sorc/gdas.cd/build" ]]; then
  cd "${HOMEgfs}/ush" || exit 1
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/soca"                              .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ufsda"                              .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/jediinc2fv3.py"                     .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ioda/bufr2ioda/gen_bufr2ioda_json.py"    .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ioda/bufr2ioda/gen_bufr2ioda_yaml.py"    .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/ush/ioda/bufr2ioda/run_bufr2ioda.py"    .
  ${LINK_OR_COPY} "${HOMEgfs}/sorc/gdas.cd/build/bin/imsfv3_scf2ioda.py"           .
fi

A check of sorc/gdas.cd shows that the GDASApp was not built

russ.treadon@clogin01:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc/gdas.cd> ls
build.sh  bundle  ci  CMakeLists.txt  LICENSE  mains  modulefiles  parm  prototypes  README.md  scripts  sorc  test  ush  utils

A check of sorc/logs shows that build_gdas.log is not present

russ.treadon@clogin01:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc/logs> ls
build_gfs_utils.log  build_gsi_monitor.log  build_ufs.log        build_upp.log
build_gsi_enkf.log   build_gsi_utils.log    build_ufs_utils.log  build_ww3prepost.log

Before we run C96C48_ufs_hybatmDA, we must include the -u option when executing sorc/build_all.sh. Given that we are also exercising GSI-based DA, our build command should be build_all.sh -g -u. I like to add -v for verbose output, but doing so is optional.
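The two manual checks in this comment (no build directory under sorc/gdas.cd, no build_gdas.log under sorc/logs) can be combined into one guard that a CI driver could run before setting up the JEDI-based cases. This is a sketch only. gdas_build_present is a hypothetical name and takes the top of the clone as its argument.

```shell
# Hypothetical guard: succeed only if both signs of a GDASApp build exist.
gdas_build_present() {
  local homegfs="$1" status=0
  if [[ ! -d "${homegfs}/sorc/gdas.cd/build" ]]; then
    echo "sorc/gdas.cd/build not found"
    status=1
  fi
  if [[ ! -f "${homegfs}/sorc/logs/build_gdas.log" ]]; then
    echo "sorc/logs/build_gdas.log not found"
    status=1
  fi
  return "${status}"
}

# Example:
#   gdas_build_present /path/to/global-workflow || echo "GDASApp was not built"
```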

RussTreadon-NOAA commented 1 month ago

A grep "build_all.sh" in /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow finds build_all.sh commands in ./ci/cases/yamls/build.yaml

builds:
 - gefs: './build_all.sh -kw'
 - gfs: './build_all.sh -kgu'

The -u option is present for gfs builds. I also see system: gfs specified in C96C48_ufs_hybatmDA.yaml
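Because the build options are bundled (-kgu), a plain search for "-u" would miss them. The sketch below checks a bundled short-option string for a single flag. has_short_flag is a hypothetical helper and does not parse build.yaml itself.

```shell
# Hypothetical helper: has_short_flag "<command line>" <letter>
# Succeeds if any single-dash word in the command contains the letter.
has_short_flag() {
  local cmd="$1" flag="$2" word
  local -a words
  read -r -a words <<< "${cmd}"
  for word in "${words[@]}"; do
    case "${word}" in
      --*) ;;                     # long options are not bundles
      -*"${flag}"*) return 0 ;;   # bundled short options, e.g. -kgu
    esac
  done
  return 1
}

# Example:
#   has_short_flag './build_all.sh -kgu' u && echo "GDASApp build requested"
```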

russ.treadon@clogin01:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/ci/cases/pr> grep system -r .
./C48mx500_3DVarAOWCDA.yaml:  system: gfs
./C48_ATM.yaml:  system: gfs
./C48_S2SW.yaml:  system: gfs
./C96_atm3DVar_extended.yaml:  system: gfs
./C96_atmaerosnowDA.yaml:  system: gfs
./C96_atm3DVar.yaml:  system: gfs
./C48_S2SWA_gefs.yaml:  system: gefs
./C96C48_ufs_hybatmDA.yaml:  system: gfs
./C96C48_hybatmDA.yaml:  system: gfs

Thus, I cannot explain why GDASApp was not built on Cactus.

emcbot commented 1 month ago

CI Passed Hercules at
Built and ran in directory /work2/noaa/stmp/CI/HERCULES/2620

RussTreadon-NOAA commented 1 month ago

@TerrenceMcGuinness-NOAA and @WalterKolczynski-NOAA

git status in /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc returns

russ.treadon@clogin01:/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/sorc> git status .
Error cleaning LFS object: open /lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2620/global-workflow/.git/modules/sorc/gdas.cd/modules/sorc/crtm/lfs/tmp/576068607: no such file or directory
error: external filter 'git-lfs filter-process' failed
fatal: test/testinput/single_profile.nc4: clean filter 'lfs' failed
fatal: 'git status --porcelain=2' failed in submodule sorc/crtm
fatal: 'git status --porcelain=2' failed in submodule sorc/gdas.cd

The GDASApp clone requires git-lfs for JEDI components. Do the above errors result in automated CI abandoning the GDASApp build?

We have git-lfs/2.11.0 on WCOSS2. Loading it requires one of the following gcc compilers be loaded.

      gcc/10.3.0
      gcc/11.2.0
      gcc/12.1.0

My ~russ.treadon/.bashrc contains

module load gcc/12.1.0     # gcc is required to load git-lfs                                                                                 
module load git-lfs/2.11.0

Also, my ~russ.treadon/.gitconfig contains

[filter "lfs"]
        clean = git-lfs clean -- %f
        smudge = git-lfs smudge -- %f
        process = git-lfs filter-process
        required = true

Does the account under which automated CI runs on Cactus include the above? If not, should we add the above?
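One way to answer this for a given account is to inspect its git configuration file directly. The sketch below uses git config --file to look for the four filter.lfs settings quoted above. lfs_filter_configured is a hypothetical helper name, and the path passed to it is up to the caller.

```shell
# Hypothetical helper: lfs_filter_configured <gitconfig file>
# Succeeds if the file defines filter.lfs.clean, .smudge and .process,
# and sets filter.lfs.required to true.
lfs_filter_configured() {
  local cfg="$1" key
  for key in clean smudge process; do
    git config --file "${cfg}" --get "filter.lfs.${key}" > /dev/null || return 1
  done
  [[ "$(git config --file "${cfg}" --get filter.lfs.required)" == "true" ]]
}

# Example:
#   lfs_filter_configured "${HOME}/.gitconfig" || echo "LFS filter not configured"
```

This checks only the configuration file. Whether the git-lfs executable is on PATH after the module loads can be confirmed separately with command -v git-lfs.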