NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Add the capability to use slurm reservation nodes #2627

Closed guoqing-noaa closed 1 month ago

guoqing-noaa commented 1 month ago

Description

Add the capability to use Slurm reservation nodes.
Add "ACCOUNT_SERVICE" for jobs that run in PARTITION_SERVICE.

Resolves #2626
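
For context, Slurm's `sbatch` accepts `--reservation` alongside `--account` and `--partition`. The snippet below is only a rough sketch of how a reservation could be attached to compute jobs while service-partition jobs use their own account; it is not the code in this PR. The helper `build_sbatch_args` and every account, partition, and reservation value are hypothetical; only the three Slurm option names are real.

```python
def build_sbatch_args(account, partition, reservation=None):
    """Return sbatch options; --reservation is added only when one is given."""
    args = [f"--account={account}", f"--partition={partition}"]
    if reservation:
        args.append(f"--reservation={reservation}")
    return args


# Hypothetical values, for illustration only.
compute = build_sbatch_args("my_account", "compute", reservation="my_resv")
# Service-partition job with a separate account and no reservation.
service = build_sbatch_args("my_service_account", "service")

print(" ".join(compute))
print(" ".join(service))
```

Keeping the reservation optional means sites without a reservation get the same options as before.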

Type of change

Change characteristics

How has this been tested?

Checklist

emcbot commented 1 month ago

CI Update on Wcoss2 at 05/29/24 06:28:14 PM
============================================
Cloning and Building global-workflow PR: 2627
with PID: 254221 on host: clogin01

emcbot commented 1 month ago

Automated global-workflow Testing Results:


Machine: Wcoss2
Start: Wed May 29 18:50:37 UTC 2024 on clogin01
---------------------------------------------------
Build: Completed at 05/29/24 07:03:35 PM
Case setup: Completed for experiment C48_ATM_2aad9b3e
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_2aad9b3e
Case setup: Skipped for experiment C48_S2SWA_gefs_2aad9b3e
Case setup: Completed for experiment C48_S2SW_2aad9b3e
Case setup: Completed for experiment C96_atm3DVar_extended_2aad9b3e
Case setup: Skipped for experiment C96_atm3DVar_2aad9b3e
Case setup: Skipped for experiment C96_atmaerosnowDA_2aad9b3e
Case setup: Completed for experiment C96C48_hybatmDA_2aad9b3e
Case setup: Skipped for experiment C96C48_ufs_hybatmDA_2aad9b3e

emcbot commented 1 month ago

Experiment C48_ATM FAILED on Hercules with error logs:

/work2/noaa/stmp/CI/HERCULES/2627/RUNTESTS/COMROOT/C48_ATM_2aad9b3e/logs/2021032312/gfsatmos_prod_f009-f015.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 1 month ago

Experiment C48_ATM FAILED on Hercules in /work2/noaa/stmp/CI/HERCULES/2627/RUNTESTS/C48_ATM_2aad9b3e

WalterKolczynski-NOAA commented 1 month ago

The log file looks like it was cut off mid-execution, without even a SIGTERM. The part we have looks fine. I may just retry Hercules.

WalterKolczynski-NOAA commented 1 month ago

Log on disk is complete. Ends like this:

End interp_atmos_sflux.sh at 19:42:42 with error code 0 (time elapsed: 00:00:01)
+ exglobal_atmos_products.sh[190]: export err=0
+ exglobal_atmos_products.sh[190]: err=0
+ exglobal_atmos_products.sh[190]: err_chk
 completed cleanly
+ exglobal_atmos_products.sh[193]: IFS=:
+ exglobal_atmos_products.sh[193]: read -ra grids
+ exglobal_atmos_products.sh[194]: for grid in "${grids[@]}"
+ exglobal_atmos_products.sh[195]: prod_dir=COM_ATMOS_GRIB_1p00
+ exglobal_atmos_products.sh[196]: /bin/cp -p sflux_f015_1p00 /work2/noaa/stmp/CI/HERCULES/2627/RUNTESTS/COMROOT/C48_ATM_2aad9b3e/gfs.20210323/12//products/atmos/grib2/1p00/gfs.t12z.flux.1p00.f015
+ exglobal_atmos_products.sh[197]: wgrib2 -s sflux_f015_1p00
+ exglobal_atmos_products.sh[202]: [[ YES == \Y\E\S ]]
+ exglobal_atmos_products.sh[203]: grp=
+ exglobal_atmos_products.sh[204]: ((  FORECAST_HOUR > 0 & FORECAST_HOUR <= FHMAX_WGNE  ))
+ exglobal_atmos_products.sh[206]: wgrib2 /work2/noaa/stmp/CI/HERCULES/2627/RUNTESTS/COMROOT/C48_ATM_2aad9b3e/gfs.20210323/12//products/atmos/grib2/0p25/gfs.t12z.pgrb2.0p25.f015 -d 597 -grib /work2/noaa/stmp/CI/HERCULES/2627/RUNTESTS/COMROOT/C48_ATM_2aad9b3e/gfs.20210323/12//products/atmos/grib2/0p25/gfs.t12z.wgne.f015

*** FATAL ERROR: record 597 not found ***

+ exglobal_atmos_products.sh[1]: postamble exglobal_atmos_products.sh 1717011684 8
+ preamble.sh[70]: set +x
End exglobal_atmos_products.sh at 19:42:42 with error code 8 (time elapsed: 00:01:18)
+ JGLOBAL_ATMOS_PRODUCTS[1]: postamble JGLOBAL_ATMOS_PRODUCTS 1717011674 8
+ preamble.sh[70]: set +x
End JGLOBAL_ATMOS_PRODUCTS at 19:42:42 with error code 8 (time elapsed: 00:01:28)
+ atmos_products.sh[1]: postamble atmos_products.sh 1717011391 8
+ preamble.sh[70]: set +x
End atmos_products.sh at 19:42:43 with error code 8 (time elapsed: 00:06:12)

Doesn't look like anything that would be related to this PR.
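
A trace like the one above can be triaged quickly by pulling out the `FATAL ERROR` lines and the `End ... with error code N` lines that report a nonzero code. This is a minimal sketch, assuming the postamble line format shown in this log; `summarize_log` is a hypothetical helper, not part of global-workflow.

```python
import re

# Matches postamble lines such as:
#   End exglobal_atmos_products.sh at 19:42:42 with error code 8 (...)
END_RE = re.compile(r"^End (\S+) at (\d{2}:\d{2}:\d{2}) with error code (\d+)")


def summarize_log(text):
    """Return (fatal_lines, failed_scripts) found in a job log."""
    fatal = []
    failed = []
    for line in text.splitlines():
        if "FATAL ERROR" in line:
            fatal.append(line.strip())
        match = END_RE.match(line)
        if match and int(match.group(3)) != 0:
            failed.append((match.group(1), int(match.group(3))))
    return fatal, failed


# Sample lines copied from the log above.
sample = """\
End interp_atmos_sflux.sh at 19:42:42 with error code 0 (time elapsed: 00:00:01)
*** FATAL ERROR: record 597 not found ***
End exglobal_atmos_products.sh at 19:42:42 with error code 8 (time elapsed: 00:01:18)
End JGLOBAL_ATMOS_PRODUCTS at 19:42:42 with error code 8 (time elapsed: 00:01:28)
"""

fatal, failed = summarize_log(sample)
print(fatal)
print(failed)
```

The first entry in `failed` is the innermost script that exited nonzero, which here is the one that ran the failing `wgrib2` command.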

WalterKolczynski-NOAA commented 1 month ago

I'm separately getting errors from cron, so I think Hercules is just having some issues right now.

emcbot commented 1 month ago

Experiment C48_ATM_2aad9b3e SUCCESS on Wcoss2 at 05/29/24 09:42:12 PM

emcbot commented 1 month ago

Experiment C48_S2SW_2aad9b3e SUCCESS on Wcoss2 at 05/29/24 09:48:13 PM

emcbot commented 1 month ago

Experiment C96C48_hybatmDA_2aad9b3e SUCCESS on Wcoss2 at 05/29/24 10:27:19 PM

emcbot commented 1 month ago

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2627

emcbot commented 1 month ago

CI Passed Orion at
Built and ran in directory /work2/noaa/stmp/CI/ORION/2627

emcbot commented 1 month ago

Experiment C96_atm3DVar_extended_2aad9b3e SUCCESS on Wcoss2 at 05/30/24 04:18:29 AM

emcbot commented 1 month ago

All CI Test Cases Passed on Wcoss2:


Experiment C48_ATM_2aad9b3e *** SUCCESS *** at 05/29/24 09:42:12 PM
Experiment C48_S2SW_2aad9b3e *** SUCCESS *** at 05/29/24 09:48:13 PM
Experiment C96C48_hybatmDA_2aad9b3e *** SUCCESS *** at 05/29/24 10:27:19 PM
Experiment C96_atm3DVar_extended_2aad9b3e *** SUCCESS *** at 05/30/24 04:18:29 AM

emcbot commented 1 month ago

CI Passed Hercules at
Built and ran in directory /work2/noaa/stmp/CI/HERCULES/2627

guoqing-noaa commented 1 month ago

Thanks, @WalterKolczynski-NOAA @aerorahul @DavidHuber-NOAA