Open KateFriedman-NOAA opened 5 months ago
Additional email from SPA Justin:
EMC,
Overnight the 00Z run of today's (April 17 2024) atmos/init/jgdas_atmos_gldas job failed due to a missing input file produced by CPC.
You're probably aware that on Monday April 14 the NCWCP datacenter experienced a cooling failure that resulted in some equipment being damaged. Teams are working to recover the down systems, but the NCEP Centers located in College Park are impacted, including CPC.
The jgdas_atmos_gldas job failed with this error:
WARNING: GLDAS MISSING /lfs/h1/ops/prod/dcom/20240415/wgrbbul/cpc_rcdas/PRCP_CU_GAUGE_V1.0GLB_0.125deg.lnx.20240415.RT, WILL NOT RUN
The last updated file in dcom on both systems is for April 14th:
/lfs/h1/ops/prod/dcom/20240414/wgrbbul/cpc_rcdas/PRCP_CU_GAUGE_V1.0GLB_0.125deg.lnx.20240414.RT
Here is the log output for the failed gldas init job:
/lfs/h1/ops/prod/output/20240417/gdas_atmos_gldas_00.o127712973
We set the failed job complete last night to let the 00Z gdas forecast complete. It ran successfully and appears to have made all the necessary output files, except that we don't see the files that end with 'nc_bfgldas' in the /lfs/h1/ops/prod/com/gfs/v16.3/gdas.20240417/00/atmos/RESTART directory. What is the impact of those files not being generated?
We don't have a timeline for when the NCWCP datacenter will be back to full health so it's likely the cpc_rcdas data will remain unavailable.
What is your recommendation for how we handle a failure of this job if it happens again tomorrow morning?
Is setting it complete the best response?
Thanks,
Justin
SPA Team
Completed the changes for GLDAS and created the tag for the global-workflow
@HelinWei-NOAA is there a PR for these changes? I'm curious what the solution was.
@barlage We have only made the change for GLDAS. They are very minor. Kate will make the change for workflow.
Description
NCO/SPA Justin Cooke is opening several bugzillas related to updates needed to the GLDAS job in operations. This came out of recent issues in operations related to missing CPC gauge data.
Justin's comments:
Target version
v16.3.?? (TBD)
Expected workflow changes
Initial suggested changes:
1) So that we can retain the ability for EMC developers to still run outside of ops and use the dump data that we store in the global dump archive (that default emc space), we should set
export CPCGAUGE=/lfs/h2/emc/global/noscrub/emc.global/dump
in the dev-only config.base (config.base.emc.dyn). I am fine with Jiarui's suggestion to change the default in the script to something like /lfs/h1/ops/prod/com/gfs/v16.3. We can pass our dump archive path via the override and our config.base setting.
2) I agree with updating the error message. Let's get that changed to Justin's suggestion.
3) For ecflow in ops we can just remove the GLDAS job from the 06/12/18 job families and adjust job dependencies for those cycles to not wait for that job (the analysis job triggers).
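For reference, the default-plus-override idea in item 1 could look roughly like the sketch below. It assumes the script picks up CPCGAUGE through an ordinary shell default expansion. The helper name resolve_cpcgauge is hypothetical, and the default path is only the one floated in this thread, not a confirmed value from the GLDAS script.

```shell
#!/bin/bash
# Sketch only: resolve CPCGAUGE with a script-side default that a
# config.base setting can override. resolve_cpcgauge is a hypothetical
# helper, not a function from the GLDAS scripts.
resolve_cpcgauge() {
  # Use the caller's CPCGAUGE if it is set and non-empty; otherwise fall
  # back to the ops-style default suggested in the discussion (assumed).
  echo "${CPCGAUGE:-/lfs/h1/ops/prod/com/gfs/v16.3}"
}

# In the script: keep any value exported earlier (e.g. by config.base).
export CPCGAUGE="$(resolve_cpcgauge)"
echo "CPCGAUGE=${CPCGAUGE}"
```

With this pattern, a dev run that sources config.base.emc.dyn with the export line from item 1 would keep the EMC dump archive path, while an ops run with no CPCGAUGE set would fall back to the script default.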