NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

GFSv16.3.? - GLDAS updates #2503

Open KateFriedman-NOAA opened 5 months ago

KateFriedman-NOAA commented 5 months ago

Description

NCO/SPA Justin Cooke is opening several Bugzilla tickets for updates needed to the GLDAS job in operations. These came out of recent issues in operations related to missing CPC gauge data.

Justin's comments:

1) We saw references in the code to an h2 emc space for this input data too:

export CPCGAUGE=${CPCGAUGE:-/lfs/h2/emc/global/noscrub/emc.global/dump}

Production jobs should not reference emc disk spaces. We'll be opening up a Bugzilla ticket for that. 

2) If the CPC data is missing, the job will fail. The warning message needs to be changed. Currently it is:

if [ ! -s $cpc ]; then
 echo "WARNING: GLDAS MISSING $cpc, WILL NOT RUN."
 exit 3
fi

It needs to be: 

if [ ! -s $cpc ]; then
 echo "FATAL ERROR: GLDAS MISSING $cpc, WILL NOT RUN."
 exit 3
fi

3) This job also runs at 06, 12, 18Z, but at those times it just reports this message:
0.319 + echo 'GLDAS only runs for 00 cycle; Skip GLDAS step for cycle 18'
GLDAS only runs for 00 cycle; Skip GLDAS step for cycle 18

Why does the job exist for those cycles?
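
For item 2, a minimal sketch of the same check wrapped in a function, with the path variable quoted so that an empty or space-containing value does not break the test. The function name `require_cpc_file` is hypothetical and not part of the GLDAS scripts; the message text and the status 3 follow the suggestion above.

```shell
# Hypothetical helper (not from the GLDAS scripts): returns status 3 and
# prints a FATAL ERROR message when the CPC gauge file is missing or empty.
require_cpc_file() {
  local cpc="$1"
  # -s is true only when the file exists and has a size greater than zero
  if [ ! -s "$cpc" ]; then
    echo "FATAL ERROR: GLDAS MISSING $cpc, WILL NOT RUN."
    return 3
  fi
  return 0
}
```

A caller would use it as `require_cpc_file "$cpc" || exit $?`, which keeps the exit status of 3 shown in the original snippet.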

Target version

v16.3.?? (TBD)

Expected workflow changes

Initial suggested changes:

1) So that EMC developers can still run outside of ops and use the dump data that we store in the global dump archive (that default emc space), we should set export CPCGAUGE=/lfs/h2/emc/global/noscrub/emc.global/dump in the dev-only config.base (config.base.emc.dyn). I am fine with Jiarui's suggestion to change the default in the script to something like /lfs/h1/ops/prod/com/gfs/v16.3. We can pass our dump archive path via the override and our config.base setting.

2) I agree with updating the error message. Let's get that changed to Justin's suggestion.

3) For ecflow in ops we can just remove the GLDAS job from the 06/12/18 job families and adjust job dependencies for those cycles to not wait for that job (the analysis job triggers).
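
The override in item 1 relies on the shell's `${VAR:-default}` expansion: the script-side default applies only when `CPCGAUGE` is unset or empty, so a value exported from config.base takes precedence. A minimal sketch using the two paths mentioned above; the function name `resolve_cpcgauge` is hypothetical, and the ops default shown is only the "something like" suggestion, not a confirmed final value.

```shell
# Hypothetical wrapper illustrating the default-with-override pattern.
resolve_cpcgauge() {
  # Script-side default: the suggested ops path (an assumption, not the
  # final choice). It is used only if CPCGAUGE is unset or empty.
  export CPCGAUGE="${CPCGAUGE:-/lfs/h1/ops/prod/com/gfs/v16.3}"
  echo "$CPCGAUGE"
}
```

With `export CPCGAUGE=/lfs/h2/emc/global/noscrub/emc.global/dump` set in the dev-only config before the script runs, the function reports the EMC dump archive path; with nothing set, it falls back to the ops path.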

KateFriedman-NOAA commented 5 months ago

Additional email from SPA Justin:

EMC,

Overnight the 00Z run of today's (April 17 2024) atmos/init/jgdas_atmos_gldas job failed due to a missing input file produced by CPC. 

You're probably aware that on Monday April 14 the NCWCP datacenter experienced a cooling failure that resulted in some equipment being damaged. Teams are working to recover the down systems, but the NCEP Centers located in College Park, including CPC, are impacted.

The jgdas_atmos_gldas job failed with this error:
WARNING: GLDAS MISSING /lfs/h1/ops/prod/dcom/20240415/wgrbbul/cpc_rcdas/PRCP_CU_GAUGE_V1.0GLB_0.125deg.lnx.20240415.RT, WILL NOT RUN

The last updated file in dcom on both systems is for April 14th:
  /lfs/h1/ops/prod/dcom/20240414/wgrbbul/cpc_rcdas/PRCP_CU_GAUGE_V1.0GLB_0.125deg.lnx.20240414.RT  

Here is the log output for the failed gldas init job:
 /lfs/h1/ops/prod/output/20240417/gdas_atmos_gldas_00.o127712973

We set the failed job complete last night to let the 00Z gdas forecast complete. It ran successfully and appears to have made all the necessary output files, except that we don't see the files that end with 'nc_bfgldas' in the /lfs/h1/ops/prod/com/gfs/v16.3/gdas.20240417/00/atmos/RESTART directory. What is the impact of those files not being generated?

We don't have a timeline for when the NCWCP datacenter will be back to full health so it's likely the cpc_rcdas data will remain unavailable. 

What is your recommendation for how we handle a failure of this job if it happens again tomorrow morning? 
Is setting it complete the best response?

Thanks,

Justin 
SPA Team

HelinWei-NOAA commented 5 months ago

Completed the changes for GLDAS and created the tag for the global-workflow

barlage commented 5 months ago

@HelinWei-NOAA is there a PR for these changes? I'm curious what the solution was.

HelinWei-NOAA commented 5 months ago

@barlage We have only made the changes for GLDAS. They are very minor. Kate will make the changes for the workflow.