NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest

Prep job failures not captured on exit #691

Open KateFriedman-NOAA opened 2 years ago

KateFriedman-NOAA commented 2 years ago

Expected behavior

The job (prep.sh) would exit with the correct exit value thrown by the obsproc package scripts.

Current behavior

The job exits with error code 0 regardless of what happens within it.

Machines affected

All; the machine doesn't matter.

To Reproduce

Point to a wrong obsproc package (e.g., one that doesn't exist).

Detailed Description

Here is the bottom of a gdasprep.log on Hera that shows the job erroring in the obsproc package script but still exiting with status 0:

+ 13.467s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
+ 13.469s + [ -n '' ]
+ 13.469s + set -e
+ 13.469s + kill -n 9 291727
/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/scripts/exglobal_makeprepbufr.sh.ecf: line 81: 291727: Killed
+ 13.494s + errsc=265
+ 13.494s + [ 265 -ne 0 ]
+ 13.494s + exit 265
/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed                  $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf
+ 14s + eval err_gdas_makeprepbufr=137
++ 14s + err_gdas_makeprepbufr=137
+ 14s + eval '[[' '$err_gdas_makeprepbufr' -ne 0 ']]'
++ 14s + [[ 137 -ne 0 ]]
+ 13.470s + exit 7
+ 14s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
++ 0s + '[' -n '' ']'
++ 0s + set -e
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
++ 14s + hostname
++ 14s + date -u
+ 14s + echo ' h16c01  --  Wed Mar 16 17:52:41 UTC 2022'
+ 14s + '[' -n '' ']'
+ 14s + '[' -n '' ']'
+ 14s + '[' NO '!=' YES ']'
+ 14s + cd /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr
+ 14s + rm -rf /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314
+ 14s + date -u
Wed Mar 16 17:52:41 UTC 2022
+ 14s + exit
+ status=0
+ [[ 0 -ne 0 ]]
+ exit 0
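Note the mechanism visible at the end of the trace: the final exit on the j-job side carries no argument, so bash returns the status of the last command executed (here date -u, which succeeded), and prep.sh therefore sees status=0. A minimal demonstration of that shell behavior:

false      # a failing command sets $? to 1
date -u    # succeeds, resetting $? to 0
exit       # a bare exit returns the status of the last command: 0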

This part of jobs/rocoto/prep.sh isn't getting the error code coming out of JGLOBAL_PREP:

105     $HOMEobsproc_network/jobs/JGLOBAL_PREP
106     status=$?
107     [[ $status -ne 0 ]] && exit $status

Possible Implementation

Change lines 106 and 107 in jobs/rocoto/prep.sh to get the correct error code variable that comes out of JGLOBAL_PREP.
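A hedged sketch of that change, assuming JGLOBAL_PREP exports an error variable such as err_gdas_makeprepbufr (the trace above only shows it being set inside the j-job, so its visibility in prep.sh is an assumption):

$HOMEobsproc_network/jobs/JGLOBAL_PREP
status=$?
# Prefer the j-job's own error variable; the shell status can be masked
# by a bare exit, as seen in the log above.
err=${err_gdas_makeprepbufr:-$status}
[[ $err -ne 0 ]] && exit $err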

lgannoaa commented 2 years ago

@KateFriedman-NOAA from what you reported, error code 137 was successfully caught and the exit 7 was issued. The issue here is that the "kill -n 9" instruction has a syntax error.

/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf
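For reference, the usage message near the end of the log is what the bash builtin prints when it is invoked without a PID (the variable holding the PID was evidently empty at that point); the -n 9 flag itself is valid:

kill -n 9          # no PID given: prints the usage error seen in the log
kill -n 9 "$pid"   # sends SIGKILL to $pid ($pid is illustrative)
kill -9 "$pid"     # equivalent, more common form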

KateFriedman-NOAA commented 2 years ago

@lgannoaa While an exit code was caught coming out of one of the OBSPROC scripts, the final exit code in our prep.sh was 0, which is incorrect. See the final lines of the gdasprep.log shown above.

lgannoaa commented 2 years ago

@KateFriedman-NOAA , the error code was caught and the error exit was executed. That is correct behavior. The error-exit kill statement has the wrong syntax. I recommend contacting system support to find out why this error exit was not working:

++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]

CoryMartin-NOAA commented 8 months ago

@kevindougherty-noaa found that this is indeed still an issue. The Hera stmp disk quota was exceeded earlier in the week, which caused the prep job to produce incorrect results while still returning exit code 0. It was only four cycles later, when the fit2obs job ran, that it was discovered that the prepbufr file had not been generated.

aerorahul commented 8 months ago

@CoryMartin-NOAA We have identified the root cause of this issue. The failure is in the obsproc code base, and even though that job fails, it returns exit code 0. The prep job in the global-workflow does not examine the prepbufr file contents and relies on the obsproc j-job JOBSPROC_GLOBAL_PREP to provide the correct exit code. This has been raised with the obsproc developers.

ilianagenkova commented 8 months ago

I am tagging myself here so it gets on my "to do" list @ilianagenkova

DavidHuber-NOAA commented 6 months ago

On Hercules, the prepobs_prepdata executable is crashing due to a missing MKL library (it is only available on Orion), yet the gdasprep and gfsprep jobs continue processing afterward. It seems like a bug that the failure of prepobs_prepdata does not stop the processing of the *prep jobs. An example log file is available here: /work/noaa/global/dhuber/SAVELOGS/cycled_herc2/2021110900/gdasprep.log.

ilianagenkova commented 6 months ago

I started looking into this, but it's more complicated than simply checking the error status and not proceeding further. The code has some intentional "hard crashes" and "silent errors", so we need to understand the reasons for them before changing the code. For example, if a critical data set can't be processed in prepobs, the code crashes in order to get someone's attention. Not an elegant solution, but that's how it's done now.

DavidHuber-NOAA commented 6 months ago

@ilianagenkova That's good to know. I will mention that this particular problem caused a downstream failure of fit2obs, which fails because the prepbufr file is never generated (by prepobs_prepdata). Granted, fit2obs is a validation piece, but it does stop the cycling process, as archiving will not start until fit2obs finishes successfully. I wonder if fit2obs should be amended to finish its work if the prepbufr is missing.

KateFriedman-NOAA commented 6 months ago

@DavidHuber-NOAA In general we don't want the prepbufr file to be missing, but part of the problem is that the analysis runs without it (technically, we don't want it to). The problem documented in this issue means that if the prepbufr file isn't created, no one knows without checking, since jobs in the cycle don't fail until fit2obs (if it's on). So at the very least, for this issue, we'll want to check for prepbufr existence at the end of the prep job (in the workflow scripts) while waiting for prepobs/obsproc to make updates.
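A minimal sketch of such a guard at the end of the prep job, assuming the prepbufr is written under a ${COMOUT} directory with an ${OPREFIX} prefix (both variable names are illustrative, not taken from the actual script):

if [[ ! -s "${COMOUT}/${OPREFIX}prepbufr" ]]; then
    echo "FATAL ERROR: prepbufr was not produced; failing the prep job"
    exit 1
fi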

ilianagenkova commented 6 months ago

@KateFriedman-NOAA , if this is only a dev-run issue, one could default to using the production prepbufr file (if you don't want an experiment to stop) and send a notification (mailx) to the developer that something needs to be looked at. Just a thought...

WalterKolczynski-NOAA commented 6 months ago

We do want the experiment to stop. The issue is that prepobs isn't exiting with a non-zero code, so the workflow thinks it was successful and continues on. This makes identifying the root cause of downstream failures difficult, because the issue was actually in prepobs.

KateFriedman-NOAA commented 6 months ago

"one can default to using the production prepbufr file"

So for development we can't use the production prepbufr, because then the output from the experiment's prior cycle won't be included and you'd be resetting the experiment. We need the prepbufr generated each cycle in our experiments.