NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

gfsmetp jobs do not generate stats files on Hera and Hercules #2759

Closed. RussTreadon-NOAA closed this issue 1 month ago

RussTreadon-NOAA commented 1 month ago

What is wrong?

gfsmetp jobs run in the Hera testprepobs and Hercules testprepobsherc parallels (see issue #2694) finish with status=0 (success), but no metplus stats files are generated. For example, the Hera 2021122200/gfsmetpg2g1.log contains

gfsmetpg2g1.log:** WARNING: /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/testprepobs/testprepobs_2021122200.stat was not generated or zero size
gfsmetpg2g1.log:** WARNING: /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/pres/testprepobs/testprepobs_2021122200.stat was not generated or zero size
gfsmetpg2g1.log:** WARNING: /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/sfc/testprepobs/testprepobs_2021122200.stat was not generated or zero size

What should have happened?

gfsmetp jobs run in the WCOSS2 (Dogwood) parallel (see issue #2694) finish with status=0 (success) AND create non-zero-size metplus stats files. For example,

ls -l /lfs/h2/emc/global/noscrub/kate.friedman/archive/metplus_data/by_VSDB/grid2grid/anom/00Z/devcycprepobs/
total 180
-rw-r--r-- 1 kate.friedman global  59967 Jul  8 17:36 devcycprepobs_20211221.stat
-rw-r--r-- 1 kate.friedman global 119726 Jul  8 21:17 devcycprepobs_20211222.stat

Non-zero size stats files should also be created in the Hera and Hercules parallels.

What machines are impacted?

Hera, Hercules

Steps to reproduce

Set up and run the parallels described in issue #2694

Additional information

gfsmetp*log files on both Hera and Hercules contain srun: error messages. For example,

Hera(hfe03):/scratch2/NCEPDEV/stmp1/Kate.Friedman/comrot/testprepobs/logs/2021122200$ grep "srun: error" gfsmetp*log
gfsmetpg2g1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_job_scripts/poe_jobs1 invalid, no record for task id 1
gfsmetpg2g1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_job_scripts/poe_jobs2 invalid, no record for task id 1
gfsmetpg2g1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2g1.754908/grid2grid_step1/metplus_job_scripts/poe_jobs3 invalid, no record for task id 1
gfsmetpg2o1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2o1.693766/grid2obs_step1/metplus_job_scripts/poe_jobs1 invalid, no record for task id 1
gfsmetpg2o1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metpg2o1.693766/grid2obs_step1/metplus_job_scripts/poe_jobs2 invalid, no record for task id 1
gfsmetppcp1.log:srun: error: Configuration file /scratch2/NCEPDEV/stmp1/Kate.Friedman/RUNDIRS/testprepobs/metppcp1.694915/precip_step1/metplus_job_scripts/poe_jobs1 invalid, no record for task id 1

Do you have a proposed solution?

Why is the srun: error message generated? Would fixing this error result in metplus stats files being generated?

CatherineThomas-NOAA commented 1 month ago

Confirming that I am seeing the same thing in my C384 experiments, both hybrid ATM-only and 3DVar S2S. Those experiments use workflows from July 5 and July 3, respectively.

RussTreadon-NOAA commented 1 month ago

Added the following lines back to the metp section of config.resources

    export npe_metp_gfs=4
    export npe_node_metp_gfs=4

Reran the Hera 2024050400 gfsmetpg2g1 job. This time the job generated metplus stats files.

Hera(hfe07):/scratch1/NCEPDEV/da/role.jedipara/archive/metplus_data/by_VSDB/grid2grid$ ls -l */00Z/prjedi/
anom/00Z/prjedi/:
total 112
-rw-r--r-- 1 role.jedipara da 113803 Jul 11 18:30 prjedi_20240504.stat

pres/00Z/prjedi/:
total 308
-rw-r--r-- 1 role.jedipara da 313780 Jul 11 18:30 prjedi_20240504.stat

sfc/00Z/prjedi/:
total 168
-rw-r--r-- 1 role.jedipara da 171584 Jul 11 18:30 prjedi_20240504.stat

For details see /scratch1/NCEPDEV/stmp2/role.jedipara/COMROOT/prjedi/logs/2024050400/gfsmetpg2g1.log.

The _gfs lines were removed from the metp section of config.resources at b902c0b. See g-w PR #2672 for details.

DavidHuber-NOAA commented 1 month ago

@RussTreadon-NOAA Interesting. The issue appears to reside in verif-global.fd/ush/run_verif_global_in_global_workflow.sh, line 252:

export nproc=${npe_node_metp_gfs:-1}

This should probably be looking for npe_metp rather than npe_node_metp. Also, this should probably be a variable exported in the env files rather than cross-referencing a resource variable from the global-workflow. I will make updates to verif-global and the global-workflow and then test this.
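
Roughly, the one-line change would look like the sketch below (illustrative only; npe_metp is the replacement variable suggested here, and the env-file approach mentioned above may supersede it):

# verif-global.fd/ush/run_verif_global_in_global_workflow.sh, line 252
# current:
export nproc=${npe_node_metp_gfs:-1}
# sketch of the suggested change: use the metp task's PE count,
# since the _gfs resource variables no longer exist in config.resources
export nproc=${npe_metp:-1}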

RussTreadon-NOAA commented 1 month ago

@DavidHuber-NOAA , your solution works. I made the recommended changes in a working copy of g-w on Hera.

The prjedi 2024050400 gfsmetp jobs were rewound and rebooted. Each successfully ran to completion with metplus stats files generated.

CatherineThomas-NOAA commented 1 month ago

One thing that concerns me is that the metp jobs did not produce any stat files, yet the jobs did not fail and were marked "SUCCEEDED". Is there an easy way we can improve the scripts to catch this in the future?

RussTreadon-NOAA commented 1 month ago

I agree, @CatherineThomas-NOAA. The error code 0 status for the failed gfsmetp jobs is troubling. I only noticed this silent fail because I am running a short ATM JEDI parallel and was checking all the output to ensure jobs ran correctly.

The failed gfsmetp job logs contain warnings that no stats files were produced. For example, the 2024050400 gfsmetpg2g1.log.3 contains

+ exgrid2grid_step1.sh[106]: python /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py
BEGIN: copy_stat_files.py
**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1223238/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/prjedi/prjedi_2024050400.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1223238/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/pres/prjedi/prjedi_2024050400.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1223238/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/sfc/prjedi/prjedi_2024050400.stat was not generated or zero size
**************************************************

END: copy_stat_files.py

Script copy_stat_files.py could be modified to generate an error code when the above situation is encountered. That being said, perhaps there are reasons this is not done in the current version of copy_stat_files.py. I don't know the development history of, nor the use cases for, copy_stat_files.py.

Tagging @malloryprow for awareness.

malloryprow commented 1 month ago

Hi @RussTreadon-NOAA, I think we had the gfsmetp jobs fail silently so they didn't hold up the rest of the workflow if things failed. I think this is something that was discussed a long time ago.

Did you find why the jobs failed?

RussTreadon-NOAA commented 1 month ago

@malloryprow , yes @DavidHuber-NOAA found the reason for the failure (see above).

The danger with silent failures is that a developer could run a parallel assuming that gfsmetp success means stats are generated, only to find that the stats are missing when (s)he goes to plot them. Of course, generating the stats after the fact isn't hard. It just takes additional time.

malloryprow commented 1 month ago

Ah missed those! I definitely get not wanting the silent failure. It should be something easily fixed. If no stat files were copied, exit with an error code. Does that sound appropriate?
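
For illustration, a check along these lines could be appended near the end of copy_stat_files.py (a sketch only; the directory layout follows the examples in this issue, and the environment-variable fallbacks are placeholders rather than the script's actual logic):

import glob
import os
import sys

# Hypothetical inputs for illustration; copy_stat_files.py derives these itself.
arcdir = os.environ.get("ARCDIR", ".")
pslot = os.environ.get("PSLOT", "testprepobs")
pdy = os.environ.get("PDY", "20211222")

# If no stat files at all were copied for this experiment and date, exit nonzero.
dest_pattern = os.path.join(arcdir, "metplus_data", "by_VSDB",
                            "*", "*", "*Z", pslot, f"{pslot}_{pdy}.stat")
copied = glob.glob(dest_pattern)
if not copied:
    print(f"ERROR: No stat files matching {dest_pattern}")
    sys.exit(1)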

DavidHuber-NOAA commented 1 month ago

I have a couple of ideas on how to resolve this and maintain the ability to push past failed METplus jobs:

  1. Add a configuration flag, say CHECK_METplus, that will cause a failure in the METplus jobs if the expected data file is missing or size 0 (perhaps more robust checking could be added as well for valid entries).
  2. A file check could be added to the cleanup rocoto XML so users would receive an email if there was a failure. Something like:
<sh shell="/bin/sh"><cyclestr>&HOMEgfs;/ush/check_metp.sh &ARCDIR;/metplus_data/by_VSDB/grid2grid/anom/@HZ/&pslot;/&pslot;_@Y@m@d.stat</cyclestr></sh>

check_metp.sh would then consist of something like the following:

filename=${1:-""}
[ -z "${filename}" ] && echo "METplus filename is an empty string" && exit 0
[ ! -f "${1}" ] && echo "METplus output file $1 does not exist!" && exit 0
[ -f "$1" ] && echo "METplus output file $1 is zero-sized!" && exit 0
malloryprow commented 1 month ago

I can see that being a helpful avenue. Is that something that would run at the end of every cycle? I think checks would be helpful for the grid2obs and precip stat files too.

DavidHuber-NOAA commented 1 month ago

@malloryprow Yes, it would run at the end of every cycle after all jobs for a particular RUN (gdas, gfs) are complete. It would only be executed if METplus was part of that RUN. I agree, it would be helpful for the other METplus products.

malloryprow commented 1 month ago

I think it may be better to include this in copy_stat_files.py. The metp tasks don't produce anything for the gdas cycles or for every gfs cycle.

DavidHuber-NOAA commented 1 month ago

OK, noted. I could code it so that it only runs for gfs cycles. Is it predictable which gfs cycles will produce data?

malloryprow commented 1 month ago

Having to dust some cobwebs off my brain here since it has been a good while since I have run anything with the global workflow... If only one gfs cycle is being run, it will run for that one every time; if more than one is being run, they should all run at the 00Z cycle. Kind of weird, but when I first put the gfsmetp tasks in, this is how VSDB was set up to run, so I copied that "style".

DavidHuber-NOAA commented 1 month ago

Interesting. In that case, I agree that putting something in the XML would be quite challenging and that placing a check in copy_stat_files.py instead makes sense. A failed status could then be issued if desired.

malloryprow commented 1 month ago

@RussTreadon-NOAA Would you be able to copy /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/get_data_files.py on Hera into the needed location, and then rerun the gfsmetp tasks in the configuration it was in when it failed?

RussTreadon-NOAA commented 1 month ago

@malloryprow , I no longer have a g-w configuration which fails. My local copy of g-w contains the changes @DavidHuber-NOAA recommended.

malloryprow commented 1 month ago

Ah okay. The changes I made worked when running standalone, but I wanted confirmation that the desired behavior also happens in the global workflow.

CatherineThomas-NOAA commented 1 month ago

@malloryprow I still have my test that I can set to run another few cycles. I'll copy that file and let you know how it goes.

CatherineThomas-NOAA commented 1 month ago

@malloryprow: Is it get_data_files.py or copy_stat_files.py that I should be copying? The file listed is the same as the version in my clone.

RussTreadon-NOAA commented 1 month ago

@malloryprow . I broke my configuration by reverting @DavidHuber-NOAA 's changes. I reran the 2024050500 gfsmetpg2g1. Warning messages were printed but the job still completed with error code 0.

I then realized that your comment refers to /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/get_data_files.py, not copy_stat_files.py.

I copied /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stat_files.py to my HOMEgfs and reran 2024050500 gfsmetpg2g1. Interestingly, the job still finished with error code 0 even though warning messages were written to the log file:

+ exgrid2grid_step1.sh[106]: python /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py
BEGIN: copy_stat_files.py
**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1632198/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1632198/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/pres/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1632198/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/sfc/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

END: copy_stat_files.py
+ exgrid2grid_step1.sh[107]: status=0

I looked at the changes you made to copy_stat_files.py.

Your modified script checks the destination directory for stat files. Since my previous run with Dave's fixes worked, the stat files are already in the destination directory. I moved the existing stat files to a new name and reran the job. The gfsmetpg2g1 job still finished with error code 0.

I looked again at your modified copy_stat_files.py. It uses a wildcard to check for any stat files in the destination directory, and I had failed to move all of the 20240505 stat files out of that directory. I did so and tried one more time.

This time gfsmetpg2g1 failed with

+ exgrid2grid_step1.sh[106]: python /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py
BEGIN: copy_stat_files.py
**************************************************
** WARNING: /scratch1/NCEPDEV/stmp2/role.jedipara/RUNDIRS/prjedi/metpg2g1.1174486/grid2grid_step1/metplus_output/gather_by_VSDB/stat_analysis/anom/prjedi/prjedi_2024050500.stat was not generated or zero size
**************************************************

ERROR: No stat files matching /scratch1/NCEPDEV/da/role.jedipara/archive/prjedi/../metplus_data/by_VSDB/*/*/*Z/prjedi/prjedi_20240505.stat
+ exgrid2grid_step1.sh[107]: status=1

We should change the logic in copy_stat_files.py to check for each expected output file, not just any file. This complicates your scripting but ensures nothing slips through the cracks.
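
Something along the lines of the sketch below, which checks each expected file individually (illustrative only; the verification types and paths are taken from the grid2grid example above, and the real script would build this list from its own configuration):

import os
import sys

# Hypothetical values for illustration; copy_stat_files.py knows the real
# destination directory, experiment name, date, and verification types.
arcdir = os.environ.get("ARCDIR", ".")
pslot = os.environ.get("PSLOT", "prjedi")
pdy = os.environ.get("PDY", "20240505")
hh = "00"
vtypes = ["anom", "pres", "sfc"]  # expected for grid2grid_step1

# Check every expected output file individually, not just "any stat file".
missing = []
for vtype in vtypes:
    stat_file = os.path.join(arcdir, "metplus_data", "by_VSDB", "grid2grid",
                             vtype, f"{hh}Z", pslot, f"{pslot}_{pdy}.stat")
    if not os.path.isfile(stat_file) or os.path.getsize(stat_file) == 0:
        missing.append(stat_file)

if missing:
    for stat_file in missing:
        print(f"ERROR: expected stat file {stat_file} was not generated or is zero size")
    sys.exit(1)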

@DavidHuber-NOAA recommends a change to verif-global.fd/ush/run_verif_global_in_global_workflow.sh. Do you have this change?

malloryprow commented 1 month ago

@malloryprow: Is it get_data_files.py or copy_stat_files.py that I should be copying? The file listed is the same as the version in my clone.

Ah sorry @CatherineThomas-NOAA. You're right, it should be copy_stat_files.py.

malloryprow commented 1 month ago

@RussTreadon-NOAA Thanks for testing. I see what you are saying. Let me add that in and update you.

And yes, I see:

@RussTreadon-NOAA Interesting. The issue appears to reside in verif-global.fd/ush/run_verif_global_in_global_workflow.sh, line 252:

export nproc=${npe_node_metp_gfs:-1}

This should probably be looking for npe_metp rather than npe_node_metp. Also, this should probably be a variable exported in the env files rather than cross-referencing a resource variable from the global-workflow. I will make updates to verif-global and the global-workflow and then test this.

malloryprow commented 1 month ago

@RussTreadon-NOAA @CatherineThomas-NOAA

/scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stats_files.py updated! Apologies for stating the wrong file earlier.

RussTreadon-NOAA commented 1 month ago

@malloryprow , your updated /scratch1/NCEPDEV/global/Mallory.Row/VRFY/EMC_verif-global/ush/copy_stat_files.py (no "s" after "stat") fails with

Traceback (most recent call last):
  File "/scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd/ush/copy_stat_files.py", line 1\
35, in <module>
    sys.exit(1)
    ^^^
NameError: name 'sys' is not defined

I added import sys after import glob at the top of copy_stat_files.py. A rerun successfully detected one missing stat file in my test.

malloryprow commented 1 month ago

Ah! Fixed that. Glad it detected the missing file.

RussTreadon-NOAA commented 1 month ago

@malloryprow , the local modifications to EMC_verif-global made in light of the above discussion are on Hera in /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/sorc/verif-global.fd. Two files are modified: ush/copy_stat_files.py and ush/run_verif_global_in_global_workflow.sh.

@DavidHuber-NOAA , the change to ush/run_verif_global_in_global_workflow.sh requires the addition of nproc in a new metp section of g-w env/${machine}.env files. An example of the change is in env/HERA.env in /scratch1/NCEPDEV/da/role.jedipara/git/global-workflow/test_hera/env.
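
Roughly, the new metp block looks like the sketch below (illustrative only; it assumes the usual per-step structure of the env files, and the actual block is in the working copy referenced above):

if [[ "${step}" = "metp" ]]; then
    # Exported for EMC_verif-global's run_verif_global_in_global_workflow.sh,
    # so verif-global no longer cross-references a g-w resource variable
    export nproc=${npe_metp:-1}
fi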

malloryprow commented 1 month ago

Thanks @RussTreadon-NOAA! I got the changes into the develop branch of EMC_verif-global at https://github.com/NOAA-EMC/EMC_verif-global/commit/711837108c8ea48e31c8225c2edb2af82a032200.

@KateFriedman-NOAA updated the hash for EMC_verif-global :) The new hash is different from the commit I linked because it also includes changes for running EMC_verif-global standalone on Orion following the Rocky upgrade. The new hash has the needed changes.

DavidHuber-NOAA commented 1 month ago

@malloryprow I left a comment in the commit you referenced.