dtcenter / METplus

Python scripting infrastructure for MET tools.
https://metplus.readthedocs.io
Apache License 2.0

Bugfix: METplus GridStat can't find OBS file using template and time range #2413

Closed MarcelCaron-NOAA closed 10 months ago

MarcelCaron-NOAA commented 10 months ago


Describe the Problem

A METplus error occurs when this EVS/cam job is run on NCEP's WCOSS2 system in one environment (named "test_ecflow_env" for testing here). However, the same job is successful when run in a different environment ("test_parallel_env"). In the first case, METplus is unable to find a file that exists when checked manually. Could you help us debug?

Attached are copies of (1) the stdout log for the EVS job that failed, (2) the METplus log file of the failed METplus run, and (3) two test scripts, described below and in the "To Reproduce" section: EVS_fail_job.txt, METplus_fail_log.txt, test_ecflow_env.txt, and test_parallel_env.txt.

Here is an excerpt from the stdout log for the EVS job that included that METplus run:

5996 0 + '[' -s /lfs/h2/emc/vpppg/noscrub/emc.vpppg/com/evs/v1.0/prep/cam/mrms.20231107/conus/EchoTop18_00.50_20231107-150000.G227.nc ']'
...
 6008 0 + obs_found=1

The above excerpt is a section of our code that checks for the presence of an observation file that will be used in METplus, a check which was successful in this case. To confirm, a list command for the file showed that the file does exist on the system (WCOSS2/Dogwood).

However, the METplus run that followed the excerpted code outputs an error. The associated METplus log is attached, but here is an excerpt:

  32 11/08 18:36:50.387 metplus (command_builder.py:694) DEBUG: Looking for OBS_INPUT files under /lfs/h2/emc/vpppg/noscrub/emc.vpppg/com/evs/v1.0/prep/cam within range [-300,300] using template mrms.{valid?fmt=%Y%m%d}/conus/EchoTop18_00.50_{valid?fmt=%Y%m%d}-{valid?fmt=%H}0000.G227.nc
  33 11/08 18:36:50.792 metplus (command_builder.py:257) ERROR: (command_builder.py:741) Could not find OBS_INPUT files under /lfs/h2/emc/vpppg/noscrub/emc.vpppg/com/evs/v1.0/prep/cam within range [-300,300] using template mrms.{valid?fmt=%Y%m%d}/conus/EchoTop18_00.50_{valid?fmt=%Y%m%d}-{valid?fmt=%H}0000.G227.nc
  34 11/08 18:36:50.793 metplus (command_builder.py:257) ERROR: (command_builder.py:490) Could not find observation file

The valid time in this case is 2023110715, so METplus should have been able to find the file.

We also confirmed that (1) the observation file was already available before METplus was run, at about 11/08 18:15Z and (2) the valid time stored in the observation file is 11/07/2023 15:00:38Z, which should have satisfied the matching condition in METplus.
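To make the expectation above concrete, here is a minimal Python sketch that expands the filename template from the log for a given valid time and lists the relative paths consistent with the [-300,300] second window. It only illustrates the template arithmetic and is not METplus's actual search code; the helper names `expand_template` and `candidate_paths` and the 60-second step are assumptions made for this example.

```python
import re
from datetime import datetime, timedelta

# Template copied from the METplus log excerpt above.
TEMPLATE = ("mrms.{valid?fmt=%Y%m%d}/conus/"
            "EchoTop18_00.50_{valid?fmt=%Y%m%d}-{valid?fmt=%H}0000.G227.nc")

# Matches only the {valid?fmt=...} tags that appear in this template.
_TAG = re.compile(r"\{valid\?fmt=([^}]+)\}")


def expand_template(template: str, valid: datetime) -> str:
    """Replace each {valid?fmt=...} tag with the formatted valid time."""
    return _TAG.sub(lambda m: valid.strftime(m.group(1)), template)


def candidate_paths(template, valid, window_begin=-300, window_end=300, step=60):
    """Distinct relative paths for times in the window (hypothetical helper).

    The 60-second step is an assumption for illustration only.
    """
    paths = []
    for offset in range(window_begin, window_end + 1, step):
        path = expand_template(template, valid + timedelta(seconds=offset))
        if path not in paths:
            paths.append(path)
    return paths


if __name__ == "__main__":
    valid = datetime(2023, 11, 7, 15)
    print(expand_template(TEMPLATE, valid))
    for path in candidate_paths(TEMPLATE, valid):
        print(path)
```

For the valid time 2023110715, the expanded template reproduces the relative path of the file that the EVS job found under the OBS_INPUT directory, which is why the file was expected to match.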

I've set up two test cases ("test_ecflow_env" and "test_parallel_env") with slightly different environments, one in which the error described here can be replicated, the other in which the same METplus run is successful.
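Since the two test cases differ only in their environments, one way to narrow things down is to dump the environment in each script (for example with `env | sort > some_file.txt` just before METplus is called) and compare the two dumps. The sketch below is a rough illustration of that comparison; the function names and the sample variables are hypothetical and are not taken from the attached scripts.

```python
def parse_env(text: str) -> dict:
    """Parse `env`-style output (NAME=value per line) into a dict.

    Lines without '=' are skipped, so continuation lines of multi-line
    values (e.g. exported shell functions) are not handled here.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(first: dict, second: dict) -> dict:
    """Return {name: (value_in_first, value_in_second)} for differing names.

    A value of None means the variable is unset in that environment.
    """
    names = sorted(set(first) | set(second))
    return {
        name: (first.get(name), second.get(name))
        for name in names
        if first.get(name) != second.get(name)
    }


if __name__ == "__main__":
    # Hypothetical sample dumps; replace with the contents of the real files.
    ecflow_dump = "HOME=/u/someone\nEXAMPLE_VAR=a\n"
    parallel_dump = "HOME=/u/someone\nEXAMPLE_VAR=b\nEXTRA_VAR=1\n"
    for name, (a, b) in diff_env(parse_env(ecflow_dump), parse_env(parallel_dump)).items():
        print(f"{name}: ecflow={a!r} parallel={b!r}")
```

Any variable that shows up in the diff and is also read by the METplus configuration would be a candidate to check first.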

I'm wondering if any part of our configuration jumps out to you as a possible cause of the issue?

Expected Behavior

The observation file should satisfy the METplus check for files that match the template within the valid hour range (11/07/2023 15Z +/- 5 mins). That observation file should be used to complete the METplus run successfully (the forecast file having been already found successfully). In other words, the "test_parallel_env" test results represent the expected behavior.

Environment

Describe your runtime environment:

  1. Machine: WCOSS2, Dogwood
  2. OS: Linux
  3. Here are the modules loaded into the environment from which METplus was run in both test cases:
    1) craype-x86-rome     (H)  11) ve/evs/1.0         21) hdf5/1.10.6
    2) libfabric/1.11.0.0. (H)  12) cray-mpich/8.1.19  22) netcdf/4.7.4
    3) craype-network-ofi  (H)  13) cray-pals/1.2.2    23) nco/4.9.7
    4) envvar/1.0               14) cfp/2.0.4          24) prod_util/2.0.13
    5) prod_envir/2.0.6         15) libjpeg/9c         25) cdo/1.9.8
    6) intel/19.1.3.304         16) libpng/1.6.37      26) grib_util/1.2.4
    7) PrgEnv-intel/8.3.3       17) zlib/1.2.11        27) wgrib2/2.0.8
    8) craype/2.7.17            18) jasper/2.0.25      28) met/11.0.2
    9) geos/3.8.1               19) udunits/2.2.28     29) metplus/5.0.1
    10) proj/7.1.0               20) gsl/2.7

To Reproduce

Describe the steps to reproduce the behavior:

Two very similar tests may help isolate the problem. The first fails and the second succeeds. To test:

  1. Log onto WCOSS2 Dogwood (prod) and cd into your work space (a test directory on ptmp or stmp will work).
  2. Transfer the following files (attached here) from your local computer to your work space on WCOSS2 (Dogwood):
       test_ecflow_env.txt
       test_parallel_env.txt
  3. Convert the files to bash scripts:
       cp test_ecflow_env.txt test_ecflow_env.sh
       cp test_parallel_env.txt test_parallel_env.sh
  4. Make sure the files are user-executable:
       chmod u+x test_ecflow_env.sh
       chmod u+x test_parallel_env.sh
  5. Open each test file (vi test_*_env.sh) and edit the section below #### EDIT BELOW FOR TESTING #### as needed.
  6. Run each test script, redirecting the output:
       test_ecflow_env.sh &> test1.out
       test_parallel_env.sh &> test2.out
     (Note: these will write two data directories for METplus to use.)
  7. test_ecflow_env should output errors, failing to find the observation file. test_parallel_env should succeed.

If testing this on WCOSS2/Dogwood (currently the production machine) is not possible, or if you'd prefer that the test data be provided here, please let me know and I can ask someone to copy it over. Thanks.

Relevant Deadlines

The relevant EVS code will be delivered at COB 11/17/2023.

Funding Source

NOAA/NCEP/EMC/VPPPG
