NOAA-EMC / AQM

GNU General Public License v3.0

nexus_gfs_sfc jobs failed on Cactus after the dev machine was switched today #66

Closed JianpingHuang-NOAA closed 1 year ago

JianpingHuang-NOAA commented 1 year ago

@chan-hoo @bbakernoaa

I checked out the latest workflow on Cactus after the dev machine was switched today, and noticed that the nexus_gfs_sfc job failed and the jinja2 module was not loaded correctly.

Below is the error message:

Launching J-job (jjob_fp) for task "nexus_gfs_sfc" ...
  jjob_fp = "/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v7.0.55/jobs/JREGIONAL_NEXUS_GFS_SFC"

Traceback (most recent call last):
  File "/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v7.0.55/ush/config_utils.py", line 11, in <module>
    from python_utils import cfg_main
  File "/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v7.0.55/ush/python_utils/__init__.py", line 31, in <module>
    from .config_parser import (
  File "/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v7.0.55/ush/python_utils/config_parser.py", line 38, in <module>
    import jinja2
ModuleNotFoundError: No module named 'jinja2'

More details can be found in the example log file:

/lfs/h2/emc/ptmp/jianping.huang/emc.para/output/20220904/nexus_gfs_sfc_2022090418.id_1677709775.log.0

Thanks,

Jianping
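For anyone hitting the same traceback, a quick way to confirm whether the Python interpreter picked up by the job can see jinja2 is a small standalone check like the one below. This is only a sketch and not part of the workflow: the helper name `missing_modules` is hypothetical, and the list of required modules contains only `jinja2` because that is the one named in the traceback above. It would need to be run in the same module environment the task uses.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    # find_spec returns None when a top-level module is not importable.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: jinja2 is the only module of interest, as named in the traceback.
    required = ["jinja2"]
    print("python executable:", sys.executable)
    missing = missing_modules(required)
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("all required modules are importable")
```

Printing `sys.executable` alongside the result helps tell a missing package apart from the job picking up a different Python than expected.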

chan-hoo commented 1 year ago

@JianpingHuang-NOAA, that is strange. 'jinja2' is not needed for nexus_gfs_sfc, and the workflow doesn't load it for this task. In my test run the task worked well, although another task failed due to missing data files:

      CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202302280000           nexus_gfs_sfc                    48259072           SUCCEEDED                   0         1          18.0
202302280000       nexus_emission_00                    48259782             RUNNING                   -         0           0.0
202302280000       nexus_emission_01                    48259783             RUNNING                   -         0           0.0
202302280000       nexus_emission_02                    48259785             RUNNING                   -         0           0.0
202302280000        nexus_post_split                           -                   -                   -         -             -
202302280000           fire_emission                    48259799                DEAD                   1         2          33.0
202302280000            point_source                    48259074           SUCCEEDED                   0         1         266.0
202302280000           get_extrn_ics                    48259075           SUCCEEDED                   0         1          19.0
202302280000          get_extrn_lbcs                    48259076           SUCCEEDED                   0         1          22.0

JianpingHuang-NOAA commented 1 year ago

@chan-hoo I set up the near-real-time runs with the latest workflow. They run successfully, as you are seeing:

      CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202303010000           nexus_gfs_sfc                    48335129           SUCCEEDED                   0         1          36.0
202303010000       nexus_emission_00                    48335482             RUNNING                   -         0           0.0
202303010000       nexus_emission_01                    48335483             RUNNING                   -         0           0.0
202303010000       nexus_emission_02                    48335485             RUNNING                   -         0           0.0
202303010000        nexus_post_split                           -                   -                   -         -             -
202303010000           fire_emission                    48335131           SUCCEEDED                   0         1          48.0
202303010000            point_source                    48335132           SUCCEEDED                   0         1         275.0
202303010000           get_extrn_ics                    48335133           SUCCEEDED                   0         1          41.0
202303010000          get_extrn_lbcs                    48335134           SUCCEEDED                   0         1          40.0

Please see more details in /lfs/h2/emc/aqmtemp/para/c55/20230301/log.launch_FV3LAM_wflow

However, the problem persists for the retro runs. You will likely see the same issue if you set "DO_REAL_TIME: false" in your config.yaml.

chan-hoo commented 1 year ago

Got it. I'll take a look at it.

chan-hoo commented 1 year ago

@JianpingHuang-NOAA, I've added modulefiles/tasks/wcoss2/nexus_gfs_sfc.local.lua to the online-cmaq branch. It appears to work well. I am not sure why this modulefile is necessary, because the task worked without it on Dogwood and on Cactus before the switch. In any case, Cactus after the switch requires this file. Please let me know if you have any problems.
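Since the fix was a missing task-specific modulefile, it may be worth checking whether other tasks lack one too. The sketch below assumes every task's file follows the same `modulefiles/tasks/<machine>/<task>.local.lua` layout as the path above, which the thread does not confirm; the function name and the task list in the usage section are illustrative only.

```python
from pathlib import Path


def tasks_without_local_modulefile(repo_root, machine, tasks):
    """Return the tasks that have no <task>.local.lua under modulefiles/tasks/<machine>."""
    # Assumption: the directory layout mirrors the nexus_gfs_sfc path in this thread.
    task_dir = Path(repo_root) / "modulefiles" / "tasks" / machine
    return [t for t in tasks if not (task_dir / f"{t}.local.lua").is_file()]


if __name__ == "__main__":
    # Illustrative task names taken from the rocotostat output earlier in the thread.
    tasks = ["nexus_gfs_sfc", "fire_emission", "point_source"]
    for task in tasks_without_local_modulefile(".", "wcoss2", tasks):
        print("no task modulefile found for:", task)
```

Run from the top of a workflow checkout, this lists the tasks with no matching file. A missing file is not necessarily a problem, since not every task may need its own modulefile.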

JianpingHuang-NOAA commented 1 year ago

@chan-hoo The updated workflow works for retro runs now. Thanks again for the help. I am going to close the ticket.