NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Append .NO suffix to the mediator restart for the C48mx500 test case on Orion/Hercules and WCOSS #2769

Closed guillaumevernieres closed 1 month ago

guillaumevernieres commented 3 months ago

What is wrong?

The mediator restart on glopara Hera for the C48mx500 test case was suffixed with .NO to ensure that the model would not make use of it. The same thing needs to be done on Orion, Hercules and WCOSS2.

What should have happened?

On Orion, in the glopara directory

/work/noaa/global/glopara/data/ICSDIR/C48mx500/gdas.20210324/06/model_data/med/restart/

there are 2 files: 20210324.090000.ufs.cpld.cpl.r.nc and 20210324.090000.ufs.cpld.cpl.r.nc.NO. The unsuffixed file, 20210324.090000.ufs.cpld.cpl.r.nc, needs to be deleted.
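
The cleanup described above could be sketched as follows. This is only an illustration: the helper name `disable_restart` and its return strings are hypothetical and are not part of global-workflow. It removes the unsuffixed file when a .NO copy already exists, and otherwise renames it.

```python
from pathlib import Path


def disable_restart(restart_dir, name, suffix=".NO"):
    """Leave only the suffixed copy of `name` in `restart_dir`.

    Hypothetical helper for illustration. Returns a short string
    describing the action taken.
    """
    active = Path(restart_dir) / name
    disabled = active.with_name(active.name + suffix)
    if not active.exists():
        return "nothing to do"
    if disabled.exists():
        # A suffixed copy is already in place, so drop the active file.
        active.unlink()
        return "removed"
    # No suffixed copy yet: rename instead of deleting.
    active.rename(disabled)
    return "renamed"
```

On Orion this would be called with the restart directory given above and the name 20210324.090000.ufs.cpld.cpl.r.nc.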

What machines are impacted?

WCOSS2, Orion, Hercules

Steps to reproduce

N/A

Additional information

N/A

Do you have a proposed solution?

No response

guillaumevernieres commented 3 months ago

FYI @JessicaMeixner-NOAA and @CatherineThomas-NOAA

RussTreadon-NOAA commented 3 months ago

File /work/noaa/global/glopara/data/ICSDIR/C48mx500/gdas.20210324/06/model_data/med/restart/20210324.090000.ufs.cpld.cpl.r.nc still exists.

C48mx500_3DVarAOWCDA fails on Hercules due to the presence of this file.

@KateFriedman-NOAA or @WalterKolczynski-NOAA : Can file 20210324.090000.ufs.cpld.cpl.r.nc be removed from v?

guillaumevernieres commented 3 months ago

FYI @RussTreadon-NOAA , you won't be able to run the marine DA tasks successfully on Hercules/Orion. Some of the changes that bring us closer to that goal are here: https://github.com/NOAA-EMC/global-workflow/pull/2749 but even then, a YAML file has to be adjusted manually to point to our experimental obs.

KateFriedman-NOAA commented 2 months ago

@guillaumevernieres Question about the mediator restart 20210324.090000.ufs.cpld.cpl.r.nc file for the C48mx500_3DVarAOWCDA CI test... I am testing a revamped staging job and it's failing because that file is missing (because it's renamed with ".NO" at the end). Should I disable the mediator for now or is a change coming to resolve this?

See lines 108-110 in this snippet from my new staging job yaml file for what the job is looking for when DO_OCN=YES:

 22 {% set r_prefix = model_start_date_current_cycle | to_YMD + "." + model_start_date_current_cycle | strftime("%H") + "0000" %}
...
 94 {% if DO_OCN %}
 95 ocean:
 96     mkdir:
 97         - "{{ COMOUT_OCEAN_RESTART_PREV }}"
 98     copy:
 99         - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.MOM.res.nc", "{{ COMOUT_OCEAN_RESTART_PREV }}"]
100         {% if OCNRES == "025" %}
101             {% for nn in range(1, 3) %}
102         - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.MOM.res_{{ nn }}.nc", "{{ COMOUT_OCEAN_RESTART_PREV }}"]
103             {% endfor %}
104         {% endif %}
105         {% if REPLAY_ICS == "YES" %}
106         - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_ANALYSIS | relpath(ROTDIR) }}/{{ r_prefix }}.mom6_perturbation.nc", "{{ COMOUT_OCEAN_ANALYSIS }}/mom6_increment.nc"]
107         {% endif %}
108         {% if EXP_WARM_START == True %}
109         - ["{{ ICSDIR }}/{{ COMOUT_MED_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.ufs.cpld.cpl.r.nc", "{{ COMOUT_MED_RESTART_PREV }}"]
110         {% endif %}
111 {% endif %}
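
For readers unfamiliar with the template, line 22 of the snippet builds the restart file prefix from the model start date. A minimal Python sketch of the same string construction, assuming `to_YMD` formats a datetime as YYYYMMDD and assuming a model start of 2021-03-24 09Z (which reproduces the file name discussed in this thread):

```python
from datetime import datetime


def restart_prefix(start):
    """Return the 'YYYYMMDD.HH0000' prefix used in restart file names."""
    # Mirrors: to_YMD + "." + strftime("%H") + "0000"
    return start.strftime("%Y%m%d") + "." + start.strftime("%H") + "0000"


prefix = restart_prefix(datetime(2021, 3, 24, 9))
mediator_restart = f"{prefix}.ufs.cpld.cpl.r.nc"
print(mediator_restart)
```

With that prefix, line 109 looks for a mediator restart without the .NO suffix, which is why the staging job cannot find the file.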

guillaumevernieres commented 2 months ago

@KateFriedman-NOAA , the mediator file should be optional and I assume your refactoring should keep that functionality. Is there an option to not abort when syncing the file handler?
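
One way to express "optional" in a copy step is sketched below. This is a standalone illustration under stated assumptions: `copy_files` and its `optional` argument are hypothetical names and do not describe the file handler actually used by the staging job.

```python
import shutil
from pathlib import Path


def copy_files(pairs, optional=()):
    """Copy each (source, destination_dir) pair.

    Sources listed in `optional` are skipped when missing. Any other
    missing source raises FileNotFoundError. Returns the skipped sources.
    """
    optional = {str(Path(p)) for p in optional}
    skipped = []
    for src, dest_dir in pairs:
        src = Path(src)
        if not src.is_file():
            if str(src) in optional:
                # Optional input (e.g. the mediator restart): do not abort.
                skipped.append(str(src))
                continue
            raise FileNotFoundError(f"required file is missing: {src}")
        Path(dest_dir).mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest_dir)
    return skipped
```

Here the mediator restart would be passed in `optional`, while the MOM restart would stay required and still abort the job if it is missing.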

KateFriedman-NOAA commented 2 months ago

@guillaumevernieres Optional, got it, thanks!

RussTreadon-NOAA commented 2 months ago

The WCDA g-w CI test failed on WCOSS2 (Cactus) during the gdasmarinebmat job with the traceback

nid003046.cactus.wcoss2.ncep.noaa.gov 0:  MOM_in domain decomposition
whalo =    2, ehalo =    2, shalo =    2, nhalo =    2
  X-AXIS =    9   9   9   9
  Y-AXIS =    5   4   4   4
nid003046.cactus.wcoss2.ncep.noaa.gov 0: NOTE from PE     0: MOM_restart: MOM run restarted using : INPUT/MOM.res.nc
nid003046.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Variable not found: variable_att_exists: file:INPUT/MOM.res.nc- variable:

nid003046.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE     0: NetCDF: Variable not found: variable_att_exists: file:INPUT/MOM.res.nc- variable:

nid003046.cactus.wcoss2.ncep.noaa.gov 0: Image              PC                Routine            Line        Source
libifcoremt.so.5   000014CBAC47FD4A  tracebackqq_          Unknown  Unknown
libsoca.so         000014CBCA84BBBE  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
libsoca.so         000014CBCABCAD52  fms_io_utils_mod_         190  fms_io_utils.F90
libsoca.so         000014CBCA76F443  netcdf_io_mod_mp_         381  netcdf_io.F90
libsoca.so         000014CBCA76F4E5  netcdf_io_mod_mp_         465  netcdf_io.F90
libsoca.so         000014CBCA7A04E8  netcdf_io_mod_mp_        1187  netcdf_io.F90
libsoca.so         000014CBCB7F2528  mom_io_infra_mp_g         530  MOM_io_infra.F90
libsoca.so         000014CBCB0A64AF  mom_io_file_mp_ge        1230  MOM_io_file.F90
libsoca.so         000014CBCB0C1CA7  mom_restart_mp_re        1633  MOM_restart.F90
libsoca.so         000014CBCB1CCA91  mom_state_initial         538  MOM_state_initialization.F90
libsoca.so         000014CBCAD05B7B  mom_mp_initialize        2961  MOM.F90
libsoca.so         000014CBCA65096E  Unknown               Unknown  Unknown

I am running g-w built from g-w PR #2833. The AERO and UFSDA g-w CI run to completion. WCDA aborts as shown above. The log file with the failure is /lfs/h2/emc/da/noscrub/russ.treadon/COMROOT/prwcda/logs/2021032418/gdasmarinebmat.log on Cactus.

Two questions

  1. Is the gdasmarinebmat failure I am seeing related to this issue or should a new issue be opened?
  2. Have we successfully run WCDA g-w CI on Cactus using the current head, 336b78a, of g-w develop?

AndrewEichmann-NOAA commented 2 months ago

Regarding question 1, the marine bmat task recently had updates (for refactoring) in global-workflow, and I am encountering some bugs that are flushed out when trying to run with an ensemble (on Hera), but it's not clear to me why what you're seeing here would be confined to WCOSS.

CatherineThomas-NOAA commented 2 months ago

@AndrewEichmann-NOAA: Do you think this could be related to Issue #2797? The MOM_input file was updated, but only for the high resolution. Does it need to be updated for lower res as well? Counterpoint to this is that it should fail on Hera as well and the WCDA test passed for PR 2751.

RussTreadon-NOAA commented 2 months ago

@AndrewEichmann-NOAA , I have only run g-w CI on WCOSS2 (Cactus). I do not know if g-w WCDA CI runs on other machines. I found that env/WCOSS2.env does not contain entries for marine jobs. I added these entries in PR #2833. The fact that these entries are not in develop env/WCOSS2.env makes me wonder if we are ready to run g-w WCDA CI on WCOSS2.

AndrewEichmann-NOAA commented 2 months ago

@CatherineThomas-NOAA @RussTreadon-NOAA I'll have to dig deeper into this but I have been running the WCDA CI on Hera successfully, though it's possible that updating will catch something

RussTreadon-NOAA commented 2 months ago

@AndrewEichmann-NOAA , g-w WCDA CI works on Hera. I set it up this morning. All jobs ran to completion successfully:

Hera(hfe05):/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prwcda$ rocotostat -d prwcda.db -w prwcda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Aug 15 2024 13:36:19    Aug 15 2024 13:50:24
202103241800        Done    Aug 15 2024 13:36:19    Aug 15 2024 14:50:22

ci/cases/pr/C48mx500_3DVarAOWCDA.yaml from develop at 336b78a has

skip_ci_on_hosts:
  - wcoss2
  - gaea
  - orion
  - hercules

I should not try running g-w WCDA CI on WCOSS2. I should stick to Hera.
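
The host check implied by `skip_ci_on_hosts` can be sketched as below. How the CI scripts actually read this key is not shown in this thread, so `should_skip` is a hypothetical helper and the dictionary simply restates the list quoted above.

```python
def should_skip(case, host):
    """Return True when `host` is listed under the case's skip_ci_on_hosts."""
    skip_hosts = {h.lower() for h in case.get("skip_ci_on_hosts", [])}
    return host.lower() in skip_hosts


# Skip list from ci/cases/pr/C48mx500_3DVarAOWCDA.yaml at 336b78a.
case = {"skip_ci_on_hosts": ["wcoss2", "gaea", "orion", "hercules"]}

for host in ("hera", "wcoss2"):
    print(host, "skip" if should_skip(case, host) else "run")
```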

CatherineThomas-NOAA commented 2 months ago

@RussTreadon-NOAA @AndrewEichmann-NOAA The last that I heard about the WCDA test on WCOSS2 was that the C++ issue was resolved and that there was a push to get all the needed files on the machine. That conversation predates the discovery of the problems with the v17 cycling prototypes which took most of @guillaumevernieres's attention before he went on leave. I don't think everything's been sorted yet.

KateFriedman-NOAA commented 1 month ago

The original request in this issue has been completed. Please open new issues to address any related needs discussed above. Closing as complete.