Closed guillaumevernieres closed 1 month ago
FYI @JessicaMeixner-NOAA and @CatherineThomas-NOAA
File /work/noaa/global/glopara/data/ICSDIR/C48mx500/gdas.20210324/06/model_data/med/restart/20210324.090000.ufs.cpld.cpl.r.nc
still exists.
C48mx500_3DVarAOWCDA fails on Hercules due to the presence of this file.
@KateFriedman-NOAA or @WalterKolczynski-NOAA : Can file 20210324.090000.ufs.cpld.cpl.r.nc
be removed from v
?
FYI @RussTreadon-NOAA , you won't be able to successfully run the marine DA tasks on Hercules/Orion. Some of the changes that bring us closer to that goal are here: https://github.com/NOAA-EMC/global-workflow/pull/2749 but even then, a yaml has to be adjusted manually to point to our experimental obs.
@guillaumevernieres Question about the mediator restart file 20210324.090000.ufs.cpld.cpl.r.nc for the C48mx500_3DVarAOWCDA CI test: I am testing a revamped staging job and it's failing because that file is missing (it has been renamed with ".NO" at the end). Should I disable the mediator for now, or is a change coming to resolve this?
See lines 108-110 in this snippet from my new staging job yaml file for what the job is looking for when DO_OCN=YES:
22 {% set r_prefix = model_start_date_current_cycle | to_YMD + "." + model_start_date_current_cycle | strftime("%H") + "0000" %}
...
94 {% if DO_OCN %}
95 ocean:
96 mkdir:
97 - "{{ COMOUT_OCEAN_RESTART_PREV }}"
98 copy:
99 - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.MOM.res.nc", "{{ COMOUT_OCEAN_RESTART_PREV }}"]
100 {% if OCNRES == "025" %}
101 {% for nn in range(1, 3) %}
102 - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.MOM.res_{{ nn }}.nc", "{{ COMOUT_OCEAN_RESTART_PREV }}"]
103 {% endfor %}
104 {% endif %}
105 {% if REPLAY_ICS == "YES" %}
106 - ["{{ ICSDIR }}/{{ COMOUT_OCEAN_ANALYSIS | relpath(ROTDIR) }}/{{ r_prefix }}.mom6_perturbation.nc", "{{ COMOUT_OCEAN_ANALYSIS }}/mom6_increment.nc"]
107 {% endif %}
108 {% if EXP_WARM_START == True %}
109 - ["{{ ICSDIR }}/{{ COMOUT_MED_RESTART_PREV | relpath(ROTDIR) }}/{{ r_prefix }}.ufs.cpld.cpl.r.nc", "{{ COMOUT_MED_RESTART_PREV }}"]
110 {% endif %}
111 {% endif %}
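For readers unfamiliar with the template, here is a minimal Python sketch of how the r_prefix on line 22 resolves to the file name discussed in this issue. It assumes the to_YMD filter renders a date as YYYYMMDD; the function names are hypothetical and are not part of global-workflow.

```python
from datetime import datetime


def restart_prefix(model_start: datetime) -> str:
    """Build the 'YYYYMMDD.HH0000' prefix used in restart file names."""
    # Assumption: to_YMD yields YYYYMMDD, mirrored here with strftime.
    ymd = model_start.strftime("%Y%m%d")
    hour = model_start.strftime("%H")
    return ymd + "." + hour + "0000"


def mediator_restart_name(model_start: datetime) -> str:
    """Name of the mediator restart that lines 108-110 try to copy."""
    return f"{restart_prefix(model_start)}.ufs.cpld.cpl.r.nc"


# A model start of 2021-03-24 09Z gives the file named in this issue.
print(mediator_restart_name(datetime(2021, 3, 24, 9)))
# prints 20210324.090000.ufs.cpld.cpl.r.nc
```

With EXP_WARM_START set to True, the staging job looks for exactly this name, so a copy renamed with ".NO" will not be found.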
@KateFriedman-NOAA , the mediator file should be optional, and I assume your refactoring should keep that functionality. Is there an option to not abort when syncing the file handler?
@guillaumevernieres Optional, got it, thanks!
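One way to express "optional" in a staging step is sketched below. This is only an illustration under stated assumptions: copy_files and its optional argument are hypothetical names, and this is not the file handler used by global-workflow. Required sources that are missing still raise an error, while missing optional sources are recorded and skipped.

```python
import shutil
from pathlib import Path


def copy_files(pairs, optional=()):
    """Copy (source, destination_dir) pairs.

    Missing sources listed in `optional` are skipped instead of aborting;
    any other missing source raises FileNotFoundError.
    Returns two lists: the sources copied and the sources skipped.
    """
    optional = {str(p) for p in optional}
    copied, skipped = [], []
    for src, dest_dir in pairs:
        src_path = Path(src)
        if not src_path.is_file():
            if str(src) in optional:
                skipped.append(str(src))  # e.g. a mediator restart renamed to *.NO
                continue
            raise FileNotFoundError(f"required file is missing: {src}")
        Path(dest_dir).mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_path, Path(dest_dir) / src_path.name)
        copied.append(str(src))
    return copied, skipped
```

In this sketch the mediator restart would be passed in both pairs and optional, so the job continues when only the ".NO" copy is present.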
The WCDA g-w CI test failed on WCOSS2 (Cactus) during the gdasmarinebmat job with the traceback:
nid003046.cactus.wcoss2.ncep.noaa.gov 0: MOM_in domain decomposition
whalo = 2, ehalo = 2, shalo = 2, nhalo = 2
X-AXIS = 9 9 9 9
Y-AXIS = 5 4 4 4
nid003046.cactus.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: MOM_restart: MOM run restarted using : INPUT/MOM.res.nc
nid003046.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE 0: NetCDF: Variable not found: variable_att_exists: file:INPUT/MOM.res.nc- variable:
nid003046.cactus.wcoss2.ncep.noaa.gov 0:
FATAL from PE 0: NetCDF: Variable not found: variable_att_exists: file:INPUT/MOM.res.nc- variable:
nid003046.cactus.wcoss2.ncep.noaa.gov 0: Image PC Routine Line Source
libifcoremt.so.5 000014CBAC47FD4A tracebackqq_ Unknown Unknown
libsoca.so 000014CBCA84BBBE mpp_mod_mp_mpp_er 72 mpp_util_mpi.inc
libsoca.so 000014CBCABCAD52 fms_io_utils_mod_ 190 fms_io_utils.F90
libsoca.so 000014CBCA76F443 netcdf_io_mod_mp_ 381 netcdf_io.F90
libsoca.so 000014CBCA76F4E5 netcdf_io_mod_mp_ 465 netcdf_io.F90
libsoca.so 000014CBCA7A04E8 netcdf_io_mod_mp_ 1187 netcdf_io.F90
libsoca.so 000014CBCB7F2528 mom_io_infra_mp_g 530 MOM_io_infra.F90
libsoca.so 000014CBCB0A64AF mom_io_file_mp_ge 1230 MOM_io_file.F90
libsoca.so 000014CBCB0C1CA7 mom_restart_mp_re 1633 MOM_restart.F90
libsoca.so 000014CBCB1CCA91 mom_state_initial 538 MOM_state_initialization.F90
libsoca.so 000014CBCAD05B7B mom_mp_initialize 2961 MOM.F90
libsoca.so 000014CBCA65096E Unknown Unknown Unknown
I am running g-w built from g-w PR #2833. The AERO and UFSDA g-w CI run to completion. WCDA aborts as shown above. The log file with the failure is /lfs/h2/emc/da/noscrub/russ.treadon/COMROOT/prwcda/logs/2021032418/gdasmarinebmat.log on Cactus.
Two questions
develop ?
Regarding question 1, the marine bmat task recently had updates (for refactoring) in global-workflow, and I am encountering some bugs that are flushed out when trying to run with an ensemble (on Hera). It's not clear to me, though, why what you're seeing here would be confined to WCOSS.
@AndrewEichmann-NOAA , I have only run g-w CI on WCOSS2 (Cactus). I do not know if g-w WCDA CI runs on other machines. I found that env/WCOSS2.env does not contain entries for marine jobs. I added these entries in PR #2833. The fact that these entries are not in develop env/WCOSS2.env makes me wonder if we are ready to run g-w WCDA CI on WCOSS2.
@CatherineThomas-NOAA @RussTreadon-NOAA I'll have to dig deeper into this but I have been running the WCDA CI on Hera successfully, though it's possible that updating will catch something
@AndrewEichmann-NOAA , g-w WCDA CI works on Hera. I set it up this morning. All jobs successfully ran to completion:
Hera(hfe05):/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/prwcda$ rocotostat -d prwcda.db -w prwcda.xml -c all -s
CYCLE STATE ACTIVATED DEACTIVATED
202103241200 Done Aug 15 2024 13:36:19 Aug 15 2024 13:50:24
202103241800 Done Aug 15 2024 13:36:19 Aug 15 2024 14:50:22
ci/cases/pr/C48mx500_3DVarAOWCDA.yaml from develop at 336b78a has
skip_ci_on_hosts:
- wcoss2
- gaea
- orion
- hercules
I should not try running g-w WCDA CI on WCOSS2. I should stick to Hera.
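A minimal sketch of how a skip list like this can be checked for a given host is shown below. It is illustrative only: should_skip is a hypothetical helper, and the way the g-w CI scripts actually consume skip_ci_on_hosts is not shown in this thread.

```python
def should_skip(case_config: dict, host: str) -> bool:
    """Return True if `host` appears in the case's skip_ci_on_hosts list."""
    # Assumption: the key sits at the top level of the parsed case yaml.
    skip_hosts = case_config.get("skip_ci_on_hosts") or []
    return host.lower() in {h.lower() for h in skip_hosts}


# Values copied from C48mx500_3DVarAOWCDA.yaml at 336b78a, as quoted above.
case = {"skip_ci_on_hosts": ["wcoss2", "gaea", "orion", "hercules"]}

for host in ("hera", "wcoss2"):
    print(host, "skip" if should_skip(case, host) else "run")
# prints:
# hera run
# wcoss2 skip
```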
@RussTreadon-NOAA @AndrewEichmann-NOAA The last that I heard about the WCDA test on WCOSS2 was that the C++ issue was resolved and that there was a push to get all the needed files on the machine. That conversation predates the discovery of the problems with the v17 cycling prototypes which took most of @guillaumevernieres's attention before he went on leave. I don't think everything's been sorted yet.
The original request in this issue has been completed. Please open new issues to address any related needs discussed above. Closing as complete.
What is wrong?
The mediator restart on glopara Hera for the C48mx500 test case was suffixed with .NO to ensure that the model would not make use of it. The same thing needs to be done on Orion and WCOSS.
What should have happened?
On Orion, in the glopara directory there are 2 files: 20210324.090000.ufs.cpld.cpl.r.nc and 20210324.090000.ufs.cpld.cpl.r.nc.NO
20210324.090000.ufs.cpld.cpl.r.nc needs to be deleted.
What machines are impacted?
WCOSS2, Orion, Hercules
Steps to reproduce
N/A
Additional information
N/A
Do you have a proposed solution?
No response