@WalterKolczynski-NOAA , anything I can do to help move this PR forward?
@RussTreadon-NOAA Tests are failing on Hera due to full stmp disks. We need to be able to run the tests. Any help from developers in managing the shared space is greatly appreciated.
I reduced the Hera stmp footprint for role.jedipara and my account. Mary has been sending Hera over-quota emails. EMC management could send an email. While email brings the problem to everyone's attention, it alone doesn't reduce usage. As you say, developers need to free up stmp space.
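For what it's worth, a minimal sketch of how a developer might locate and clear their own aged files under a scratch area. The directory and the 5-day threshold below are illustrative stand-ins, not g-w policy or the real stmp path:

```shell
# Illustrative sketch only: list, then delete, one's own files older than 5 days.
# STMP_DIR and the age threshold are hypothetical; substitute the real stmp path.
STMP_DIR="${STMP_DIR:-$(mktemp -d)/stmp_demo}"
mkdir -p "$STMP_DIR"
touch "$STMP_DIR/recent.dat"                  # touched now: should survive the sweep
touch -d "10 days ago" "$STMP_DIR/stale.dat"  # aged (GNU touch): should be removed
find "$STMP_DIR" -type f -mtime +5 -print     # dry run: review candidates first
find "$STMP_DIR" -type f -mtime +5 -delete    # then actually delete them
```

Running the `-print` form first makes it easy to sanity-check what would be removed before committing to the `-delete`.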
@WalterKolczynski-NOAA and @aerorahul: This morning role.jedipara successfully completed g-w CI for
pr2700_ufsda
pr2700_gsida
pr2700_aero
pr2700_wcda
Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_ufsda$ rocotostat -d pr2700_ufsda.db -w pr2700_ufsda.xml -c all -s
CYCLE STATE ACTIVATED DEACTIVATED
202402231800 Done Jun 25 2024 12:00:19 Jun 25 2024 12:25:21
202402240000 Done Jun 25 2024 12:00:19 Jun 25 2024 15:25:16
Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_gsida$ rocotostat -d pr2700_gsida.db -w pr2700_gsida.xml -c all -s
CYCLE STATE ACTIVATED DEACTIVATED
202112201800 Done Jun 25 2024 12:00:22 Jun 25 2024 12:25:23
202112210000 Done Jun 25 2024 12:00:22 Jun 25 2024 15:25:18
202112210600 Done Jun 25 2024 12:00:22 Jun 25 2024 15:25:18
Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_aero$ rocotostat -d pr2700_aero.db -w pr2700_aero.xml -c all -s
CYCLE STATE ACTIVATED DEACTIVATED
202112201200 Done Jun 25 2024 10:45:20 Jun 25 2024 11:05:14
202112201800 Done Jun 25 2024 10:45:20 Jun 25 2024 15:15:14
202112210000 Done Jun 25 2024 10:45:20 Jun 25 2024 15:25:13
Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_wcda$ rocotostat -d pr2700_wcda.db -w pr2700_wcda.xml -c all -s
CYCLE STATE ACTIVATED DEACTIVATED
202103241200 Done Jun 25 2024 09:45:12 Jun 25 2024 10:00:34
202103241800 Done Jun 25 2024 09:45:12 Jun 25 2024 10:46:18
None of these CI parallels encountered disk quota problems on Hera this morning.
When the disk space returns to usable status, we will re-run the CI to ensure all tests (not just the DA tests) run to completion. It is unreasonable to ask you to run all the tests. It also defeats the purpose of developing automated testing. Thanks for your effort, diligence, and support.
COMROOT, EXPDIR, and RUNDIRS have been removed from /scratch1/NCEPDEV/stmp2/role.jedipara to free up additional Hera stmp space.
It would be good to document for EMC management the delay in g-w CI testing and PR merger caused by system and user issues (disk quota, heavy queue load, etc).
@WalterKolczynski-NOAA : should I be concerned about the message This branch cannot be rebased due to conflicts?
I thought we normally did a Squash and merge of PRs into develop. Now I see that Rebase and merge is selected.
We do squash and merge into develop, but updating your branch with develop does not (and should not). You can either rebase or standard merge, but either way any conflicts will need to be addressed.
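To illustrate the distinction in a throwaway repo (branch names below are stand-ins created in a temp directory, not the actual g-w remotes), updating a feature branch from its base is an ordinary merge that adds a merge commit, separate from how the PR itself is later squash-merged:

```shell
# Hypothetical demo repo: "update branch" is a plain merge of the base branch
# into the feature branch; any conflicts would have to be resolved at this step.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git config user.email demo@example.com
git config user.name demo
echo base > file.txt && git add file.txt && git commit -qm "base"
main=$(git rev-parse --abbrev-ref HEAD)   # master or main, git-version dependent
git checkout -qb feature
echo feature >> file.txt && git commit -qam "feature work"
git checkout -q "$main"
echo develop > other.txt && git add other.txt && git commit -qm "develop work"
git checkout -q feature
git merge -q --no-edit "$main"            # the "update branch" step: a merge commit
n=$(git log --oneline | wc -l | tr -d ' ')
echo "$n"                                 # base + feature + develop + merge commits
```

A rebase (`git rebase "$main"`) would instead replay the feature commits on top of the base with no merge commit, which is why GitHub refuses it when conflicts exist.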
@WalterKolczynski-NOAA : RussTreadon-NOAA:feature/rename_atm is in sync with the current head of g-w develop. This PR is ready for automated g-w CI.
We're still holding this until space clears so we can run on Hera. That's the only place the AOWCDA test currently runs.
Thank you @WalterKolczynski-NOAA for the update.
It is very unfortunate that a g-w PR is stuck because we can't find sufficient disk space on Hera. As this PR documents, C48mx500_3DVarAOWCDA has successfully run on Hera more than once. These successful tests ran in /scratch1/NCEPDEV/stmp2. Does g-w CI have the ability to run in different Hera filesets?
The user account running the CI should have access to this location to create DATA.
AOWCDA heads up: Set up g-w CI C48mx500_3DVarAOWCDA on Hera with
export PSLOT="pr2700_wcda"
export EXPDIR="/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/${PSLOT}"
export ROTDIR="/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/${PSLOT}"
...
export STMP="/scratch1/NCEPDEV/stmp2/${USER}"
export PTMP="/scratch1/NCEPDEV/stmp4/${USER}"
in config.base. Jobs successfully ran up to the 20210324 18Z gdasfcst. This job aborted with
21: (abort_ice)ABORTED:
21: (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
20:
20: (abort_ice)ABORTED:
20: (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
20: Abort(128) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 20
Prior to the above, see
12: WARNING from PE 0: read_field_3d:The variable Temp has an unlimited dimension in INPUT/mom6_increment.nc but no time level is specified.
12:
12:
12: WARNING from PE 0: read_field_3d:The variable Temp has an unlimited dimension in INPUT/mom6_increment.nc but no time level is specified.
12:
12:
12: WARNING from PE 0: read_field_3d:The variable Salt has an unlimited dimension in INPUT/mom6_increment.nc but no time level is specified.
in /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/pr2700_wcda/logs/2021032418/gdasfcst.log
ncdump -hcs /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/pr2700_wcda/gdas.20210324/18/analysis/ocean/gdas.t18z.ocninc.nc
returns
dimensions:
xaxis_1 = 72 ;
yaxis_1 = 35 ;
zaxis_1 = 25 ;
Time = UNLIMITED ; // (1 currently)
variables:
double xaxis_1(xaxis_1) ;
xaxis_1:long_name = "xaxis_1" ;
xaxis_1:units = "none" ;
xaxis_1:cartesian_axis = "X" ;
double yaxis_1(yaxis_1) ;
yaxis_1:long_name = "yaxis_1" ;
yaxis_1:units = "none" ;
yaxis_1:cartesian_axis = "Y" ;
double zaxis_1(zaxis_1) ;
zaxis_1:long_name = "zaxis_1" ;
zaxis_1:units = "none" ;
zaxis_1:cartesian_axis = "Z" ;
double Time(Time) ;
Time:long_name = "Time" ;
Time:units = "time level" ;
Time:cartesian_axis = "T" ;
double Temp(Time, zaxis_1, yaxis_1, xaxis_1) ;
Temp:long_name = "Temp" ;
Temp:units = "none" ;
Temp:checksum = "7C68000000000000" ;
...
double h(Time, zaxis_1, yaxis_1, xaxis_1) ;
h:long_name = "h" ;
h:units = "none" ;
h:checksum = "7830C06173333B6C" ;
// global attributes:
:filename = ".//ocn.mom6_iau.incr.2021-03-24T15:00:00Z.nc" ;
:_Format = "64-bit offset" ;
}
@guillaumevernieres , is Time = UNLIMITED expected in ocninc.nc?
Tagging @WalterKolczynski-NOAA for awareness
Oddity: While account_params indicates that stmp2 and stmp4 are over quota,
Directory: /scratch1/NCEPDEV/stmp2 DiskInUse=724752 GB, Quota=700000 GB, Files=30194982, FileQUota=140000000
Directory: /scratch1/NCEPDEV/stmp4 DiskInUse=724752 GB, Quota=700000 GB, Files=30194982, FileQUota=140000000
there are no disk quota exceeded messages in any of the pr2700_wcda log files. Are the numbers reported by account_params accurate?
Hera stmp is now below the 100% threshold. Can we kick off the final testing now to get this over the finish line?
Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_72bf7d88/logs/2021032418/gdasfcst.log
Follow link here to view the contents of the above file(s): (link)
Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C48mx500_3DVarAOWCDA_72bf7d88
Experiment C96_atmaerosnowDA FAILED on Hera with error logs:
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gdasaeroanlrun.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gfsaeroanlrun.log
Follow link here to view the contents of the above file(s): (link)
Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C96_atmaerosnowDA_72bf7d88
C48mx500_3DVarAOWCDA FAILURE
A check of
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_72bf7d88/logs/2021032418/gdasfcst.log
shows the gdasfcst failure to be the same as reported above.
I do not know the reason for this failure. @JessicaMeixner-NOAA or @guillaumevernieres , have you seen this type of error before in the wcda system?
Here is model printout when the model aborted
21: (abort_ice)ABORTED:
21: (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
20:
20: (abort_ice)ABORTED:
20: (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
20: Abort(128) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 20
C96_atmaerosnowDA FAILURE
A check of the log files
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gdasaeroanlrun.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gfsaeroanlrun.log
shows that both jobs failed for the same reason: an expected input fix file is not found
0: OOPS_STATS Run start - Runtime: 2.35 sec, Memory: total: 48.86 Gb, per task: min = 123.86 Mb, max = 141.50 Mb
0: Run: Starting oops::Variational<FV3JEDI, UFO and IODA observations>
0: OOPS_STATS Variational start - Runtime: 2.35 sec, Local Memory: 142.58 Mb
0:
0: FATAL from PE 0: get_ascii_file_num_lines: File ./fv3jedi/fmsmpp.nml does not exist.
0:
0:
0: FATAL from PE 0: get_ascii_file_num_lines: File ./fv3jedi/fmsmpp.nml does not exist.
0:
0: Image PC Routine Line Source
0: libifcoremt.so.5 000014D1B5A52DCB tracebackqq_ Unknown Unknown
0: libfms.so 000014D18E21AA2E mpp_mod_mp_mpp_er Unknown Unknown
A check of the run directories for both the gfs and gdas jobs shows only a single file in ./fv3jedi
/scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C96_atmaerosnowDA_72bf7d88/gdasaeroanl_00/fv3jedi:
total used in directory 12 available 511827864
drwxr-sr-x 2 Terry.McGuinness stmp 4096 Jun 27 21:39 .
drwxr-sr-x 9 Terry.McGuinness stmp 4096 Jun 27 21:46 ..
-rw-r--r-- 1 Terry.McGuinness stmp 2154 Jun 24 15:07 fv3jedi_fieldmetadata_restart.yaml
However, both the gfs and gdas aeroanlinit jobs indicate that the missing file was copied to the run directory. For example, gdasaeroanlinit.log contains
2024-06-27 21:33:36,366 - INFO - file_utils : Copied /scratch1/NCEPDEV/global/CI/2700/gfs/fix/gdas/fv3jedi/fv3files/fmsmpp.nml to /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C96_atmaerosnowDA_72bf7d88/gdasaeroanl_00/fv3jedi/fmsmpp.nml
ls -l of the source file confirms that it exists
Hera(hfe05):/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100$ ls -l /scratch1/NCEPDEV/global/CI/2700/gfs/fix/gdas/fv3jedi/fv3files/fmsmpp.nml
-rw-r--r-- 1 role.glopara global 362 Jun 30 2022 /scratch1/NCEPDEV/global/CI/2700/gfs/fix/gdas/fv3jedi/fv3files/fmsmpp.nml
Interestingly, g-w PR #2729 experienced the same failure on WCOSS2. I cannot explain this behavior at present. @andytangborn , have you seen this error in any of your aerosol tests?
@RussTreadon-NOAA , I don't think cice is the issue, the problem starts with a NaN in Tsfc. I'll do some digging, but I first need to figure out how to do x11 forwarding with my gfe ...
I only turned the C96_atmaerosnowDA test on yesterday (#2720), but it passed then, both when I ran it manually and during the CI process. Given it is now failing across the board in multiple PRs on WCOSS, my instinct would be that something outside of global-workflow changed.
Edit: or possibly a PR that was merged independently around the same time that didn't have each other's changes.
I, too, successfully ran C96_atmaerosnowDA under role.jedipara and my user account on Hera in the past. The fact that C96_atmaerosnowDA fails with both an updated gdas.cd hash (this PR) and an old gdas.cd hash (PR #2729) suggests that the root cause for the failure may lie outside of GDASApp.
Stronger evidence is the fact that PRs #2729 and #2720 use the same gdas.cd hash. Running C96_atmaerosnowDA with #2720 passed; running it with #2729 failed. Since both PRs use the same gdas.cd hash, this suggests the root cause is outside GDASApp.
@RussTreadon-NOAA I have not seen that negative seaice area issue before. @guillaumevernieres - let me know how I can help looking into this issue more.
I believe #2719 needs to be reverted.
Consider the case above, C96_atmaerosnowDA_72bf7d88. The 2021122100 gfsaeroanlinit and gdasaeroanlinit logs are timestamped 21:33. The failed gfsaeroanlrun and gdasaeroanlrun logs are timestamped 21:39 and 21:46. The 2021122018 gdascleanup log is also timestamped 21:39.
I am 99% confident that what is happening here is that the find/mtime cleanup is not working properly and is removing files while other jobs are still running. Perhaps a copy does not change the modified time (i.e., the original timestamps are preserved), so the files are removed while the job is still running?
FileHandler preserves dates...
ls -l /scratch1/NCEPDEV/stmp2/Cory.R.Martin/RUNDIRS/snowenstest/enkfgdasesnowanl_00/orog/det
total 17280
-rw-r--r-- 1 Cory.R.Martin stmp 844183 Dec 12 2023 C96.mx500_oro_data.tile1.nc
-rw-r--r-- 1 Cory.R.Martin stmp 844183 Dec 12 2023 C96.mx500_oro_data.tile2.nc
-rw-r--r-- 1 Cory.R.Martin stmp 844183 Dec 12 2023 C96.mx500_oro_data.tile3.nc
-rw-r--r-- 1 Cory.R.Martin stmp 844183 Dec 12 2023 C96.mx500_oro_data.tile4.nc
-rw-r--r-- 1 Cory.R.Martin stmp 844183 Dec 12 2023 C96.mx500_oro_data.tile5.nc
-rw-r--r-- 1 Cory.R.Martin stmp 844183 Dec 12 2023 C96.mx500_oro_data.tile6.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12 2023 C96_grid.tile1.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12 2023 C96_grid.tile2.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12 2023 C96_grid.tile3.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12 2023 C96_grid.tile4.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12 2023 C96_grid.tile5.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12 2023 C96_grid.tile6.nc
-rw-r--r-- 1 Cory.R.Martin stmp 22862 Dec 12 2023 C96_mosaic.nc
Hera(hfe05):/scratch1/NCEPDEV/stmp2/Cory.R.Martin/RUNDIRS/snowenstest/enkfgdasesnowanl_00$ ls -l ./fv3jedi/
total 32
-rw-r--r-- 1 Cory.R.Martin stmp 8406 Jun 30 2022 akbk.nc4
-rw-r--r-- 1 Cory.R.Martin stmp 1567 Jun 30 2022 field_table
-rw-r--r-- 1 Cory.R.Martin stmp 362 Jun 30 2022 fmsmpp.nml
-rw-r--r-- 1 Cory.R.Martin stmp 492 Jun 13 17:20 fv3jedi_fieldmetadata_fv3inc.yaml
-rw-r--r-- 1 Cory.R.Martin stmp 1561 Jun 13 17:20 fv3jedi_fieldmetadata_history.yaml
-rw-r--r-- 1 Cory.R.Martin stmp 2154 Jun 13 17:20 fv3jedi_fieldmetadata_restart.yaml
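A small self-contained check of that hypothesis, using temp files only (`cp -p` stands in for a date-preserving copy like the one evidenced above; the filenames are made up):

```shell
# Sketch: a timestamp-preserving copy keeps the source's old mtime, so a
# cleanup sweep using "find ... -mtime +3" matches the brand-new copy too.
work=$(mktemp -d)
touch -d "30 days ago" "$work/fix_file.nml"               # stand-in old fix file
mkdir "$work/rundir"
cp -p "$work/fix_file.nml" "$work/rundir/preserved.nml"   # keeps the old mtime
cp "$work/fix_file.nml" "$work/rundir/plain.nml"          # plain cp: mtime = now
matches=$(find "$work/rundir" -type f -mtime +3)
echo "$matches"   # only preserved.nml is matched, despite being seconds old
```

So a run directory staged seconds ago can still look "3+ days old" to the purge and get swept out from under a running job.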
As per @aerorahul 's request in PR #2719, comment out the find section in scripts/exglobal_cleanup.sh. Done at c1ef4b3.
c1ef4b3 generates the following shellcheck error
Error: SHELLCHECK_WARNING:
./scripts/exglobal_cleanup.sh:17:1: warning[SC2034]: purge_every_days appears unused. Verify use (or export if used externally).
I will not resolve this error since I edited scripts/exglobal_cleanup.sh as a test for @aerorahul.
C48mx500_3DVarAOWCDA test
Install feature/rename_atm at 8fc02e2 (created Mon Jun 24 18:05:11 2024) on Hera in /scratch1/NCEPDEV/da/Russ.Treadon/git/global-workflow/rename_atm_8fc02e2. WCDA passes
Hera(hfe04):/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/rename_atm_8fc02e2_wcda$ rocotostat -d rename_atm_8fc02e2_wcda.db -w rename_atm_8fc02e2_wcda.xml -c all -s
CYCLE STATE ACTIVATED DEACTIVATED
202103241200 Done Jun 28 2024 12:06:56 Jun 28 2024 12:30:35
202103241800 Done Jun 28 2024 12:06:56 Jun 28 2024 13:40:32
Below is a list of commits to feature/rename_atm from the current head (c1ef4b30) back to 8fc02e29
c1ef4b30 comment out find lines for purge_every_days (#2719)
72bf7d88 Merge branch 'develop' into feature/rename_atm
9476c123 updated Finalize in Jenkinsfile and added try block around scm checkout (#2692)
d70afac0 Merge branch 'develop' into feature/rename_atm
968568f6 Activate snow DA test on WCOSS (#2720)
3e840357 Merge branch 'NOAA-EMC:develop' into feature/rename_atm
7706760b Cleanup of stale RUNDIRS from an experiment (#2719)
8da0821a Merge branch 'develop' into feature/rename_atm
89629916 Update logic for MOM6 number of layers/exception values (#2681)
12431f76 Update wave jobs to use COMIN/COMOUT (#2643)
bc33c2de Merge branch 'develop' into feature/rename_atm
b902c0ba Assign machine- and RUN-specific resources (#2672)
8fc02e29 add g-w python to PYTHONPATH on WCOSS2 in upp.sh (#2700)
These are all *Merge branch 'develop' into feature/rename_atm* commits, except for the most recent commit, c1ef4b30, which was requested by @aerorahul.
It is safe to just comment out the line that defines purge_every_days as well.
@aerorahul , do you mean you want the entire section
# Search and delete files/directories from DATAROOT/ older than ${purge_every_days} days
# purge_every_days should be a positive integer
purge_every_days=3
# Find and delete files older than ${purge_every_days} days
#find "${DATAROOT}/"* -type f -mtime "+${purge_every_days}" -exec rm -f {} \;
# Find and delete directories older than ${purge_every_days} days
#find "${DATAROOT}/"* -type d -mtime "+${purge_every_days}" -exec rm -rf {} \;
removed from exglobal_cleanup.sh in feature/rename_atm?
I thought EIB staff would open a hotfix PR to get a fast-track correction into develop and then we would merge the updated develop into feature/rename_atm.
Normally we would. It would need to be tested and that is why I requested to include it in your PR (one CI run to test both)
Makes sense. g-w CI is slow. Given this, do you want me to remove the entire purge_every_days section from the feature/rename_atm snapshot of exglobal_cleanup.sh? If we add the exglobal_cleanup.sh change the g-w team wants to PR #2700, there is no need for a hotfix PR.
Might be able to salvage the cleanup by changing mtime to atime so it looks at the last accessed time instead.
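A quick sketch of that difference with temp files (the filename is made up; note that whether a read actually refreshes atime depends on mount options such as relatime/noatime, so this behavior is filesystem-dependent):

```shell
# Sketch: "find -atime +3" keys on last access rather than last modification,
# so a recently *read* input would escape the sweep (filesystem permitting),
# even though its mtime is old.
work=$(mktemp -d)
touch -d "30 days ago" "$work/old_input.nc"      # old mtime (and atime)
touch -a -d "30 days ago" "$work/old_input.nc"   # explicitly age the atime too
find "$work" -type f -mtime +3 -print            # matched: modified long ago
stale=$(find "$work" -type f -atime +3)
echo "$stale"                                    # matched: not accessed in >3 days
```

Under this scheme, a job that is still reading its staged inputs would keep refreshing their atime and protect them from the purge, which is the salvage being proposed.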
Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/EXPDIR/C48mx500_3DVarAOWCDA_c1ef4b30
Error Log: (here)
0: Rayleigh_Super E-folding time (mb days):
0: 1 1.2781460E-02 10.07095
0: 2 2.0334043E-02 10.61121
0: 3 3.1773422E-02 11.73400
0: 4 4.8782814E-02 13.60601
0: 5 7.3618531E-02 16.56944
0: 6 0.1092587 21.29239
0: 7 0.1595392 29.13565
0: 8 0.2292877 43.15143
0: 9 0.3244748 71.30933
0: 10 0.4523215 139.9974
0: 11 0.6213929 383.1961
0: 12 0.8416426 2896.457
0: slurmstepd: error: *** STEP 62601522.0 ON h35m20 CANCELLED AT 2024-06-28T15:45:55 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h35m20: tasks 1-3,5-11,13-39: Killed
srun: Terminating StepId=62601522.0
srun: error: h35m20: tasks 0,4: Killed
srun: error: h35m20: task 12: Killed
+ exglobal_forecast.sh[1]: postamble exglobal_forecast.sh 1719589491 137
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 15:45:57 with error code 137 (time elapsed: 00:01:06)
+ JGLOBAL_FORECAST[1]: postamble JGLOBAL_FORECAST 1719589466 137
+ preamble.sh[70]: set +x
End JGLOBAL_FORECAST at 15:45:57 with error code 137 (time elapsed: 00:01:31)
+ fcst.sh[1]: postamble fcst.sh 1719589463 137
+ preamble.sh[70]: set +x
End fcst.sh at 15:45:57 with error code 137 (time elapsed: 00:01:34)
_______________________________________________________________
Start Epilog on node h35m20 for job 62601522 :: Fri Jun 28 15:45:58 UTC 2024
Job 62601522 finished for user Terry.McGuinness in partition hera with exit code 137:0
_______________________________________________________________
End Epilogue Fri Jun 28 15:45:58 UTC 2024
C48mx500_3DVarAOWCDA FAILED on Hera for the same reason as before. The 20210324 18Z gdasfcst has NaN values. As noted above, C48mx500_3DVarAOWCDA passes on Hera using an earlier snapshot of feature/rename_atm before several mergers of g-w develop into the branch.
I believe we have multiple, compounding errors/issues here. The forecast failure has a different cause than the other failures (which are still likely due to the cleanup tasks).
The cleanup has been cleaned up in this test. So that cannot be it.
@aerorahul agreed, I meant that I bet the non-S2S tests will pass now (or fail in a different place)
I think the WCDA test failure is likely related to https://github.com/NOAA-EMC/global-workflow/pull/2681
@guillaumevernieres , @AndrewEichmann-NOAA , and I have all been looking into this. I don't know if @AndrewEichmann-NOAA has updates yet on testing with PR #2681 completely reverted. I am building an update to that PR for testing but was hoping to get results from Andy first before hitting go on those.
Reverting the missing value back to 0 fixes the issue. I thought I tested this properly, but apparently not.
Thank you @guillaumevernieres . I resubmitted the failed gdasfcst on Hera with the following change to parm/config/gfs/config.ufs
@@ -401,7 +401,7 @@ if [[ "${skip_mom6}" == "false" ]]; then
export cplflx=".true."
model_list="${model_list}.ocean"
nthreads_mom6=1
- MOM6_DIAG_MISVAL="-1e34"
+ MOM6_DIAG_MISVAL="0.0"
case "${mom6_res}" in
"500")
ntasks_mom6=8
This change is not sufficient or, more likely, I did not make the correct change. The WCDA gdasfcst still aborted. I now see that the PR #2681 change to config.ufs is more involved. Let me wait for the experts to chime in.
Experiment C48_S2SWA_gefs FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C48_S2SWA_gefs_c1ef4b30
While the change is fine, do we want to comment out the block (current approach) or simply remove all the commented out scripting?
We can remove it and replace it at a later time. I leave it to you.
GitHub is experiencing issues in processing Pull Requests, so we are just waiting for it to come back, so this branch can be updated with develop and the CI can be kicked off.
Given the GitHub issues I will leave exglobal_cleanup.sh alone.
Run C48mx500_3DVarAOWCDA from RussTreadon-NOAA:feature/rename_atm at dcec0813. 20210324 18Z gdasfcst successfully completed.
I'm sorry to say there was a stack overflow error on the Jenkins agent on Hera in the script that was monitoring the CI tests, but the good news is that all tests passed; we were just not able to report the success with the automated system:
Terry.McGuinness (hfe06) RUNTESTS $ pwd
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS
Terry.McGuinness (hfe06) RUNTESTS $ cat ci-run_check.log
Experiment C48_ATM_dcec0813 Completed 1 Cycles: *SUCCESS* at Fri Jun 28 22:38:26 UTC 2024
Experiment C48mx500_3DVarAOWCDA_dcec0813 Completed 2 Cycles: *SUCCESS* at Fri Jun 28 23:02:49 UTC 2024
Experiment C48_S2SW_dcec0813 Completed 1 Cycles: *SUCCESS* at Sat Jun 29 00:26:49 UTC 2024
Experiment C96_atm3DVar_dcec0813 Completed 3 Cycles: *SUCCESS* at Sat Jun 29 00:34:14 UTC 2024
Experiment C96C48_hybatmDA_dcec0813 Completed 3 Cycles: *SUCCESS* at Sat Jun 29 00:40:20 UTC 2024
Experiment C96_atmaerosnowDA_dcec0813 Completed 3 Cycles: *SUCCESS* at Sat Jun 29 00:52:30 UTC 2024
Experiment C48_S2SWA_gefs_dcec0813 Completed 1 Cycles: *SUCCESS* at Sat Jun 29 13:04:08 UTC 2024
Setting the label to PASSED by hand.
Thank you @TerrenceMcGuinness-NOAA for the update. I was wondering what had happened. Great to hear that all tests passed on Hera.
Thank you @aerorahul , @WalterKolczynski-NOAA , and @TerrenceMcGuinness-NOAA for persistently working to get this PR into develop.
Description
This PR updates the gdas.cd hash to bring in new JCB conventions. Resolves #2699.
From #2654: This PR will move much of the staging code that takes place in the python initialization subroutines of the variational and ensemble DA jobs into Jinja2-templated YAML files to be passed into the wxflow file handler. Much of the staging has already been done this way; this PR simply expands that strategy.
The old Python routines that were doing this staging are now removed. This is part of a broader refactoring of the pygfs tasking.
wxflow PR #30 is a companion to this PR.
Type of change
(updates the gdas.cd hash)
Change characteristics
How has this been tested?
Checklist