NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Update for JCB policies and stage DA job files with Jinja2-templates #2700

Closed RussTreadon-NOAA closed 1 day ago

RussTreadon-NOAA commented 2 weeks ago

Description

This PR updates the gdas.cd hash to bring in new JCB conventions. Resolves #2699

From #2654: This PR moves much of the staging code that takes place in the Python initialization subroutines of the variational and ensemble DA jobs into Jinja2-templated YAML files that are passed to the wxflow file handler. Much of the staging is already done this way; this PR simply expands that strategy.

The old Python routines that were doing this staging are now removed. This is part of a broader refactoring of the pygfs tasking.

wxflow PR #30 is a companion to this PR.
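To illustrate the staging strategy this PR expands, here is a minimal sketch of rendering a Jinja2-templated YAML file into a dict of copy pairs that would then be handed to a file handler. The template text, keys, and file names below are illustrative, not the actual g-w templates or the exact wxflow FileHandler API.

```python
# Sketch: render a Jinja2-templated YAML staging description with runtime
# context, then parse it into a dict a file handler could consume.
# Template content and paths are hypothetical examples.
import yaml
from jinja2 import Template

STAGE_TEMPLATE = """\
copy:
{% for tile in range(1, ntiles + 1) %}
- ["{{ fix_dir }}/oro_data.tile{{ tile }}.nc", "{{ run_dir }}/oro_data.tile{{ tile }}.nc"]
{% endfor %}
"""

def render_stage_yaml(context):
    """Render the template and parse the result into {'copy': [[src, dst], ...]}."""
    return yaml.safe_load(Template(STAGE_TEMPLATE).render(**context))

staged = render_stage_yaml({"ntiles": 6, "fix_dir": "/fix", "run_dir": "/run"})
print(len(staged["copy"]))  # one [src, dst] pair per tile
```

The advantage over per-job Python staging routines is that the list of files to stage lives in data (YAML) rather than code, so adding or changing staged files does not require touching the task classes.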

Type of change

Change characteristics

How has this been tested?

Checklist

RussTreadon-NOAA commented 1 week ago

@WalterKolczynski-NOAA , anything I can do to help move this PR forward?

aerorahul commented 1 week ago

@RussTreadon-NOAA Tests are failing on Hera due to full stmp disks. We need to be able to run the tests. Any help from developers in managing the shared space is greatly appreciated.

RussTreadon-NOAA commented 1 week ago

@RussTreadon-NOAA Tests are failing on Hera due to full stmp disks. We need to be able to run the tests. Any help from developers in managing the shared space is greatly appreciated.

I reduced the Hera stmp footprint for role.jedipara and my account. Mary has been sending Hera over-quota emails. EMC management could send an email, but while email brings the problem to everyone's attention, it alone doesn't reduce usage. As you say, developers need to free up stmp space.

RussTreadon-NOAA commented 1 week ago

@WalterKolczynski-NOAA and @aerorahul: This morning role.jedipara successfully completed g-w CI for

Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_gsida$ rocotostat -d pr2700_gsida.db -w pr2700_gsida.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201800        Done    Jun 25 2024 12:00:22    Jun 25 2024 12:25:23
202112210000        Done    Jun 25 2024 12:00:22    Jun 25 2024 15:25:18
202112210600        Done    Jun 25 2024 12:00:22    Jun 25 2024 15:25:18

Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_aero$ rocotostat -d pr2700_aero.db -w pr2700_aero.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202112201200        Done    Jun 25 2024 10:45:20    Jun 25 2024 11:05:14
202112201800        Done    Jun 25 2024 10:45:20    Jun 25 2024 15:15:14
202112210000        Done    Jun 25 2024 10:45:20    Jun 25 2024 15:25:13

Hera(hfe09):/scratch1/NCEPDEV/stmp2/role.jedipara/EXPDIR/pr2700_wcda$ rocotostat -d pr2700_wcda.db -w pr2700_wcda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Jun 25 2024 09:45:12    Jun 25 2024 10:00:34
202103241800        Done    Jun 25 2024 09:45:12    Jun 25 2024 10:46:18


None of these CI parallels encountered disk quota problems on Hera this morning.

aerorahul commented 1 week ago

When the disk space returns to usable status, we will re-run the CI to ensure all tests (not just the DA tests) run to completion. It is unreasonable to ask you to run all the tests. It also defeats the purpose of developing automated testing. Thanks for your effort, diligence, and support.

RussTreadon-NOAA commented 1 week ago

COMROOT, EXPDIR, and RUNDIRS have been removed from /scratch1/NCEPDEV/stmp2/role.jedipara to free up additional Hera stmp space.

It would be good to document for EMC management the delay in g-w CI testing and PR merger caused by system and user issues (disk quota, heavy queue load, etc).

RussTreadon-NOAA commented 6 days ago

@WalterKolczynski-NOAA : should I be concerned about the message This branch cannot be rebased due to conflicts?

I thought we normally did a Squash and merge of PRs into develop. Now I see that Rebase and merge is selected.

WalterKolczynski-NOAA commented 6 days ago

@WalterKolczynski-NOAA : should I be concerned about the message This branch cannot be rebased due to conflicts?

I thought we normally did a Squash and merge of PRs into develop. Now I see that Rebase and merge is selected.

We do squash and merge into develop, but updating your branch with develop does not (and should not). You can either rebase or standard merge, but either way any conflicts will need to be addressed.
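Walter's distinction can be demonstrated in a throwaway repository (all branch and file names below are illustrative, not the real g-w history):

```shell
# Sketch: squash-merge is only for landing a PR into develop; keeping a
# feature branch current with develop uses a standard merge (or rebase).
set -e
repo=$(mktemp -d)
cd "${repo}"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"
git checkout -qb develop
echo base > file.txt
git add file.txt && git commit -qm "base"
git checkout -qb feature/rename_atm
echo feature > feature.txt
git add feature.txt && git commit -qm "feature work"
git checkout -q develop
echo develop > develop.txt
git add develop.txt && git commit -qm "develop work"
git checkout -q feature/rename_atm
# Standard merge of develop into the feature branch; a rebase
# (git rebase develop) would instead replay the feature commits on top.
# Either way, any conflicts must be resolved once.
git merge -q --no-edit develop
git log --oneline -n 1
```

The merge leaves a merge commit on the feature branch while the eventual PR is still squash-merged into develop, which is why "updating your branch with develop" and "merging the PR" use different strategies.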

RussTreadon-NOAA commented 6 days ago

@WalterKolczynski-NOAA . RussTreadon-NOAA:feature/rename_atm is in sync with the current head of g-w develop. This PR is ready for automated g-w CI.

WalterKolczynski-NOAA commented 6 days ago

We're still holding this until space clears so we can run on Hera. That's the only place the AOWCDA test currently runs.

RussTreadon-NOAA commented 6 days ago

Thank you @WalterKolczynski-NOAA for the update.

It is very unfortunate that a g-w PR is stuck because we can't find sufficient disk space on Hera. As this PR documents, C48mx500_3DVarAOWCDA has successfully run on Hera more than once. These successful tests ran in /scratch1/NCEPDEV/stmp2. Does g-w CI have the ability to run in different Hera filesets?

aerorahul commented 6 days ago

Thank you @WalterKolczynski-NOAA for the update.

It is very unfortunate that a g-w PR is stuck because we can't find sufficient disk space on Hera. As this PR documents, C48mx500_3DVarAOWCDA has successfully run on Hera more than once. These successful tests ran in /scratch1/NCEPDEV/stmp2. Does g-w CI have the ability to run in different Hera filesets?

https://github.com/NOAA-EMC/global-workflow/blob/9476c1237af4adbc95f90bd1bdd34b6b99f2f8a3/workflow/hosts/hera.yaml#L7

The user account running the CI should have access to this location to create DATA.

RussTreadon-NOAA commented 5 days ago

AOWCDA heads up

Set up g-w CI C48mx500_3DVarAOWCDA on Hera with

export PSLOT="pr2700_wcda"
export EXPDIR="/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/${PSLOT}"
export ROTDIR="/scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/${PSLOT}"
...
export STMP="/scratch1/NCEPDEV/stmp2/${USER}"
export PTMP="/scratch1/NCEPDEV/stmp4/${USER}"

in config.base. Jobs successfully ran up to 20210324 18Z gdasfcst. This job aborted with

21:  (abort_ice)ABORTED:
21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
20:
20:  (abort_ice)ABORTED:
20:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
20: Abort(128) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 20

Prior to the above, see

12: WARNING from PE     0: read_field_3d:The variable Temp has an unlimited dimension in INPUT/mom6_increment.nc but no time level is specified.
12:
12:
12: WARNING from PE     0: read_field_3d:The variable Temp has an unlimited dimension in INPUT/mom6_increment.nc but no time level is specified.
12:
12:
12: WARNING from PE     0: read_field_3d:The variable Salt has an unlimited dimension in INPUT/mom6_increment.nc but no time level is specified.

in /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/pr2700_wcda/logs/2021032418/gdasfcst.log

ncdump -hcs /scratch1/NCEPDEV/stmp2/Russ.Treadon/COMROOT/pr2700_wcda/gdas.20210324/18/analysis/ocean/gdas.t18z.ocninc.nc returns

dimensions:
        xaxis_1 = 72 ;
        yaxis_1 = 35 ;
        zaxis_1 = 25 ;
        Time = UNLIMITED ; // (1 currently)
variables:
        double xaxis_1(xaxis_1) ;
                xaxis_1:long_name = "xaxis_1" ;
                xaxis_1:units = "none" ;
                xaxis_1:cartesian_axis = "X" ;
        double yaxis_1(yaxis_1) ;
                yaxis_1:long_name = "yaxis_1" ;
                yaxis_1:units = "none" ;
                yaxis_1:cartesian_axis = "Y" ;
        double zaxis_1(zaxis_1) ;
                zaxis_1:long_name = "zaxis_1" ;
                zaxis_1:units = "none" ;
                zaxis_1:cartesian_axis = "Z" ;
        double Time(Time) ;
                Time:long_name = "Time" ;
                Time:units = "time level" ;
                Time:cartesian_axis = "T" ;
        double Temp(Time, zaxis_1, yaxis_1, xaxis_1) ;
                Temp:long_name = "Temp" ;
                Temp:units = "none" ;
                Temp:checksum = "7C68000000000000" ;
...
        double h(Time, zaxis_1, yaxis_1, xaxis_1) ;
                h:long_name = "h" ;
                h:units = "none" ;
                h:checksum = "7830C06173333B6C" ;

// global attributes:
                :filename = ".//ocn.mom6_iau.incr.2021-03-24T15:00:00Z.nc" ;
                :_Format = "64-bit offset" ;
}

@guillaumevernieres , is `Time = UNLIMITED` expected in ocninc.nc?

Tagging @WalterKolczynski-NOAA for awareness

Oddity: While account_params indicates that stmp2 and stmp4 are over quota,

                Directory: /scratch1/NCEPDEV/stmp2 DiskInUse=724752 GB, Quota=700000 GB, Files=30194982, FileQUota=140000000
                Directory: /scratch1/NCEPDEV/stmp4 DiskInUse=724752 GB, Quota=700000 GB, Files=30194982, FileQUota=140000000

there are no disk quota exceeded messages in any of the pr2700_wcda log files. Are the numbers reported by account_params accurate?

CoryMartin-NOAA commented 5 days ago

Hera stmp is now below the 100% threshold. Can we kick off the final testing now to get this over the finish line?

emcbot commented 5 days ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_72bf7d88/logs/2021032418/gdasfcst.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 5 days ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C48mx500_3DVarAOWCDA_72bf7d88

emcbot commented 5 days ago

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gdasaeroanlrun.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gfsaeroanlrun.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 5 days ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C96_atmaerosnowDA_72bf7d88

RussTreadon-NOAA commented 5 days ago

C48mx500_3DVarAOWCDA FAILURE

A check of

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_72bf7d88/logs/2021032418/gdasfcst.log

shows the gdasfcst failure to be the same as reported above.

I do not know the reason for this failure. @JessicaMeixner-NOAA or @guillaumevernieres , have you seen this type of error before in the wcda system?

Here is model printout when the model aborted

21:  (abort_ice)ABORTED: 
21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
20:   
20:  (abort_ice)ABORTED: 
20:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
20: Abort(128) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 20

RussTreadon-NOAA commented 5 days ago

C96_atmaerosnowDA FAILURE

A check of the log files

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gdasaeroanlrun.log
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100/gfsaeroanlrun.log

shows that both jobs failed for the same reason. An expected input fix file is not found

  0: OOPS_STATS Run start                                - Runtime:      2.35 sec,  Memory: total:    48.86 Gb, per task: min =   123.86 Mb, max =   141.50 Mb
  0: Run: Starting oops::Variational<FV3JEDI, UFO and IODA observations>
  0: OOPS_STATS Variational start                        - Runtime:     2.35 sec,  Local Memory:   142.58 Mb
  0:
  0: FATAL from PE     0: get_ascii_file_num_lines: File ./fv3jedi/fmsmpp.nml does not exist.
  0:
  0:
  0: FATAL from PE     0: get_ascii_file_num_lines: File ./fv3jedi/fmsmpp.nml does not exist.
  0:
  0: Image              PC                Routine            Line        Source
  0: libifcoremt.so.5   000014D1B5A52DCB  tracebackqq_          Unknown  Unknown
  0: libfms.so          000014D18E21AA2E  mpp_mod_mp_mpp_er     Unknown  Unknown

A check of the run directories for both the gfs and gdas jobs shows only a single file in ./fv3jedi

 /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C96_atmaerosnowDA_72bf7d88/gdasaeroanl_00/fv3jedi:
  total used in directory 12 available 511827864
  drwxr-sr-x 2 Terry.McGuinness stmp 4096 Jun 27 21:39 .
  drwxr-sr-x 9 Terry.McGuinness stmp 4096 Jun 27 21:46 ..
  -rw-r--r-- 1 Terry.McGuinness stmp 2154 Jun 24 15:07 fv3jedi_fieldmetadata_restart.yaml

However, both the gfs and gdas aeroanlinit jobs indicate that the missing file was copied to the run directory. For example, gdasaeroanlinit.log contains

2024-06-27 21:33:36,366 - INFO     - file_utils  : Copied /scratch1/NCEPDEV/global/CI/2700/gfs/fix/gdas/fv3jedi/fv3files/fmsmpp.nml to /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C96_atmaerosnowDA_72bf7d88/gdasaeroanl_00/fv3jedi/fmsmpp.nml

ls -l of the source file confirms that it exists

Hera(hfe05):/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C96_atmaerosnowDA_72bf7d88/logs/2021122100$ ls -l /scratch1/NCEPDEV/global/CI/2700/gfs/fix/gdas/fv3jedi/fv3files/fmsmpp.nml
-rw-r--r-- 1 role.glopara global 362 Jun 30  2022 /scratch1/NCEPDEV/global/CI/2700/gfs/fix/gdas/fv3jedi/fv3files/fmsmpp.nml

Interestingly g-w PR #2729 experienced the same failure on WCOSS2.

I cannot explain this behavior at present.

@andytangborn , have you seen this error in any of your aerosol tests?

guillaumevernieres commented 5 days ago

C48mx500_3DVarAOWCDA FAILURE

A check of

/scratch1/NCEPDEV/global/CI/2700/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_72bf7d88/logs/2021032418/gdasfcst.log

shows the gdasfcst failure to be the same as reported above.

I do not know the reason for this failure. @JessicaMeixner-NOAA or @guillaumevernieres , have you seen this type of error before in the wcda system?

Here is model printout when the model aborted

21:  (abort_ice)ABORTED: 
21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
20:   
20:  (abort_ice)ABORTED: 
20:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
20: Abort(128) on node 20 (rank 20 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 20

@RussTreadon-NOAA , I don't think cice is the issue; the problem starts with a NaN in Tsfc. I'll do some digging, but I first need to figure out how to do X11 forwarding with my GFE ...

WalterKolczynski-NOAA commented 5 days ago

I only turned the C96_atmaerosnowDA test on yesterday (#2720), but it passed then, both when I ran it manually and during the CI process. Given it is now failing across the board in multiple PRs on WCOSS, my instinct would be that something outside of global-workflow changed.

Edit: or possibly a PR that was merged independently around the same time that didn't have each other's changes.

RussTreadon-NOAA commented 4 days ago

I, too, successfully ran C96_atmaerosnowDA under role.jedipara and my user account on Hera in the past. The fact that C96_atmaerosnowDA fails with both an updated gdas.cd hash (this PR) and an old gdas.cd hash (PR #2729) suggests that the root cause for the failure may lie outside of GDASApp.

Stronger evidence is the fact that PR #2729 and #2720 use the same gdas.cd hash. Running C96_atmaerosnowDA with #2720 passed. Running C96_atmaerosnowDA with #2729 failed. Since both PRs use the same gdas.cd hash this suggests the root cause is outside GDASApp.

JessicaMeixner-NOAA commented 4 days ago

@RussTreadon-NOAA I have not seen that negative seaice area issue before. @guillaumevernieres - let me know how I can help looking into this issue more.

CoryMartin-NOAA commented 4 days ago

I believe #2719 needs to be reverted.

Consider the case above C96_atmaerosnowDA_72bf7d88.

The 2021122100 gfsaeroanlinit and gdasaeroanlinit logs are timestamped 21:33. The gfsaeroanlrun and gdasaeroanlrun failed logs are timestamped 21:39 and 21:46. The 2021122018 gdascleanup log is also timestamped 21:39.

I am 99% confident that what is happening here is that the find/mtime cleanup is not working properly and is removing files while other jobs are still running. Perhaps a copy does not change the modification time (i.e., the original timestamps are preserved), and the files are then removed while the job is still running?
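The suspected failure mode can be reproduced in isolation. In this sketch (paths and the 3-day window are illustrative), a freshly staged file whose old modification time was preserved by the copy is selected by an mtime-based purge even though it was copied in seconds ago:

```shell
# Sketch of the race described above: an mtime-based purge consults only the
# (preserved) modification time, not when the file arrived in DATAROOT.
set -e
DATAROOT=$(mktemp -d)
purge_every_days=3
touch "${DATAROOT}/fresh_output.nc"                 # written just now, mtime = now
touch -d "10 days ago" "${DATAROOT}/staged_fix.nc"  # staged copy with preserved old mtime
# The cleanup's find selects the staged fix file, not the fresh output:
stale=$(find "${DATAROOT}" -type f -mtime "+${purge_every_days}")
echo "${stale}"
rm -rf "${DATAROOT}"
```

If the cleanup then ran `rm` on that list while an analysis job still needed `staged_fix.nc`, the job would fail exactly as the aeroanlrun logs show.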

CoryMartin-NOAA commented 4 days ago

FileHandler preserves dates...

ls -l /scratch1/NCEPDEV/stmp2/Cory.R.Martin/RUNDIRS/snowenstest/enkfgdasesnowanl_00/orog/det

total 17280
-rw-r--r-- 1 Cory.R.Martin stmp  844183 Dec 12  2023 C96.mx500_oro_data.tile1.nc
-rw-r--r-- 1 Cory.R.Martin stmp  844183 Dec 12  2023 C96.mx500_oro_data.tile2.nc
-rw-r--r-- 1 Cory.R.Martin stmp  844183 Dec 12  2023 C96.mx500_oro_data.tile3.nc
-rw-r--r-- 1 Cory.R.Martin stmp  844183 Dec 12  2023 C96.mx500_oro_data.tile4.nc
-rw-r--r-- 1 Cory.R.Martin stmp  844183 Dec 12  2023 C96.mx500_oro_data.tile5.nc
-rw-r--r-- 1 Cory.R.Martin stmp  844183 Dec 12  2023 C96.mx500_oro_data.tile6.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12  2023 C96_grid.tile1.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12  2023 C96_grid.tile2.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12  2023 C96_grid.tile3.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12  2023 C96_grid.tile4.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12  2023 C96_grid.tile5.nc
-rw-r--r-- 1 Cory.R.Martin stmp 2096689 Dec 12  2023 C96_grid.tile6.nc
-rw-r--r-- 1 Cory.R.Martin stmp   22862 Dec 12  2023 C96_mosaic.nc
Hera(hfe05):/scratch1/NCEPDEV/stmp2/Cory.R.Martin/RUNDIRS/snowenstest/enkfgdasesnowanl_00$ ls -l ./fv3jedi/
total 32
-rw-r--r-- 1 Cory.R.Martin stmp 8406 Jun 30  2022 akbk.nc4
-rw-r--r-- 1 Cory.R.Martin stmp 1567 Jun 30  2022 field_table
-rw-r--r-- 1 Cory.R.Martin stmp  362 Jun 30  2022 fmsmpp.nml
-rw-r--r-- 1 Cory.R.Martin stmp  492 Jun 13 17:20 fv3jedi_fieldmetadata_fv3inc.yaml
-rw-r--r-- 1 Cory.R.Martin stmp 1561 Jun 13 17:20 fv3jedi_fieldmetadata_history.yaml
-rw-r--r-- 1 Cory.R.Martin stmp 2154 Jun 13 17:20 fv3jedi_fieldmetadata_restart.yaml

RussTreadon-NOAA commented 4 days ago

As per @aerorahul's request in PR #2719, the find section in scripts/exglobal_cleanup.sh has been commented out. Done at c1ef4b3.

RussTreadon-NOAA commented 4 days ago

c1ef4b3 generates the following shellcheck error

Error: SHELLCHECK_WARNING:
./scripts/exglobal_cleanup.sh:17:1: warning[SC2034]: purge_every_days appears unused. Verify use (or export if used externally).

I will not resolve this error since I edited scripts/exglobal_cleanup.sh as a test for @aerorahul

RussTreadon-NOAA commented 4 days ago

C48mx500_3DVarAOWCDA test

Installed feature/rename_atm at 8fc02e2 (created Mon Jun 24 18:05:11 2024) on Hera in /scratch1/NCEPDEV/da/Russ.Treadon/git/global-workflow/rename_atm_8fc02e2. WCDA passes:


Hera(hfe04):/scratch1/NCEPDEV/stmp2/Russ.Treadon/EXPDIR/rename_atm_8fc02e2_wcda$ rocotostat -d rename_atm_8fc02e2_wcda.db -w rename_atm_8fc02e2_wcda.xml -c all -s
   CYCLE         STATE           ACTIVATED              DEACTIVATED
202103241200        Done    Jun 28 2024 12:06:56    Jun 28 2024 12:30:35
202103241800        Done    Jun 28 2024 12:06:56    Jun 28 2024 13:40:32

Below is a list of commits to feature/rename_atm from current head (c1ef4b30) back to 8fc02e29

c1ef4b30 comment out find lines for purge_every_days (#2719)
72bf7d88 Merge branch 'develop' into feature/rename_atm
9476c123 updated Finalize in Jenkinsfile and added try block around scm checkout (#2692)
d70afac0 Merge branch 'develop' into feature/rename_atm
968568f6 Activate snow DA test on WCOSS (#2720)
3e840357 Merge branch 'NOAA-EMC:develop' into feature/rename_atm
7706760b Cleanup of stale RUNDIRS from an experiment (#2719)
8da0821a Merge branch 'develop' into feature/rename_atm
89629916 Update logic for MOM6 number of layers/exception values (#2681)
12431f76 Update wave jobs to use COMIN/COMOUT (#2643)
bc33c2de Merge branch 'develop' into feature/rename_atm
b902c0ba Assign machine- and RUN-specific resources (#2672)
8fc02e29 add g-w python to PYTHONPATH on WCOSS2 in upp.sh (#2700)

These are all _merge branch develop into feature/rename_atm_ commits except for the head commit, c1ef4b30, which was requested by @aerorahul.

aerorahul commented 4 days ago

c1ef4b3 generates the following shellcheck error

Error: SHELLCHECK_WARNING:
./scripts/exglobal_cleanup.sh:17:1: warning[SC2034]: purge_every_days appears unused. Verify use (or export if used externally).

I will not resolve this error since I edited scripts/exglobal_cleanup.sh as a test for @aerorahul

It is safe to just comment out the line that defines purge_every_days as well.
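Either fix clears SC2034. A sketch of the two options (the variable and message are taken from the snippet quoted above; the directive comment is standard shellcheck syntax):

```shell
# Option 1: comment out the definition along with the find commands:
#purge_every_days=3

# Option 2: keep the definition and mark it as intentionally unused,
# so shellcheck stops flagging SC2034 on the next line:
# shellcheck disable=SC2034
purge_every_days=3

echo "purge window: ${purge_every_days} days"
```

Option 2 is convenient when the variable will be used again once the find block is restored.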

RussTreadon-NOAA commented 4 days ago

@aerorahul , do you mean you want the entire section

# Search and delete files/directories from DATAROOT/ older than ${purge_every_days} days
# purge_every_days should be a positive integer
purge_every_days=3

# Find and delete files older than ${purge_every_days} days
#find "${DATAROOT}/"* -type f -mtime "+${purge_every_days}" -exec rm -f {} \;

# Find and delete directories older than ${purge_every_days} days
#find "${DATAROOT}/"* -type d -mtime "+${purge_every_days}" -exec rm -rf {} \;

removed from exglobal_cleanup.sh in feature/rename_atm?

I thought EIB staff would open a hotfix PR to get a fast track correction into develop and then we would merge the updated develop into feature/rename_atm.

aerorahul commented 4 days ago

@aerorahul , do you mean you want the entire section

# Search and delete files/directories from DATAROOT/ older than ${purge_every_days} days
# purge_every_days should be a positive integer
purge_every_days=3

# Find and delete files older than ${purge_every_days} days
#find "${DATAROOT}/"* -type f -mtime "+${purge_every_days}" -exec rm -f {} \;

# Find and delete directories older than ${purge_every_days} days
#find "${DATAROOT}/"* -type d -mtime "+${purge_every_days}" -exec rm -rf {} \;

removed from exglobal_cleanup.sh in feature/rename_atm?

I thought EIB staff would open a hotfix PR to get a fast track correction into develop and then we would merge the updated develop into feature/rename_atm.

Normally we would. It would need to be tested and that is why I requested to include it in your PR (one CI run to test both)

RussTreadon-NOAA commented 4 days ago

Makes sense. g-w CI is slow. Given this, do you want me to remove the entire `purge_every_days` section from the `feature/rename_atm` snapshot of `exglobal_cleanup.sh`? If we add the `exglobal_cleanup.sh` change the g-w team wants to PR #2700, there is no need for a hotfix PR.

WalterKolczynski-NOAA commented 4 days ago

Might be able to salvage the cleanup by changing `-mtime` to `-atime` so it looks at the last-accessed time instead.

emcbot commented 4 days ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/EXPDIR/C48mx500_3DVarAOWCDA_c1ef4b30

Error Log: (here)

 0:  Rayleigh_Super E-folding time (mb days):
 0:            1  1.2781460E-02   10.07095
 0:            2  2.0334043E-02   10.61121
 0:            3  3.1773422E-02   11.73400
 0:            4  4.8782814E-02   13.60601
 0:            5  7.3618531E-02   16.56944
 0:            6  0.1092587       21.29239
 0:            7  0.1595392       29.13565
 0:            8  0.2292877       43.15143
 0:            9  0.3244748       71.30933
 0:           10  0.4523215       139.9974
 0:           11  0.6213929       383.1961
 0:           12  0.8416426       2896.457
 0: slurmstepd: error: *** STEP 62601522.0 ON h35m20 CANCELLED AT 2024-06-28T15:45:55 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: h35m20: tasks 1-3,5-11,13-39: Killed
srun: Terminating StepId=62601522.0
srun: error: h35m20: tasks 0,4: Killed
srun: error: h35m20: task 12: Killed
+ exglobal_forecast.sh[1]: postamble exglobal_forecast.sh 1719589491 137
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 15:45:57 with error code 137 (time elapsed: 00:01:06)
+ JGLOBAL_FORECAST[1]: postamble JGLOBAL_FORECAST 1719589466 137
+ preamble.sh[70]: set +x
End JGLOBAL_FORECAST at 15:45:57 with error code 137 (time elapsed: 00:01:31)
+ fcst.sh[1]: postamble fcst.sh 1719589463 137
+ preamble.sh[70]: set +x
End fcst.sh at 15:45:57 with error code 137 (time elapsed: 00:01:34)
_______________________________________________________________
Start Epilog on node h35m20 for job 62601522 :: Fri Jun 28 15:45:58 UTC 2024
Job 62601522 finished for user Terry.McGuinness in partition hera with exit code 137:0
_______________________________________________________________
End Epilogue Fri Jun 28 15:45:58 UTC 2024

RussTreadon-NOAA commented 4 days ago

C48mx500_3DVarAOWCDA FAILED on Hera for the same reason as before. The 20210324 18Z gdasfcst has NaN values. As noted above, C48mx500_3DVarAOWCDA passes on Hera using an earlier snapshot of feature/rename_atm before several merges of g-w develop into the branch.

CoryMartin-NOAA commented 4 days ago

I believe we have multiple, compounding errors/issues here. The forecast failure has a different cause than the other failures (which are still likely due to the cleanup tasks).

aerorahul commented 4 days ago

I believe we have multiple, compounding errors/issues here. The forecast failure has a different cause than the other failures (which are still likely due to the cleanup tasks).

The cleanup has been cleaned up in this test. So that cannot be it.

CoryMartin-NOAA commented 4 days ago

@aerorahul agreed, I meant that I bet the non-S2S tests will pass now (or fail in a different place)

JessicaMeixner-NOAA commented 4 days ago

I think the WCDA test failure is likely related to https://github.com/NOAA-EMC/global-workflow/pull/2681

@guillaumevernieres @AndrewEichmann-NOAA and myself have all been looking into this. I don't know if @AndrewEichmann-NOAA has updates yet on testing with PR #2681 completely reverted or not yet. I am building an update to that PR for testing but was hoping to get results from Andy first before hitting go on those.

guillaumevernieres commented 4 days ago

I think the WCDA test failure is likely related to #2681

@guillaumevernieres @AndrewEichmann-NOAA and myself have all been looking into this. I don't know if @AndrewEichmann-NOAA has updates yet on testing with PR #2681 completely reverted or not yet. I am building an update to that PR for testing but was hoping to get results from Andy first before hitting go on those.

Reverting the missing value back to 0 fixes the issue. I thought I tested this properly, but apparently not.

RussTreadon-NOAA commented 4 days ago

Thank you @guillaumevernieres . I resubmitted the failed gdasfcst on Hera with the following change to parm/config/gfs/config.ufs

@@ -401,7 +401,7 @@ if [[ "${skip_mom6}" == "false" ]]; then
   export cplflx=".true."
   model_list="${model_list}.ocean"
   nthreads_mom6=1
-  MOM6_DIAG_MISVAL="-1e34"
+  MOM6_DIAG_MISVAL="0.0"
   case "${mom6_res}" in
     "500")
       ntasks_mom6=8

This change is not sufficient or, more likely, I did not make the correct change. The WCDA gdasfcst still aborted. I now see that the PR #2681 change to config.ufs is more involved. Let me wait for the experts to chime in.

emcbot commented 4 days ago

Experiment C48_S2SWA_gefs FAILED on Hera in /scratch1/NCEPDEV/global/CI/2700/RUNTESTS/C48_S2SWA_gefs_c1ef4b30

aerorahul commented 4 days ago

While the change is fine, do we want to comment out the block (current approach) or simply remove all the commented out scripting?

We can remove it and replace it at a later time. I leave it to you. GitHub is experiencing issues in processing Pull Requests, so we are just waiting for it to come back so this branch can be updated with develop and the CI can be kicked off.

RussTreadon-NOAA commented 4 days ago

Given github issues I will leave exglobal_cleanup.sh alone.

RussTreadon-NOAA commented 4 days ago

Ran C48mx500_3DVarAOWCDA from RussTreadon-NOAA:feature/rename_atm at dcec0813. The 20210324 18Z gdasfcst successfully completed.

TerrenceMcGuinness-NOAA commented 3 days ago

I'm sorry to say there was a stack overflow error on the Jenkins agent on Hera in the script that was monitoring the CI tests, but the good news is that all tests passed; we were just not able to report the success with the automated system:

Terry.McGuinness (hfe06) RUNTESTS $ pwd
/scratch1/NCEPDEV/global/CI/2700/RUNTESTS
Terry.McGuinness (hfe06) RUNTESTS $ cat ci-run_check.log     
Experiment C48_ATM_dcec0813 Completed 1 Cycles: *SUCCESS* at Fri Jun 28 22:38:26 UTC 2024
Experiment C48mx500_3DVarAOWCDA_dcec0813 Completed 2 Cycles: *SUCCESS* at Fri Jun 28 23:02:49 UTC 2024
Experiment C48_S2SW_dcec0813 Completed 1 Cycles: *SUCCESS* at Sat Jun 29 00:26:49 UTC 2024
Experiment C96_atm3DVar_dcec0813 Completed 3 Cycles: *SUCCESS* at Sat Jun 29 00:34:14 UTC 2024
Experiment C96C48_hybatmDA_dcec0813 Completed 3 Cycles: *SUCCESS* at Sat Jun 29 00:40:20 UTC 2024
Experiment C96_atmaerosnowDA_dcec0813 Completed 3 Cycles: *SUCCESS* at Sat Jun 29 00:52:30 UTC 2024
Experiment C48_S2SWA_gefs_dcec0813 Completed 1 Cycles: *SUCCESS* at Sat Jun 29 13:04:08 UTC 2024

Setting the label to PASSED by hand.

RussTreadon-NOAA commented 2 days ago

Thank you @TerrenceMcGuinness-NOAA for the update. I was wondering what had happened. Great to hear that all tests passed on Hera.

RussTreadon-NOAA commented 1 day ago

Thank you @aerorahul , @WalterKolczynski-NOAA , and @TerrenceMcGuinness-NOAA for persistently working to get this PR into develop.