NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

CI self-test with KEEPDATA=YES #2734

Open TerrenceMcGuinness-NOAA opened 3 days ago

TerrenceMcGuinness-NOAA commented 3 days ago

Description

This is a CI self-test with KEEPDATA=YES to save off the RUNDIRS so we can capture the disk cost of running the CI tests.
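Since KEEPDATA=YES leaves the run directories in place, their disk footprint can be totaled once the CI run completes. A minimal sketch of that bookkeeping (the script is illustrative, not part of the workflow; it assumes the Hera RUNDIRS location used by these tests):

import os

RUNDIRS = "/scratch1/NCEPDEV/global/CI/STMP/RUNDIRS"

def du_bytes(path):
    """Recursively total file sizes under path, skipping files that vanish mid-walk."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass
    return total

# Report the size of each experiment's retained run directory.
for exp in sorted(os.listdir(RUNDIRS)):
    size_gib = du_bytes(os.path.join(RUNDIRS, exp)) / 1024**3
    print(f"{exp}: {size_gib:.1f} GiB")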

emcbot commented 3 days ago

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/logs/2021122018/gdassfcanl.log

Follow link here to view the contents of the above file(s): (link)

emcbot commented 3 days ago

Experiment C96_atmaerosnowDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atmaerosnowDA_7e868a54

emcbot commented 3 days ago

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48mx500_3DVarAOWCDA_7e868a54

RussTreadon-NOAA commented 2 days ago

C48mx500_3DVarAOWCDA failure

The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts with:

21:  (abort_ice)ABORTED:
21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21

As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.

This PR uses an older gdas.cd hash, while PR #2700 uses a newer one; the C48mx500_3DVarAOWCDA test fails in the same manner with both hashes. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.

RussTreadon-NOAA commented 2 days ago

C96_atmaerosnowDA failure

The C96_atmaerosnowDA failure in this PR differs from PRs #2700 and #2729. The 20211220 18Z gdassfcanl fails in this PR with the error message:

2:  FATAL ERROR: OPENING FILE: ./fnbgsi.003: NetCDF: Unknown file format
2:  STOP.
2: Abort(999) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 2

Local file fnbgsi.003 is a copy of 20211220.180000.sfc_data.tile3.nc from /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow. The source file is zero length:

 /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow:
  total used in directory 109600 available 71308833872
  drwxrwsr-x 2 Terry.McGuinness global     4096 Jun 28 01:33 .
  drwxr-sr-x 5 Terry.McGuinness global     4096 Jun 28 01:33 ..
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile1.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile2.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile3.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile4.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile5.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile6.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
  -rw-r--r-- 1 Terry.McGuinness global        0 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc
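Given that the executable aborts with "Unknown file format", a quick way to catch this class of problem before a downstream job runs is to scan the analysis directory for tiles that are empty or that netCDF cannot open. A rough sketch, assuming the netCDF4 Python package is available (this check is not part of the workflow):

import glob
import os
import netCDF4

snow_dir = ("/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/"
            "C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow")

for path in sorted(glob.glob(os.path.join(snow_dir, "*.sfc_data.tile*.nc"))):
    # Flag zero-length files first; netCDF would only report "Unknown file format".
    if os.path.getsize(path) == 0:
        print(f"EMPTY  {path}")
        continue
    try:
        with netCDF4.Dataset(path):
            pass
    except OSError as err:
        print(f"BAD    {path}: {err}")
    else:
        print(f"OK     {path}")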

According to the gdassnowanl.log file, 20211220.180000.sfc_data.tile3.nc originates from the following copy:

./gdassnowanl.log:2024-06-28 01:33:46,896 - INFO     - file_utils  : Copied /scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl/20211220.180000.sfc_data.tile3.nc to /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18//analysis/snow/20211220.180000.sfc_data.tile3.nc

File 20211220.180000.sfc_data.tile3.nc in the run directory is a non-zero-length file:

Hera(hfe04):/scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl$ ls -l 20211220.180000.sfc_data*
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc
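Since the source tiles under RUNDIRS are intact while the tile3 copy under COMROOT is empty, a pairwise size comparison of the two trees confirms whether tile3 is the only truncated transfer. A small sketch using the directories listed above (illustrative only):

import os

src_dir = ("/scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/"
           "C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl")
dst_dir = ("/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/"
           "C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow")

# Compare each NetCDF file in the run directory against its COMROOT copy.
for name in sorted(os.listdir(src_dir)):
    if not name.endswith(".nc"):
        continue
    src_size = os.path.getsize(os.path.join(src_dir, name))
    dst_path = os.path.join(dst_dir, name)
    dst_size = os.path.getsize(dst_path) if os.path.exists(dst_path) else None
    status = "OK" if src_size == dst_size else "MISMATCH"
    print(f"{status:8s} {name}  src={src_size}  dst={dst_size}")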

I am not familiar with snow DA. Tagging @jiaruidong2017. Jiarui, what are your thoughts on this failure?

jiaruidong2017 commented 2 days ago

Thanks @RussTreadon-NOAA for digging into this. I don't know why this happened, and I did not encounter this issue in my previous tests. Rerunning this CI test may help find the cause. @CoryMartin-NOAA do you have any thoughts on this?

RussTreadon-NOAA commented 2 days ago

Thank you @jiaruidong2017 for your reply. Do you routinely run C96_atmaerosnowDA as part of your development? If not, how, and how frequently, do you test JEDI snow DA in g-w?

jiaruidong2017 commented 2 days ago

@RussTreadon-NOAA I don't actually run the C96_atmaerosnowDA CI test for my development work; instead I run my own JEDI snow DA test. I have run my tests four times over the past two weeks.

RussTreadon-NOAA commented 2 days ago

@jiaruidong2017, to help with debugging, when did you make these runs, on which machine, and do you still have the log files online?

jiaruidong2017 commented 2 days ago

@RussTreadon-NOAA You can find the log files for my three tests here:

/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory04/logs/ (Today)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory03/logs/ (June 26)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory02/logs/ (June 15)

guillaumevernieres commented 2 days ago

> C48mx500_3DVarAOWCDA failure
>
> The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts with:
>
> 21:  (abort_ice)ABORTED:
> 21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
> 21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
>
> As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.
>
> This PR uses an older gdas.cd hash, while PR #2700 uses a newer one; the C48mx500_3DVarAOWCDA test fails in the same manner with both hashes. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.

@JessicaMeixner-NOAA just checked: the ocean and sea ice increments are all NaNs.
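For reference, a quick way to confirm which increment fields are NaN is to scan every floating-point variable in the increment file. A sketch assuming the netCDF4 and numpy packages (the file path is supplied on the command line):

import sys
import numpy as np
import netCDF4

path = sys.argv[1]  # path to an ocean or sea ice increment NetCDF file
with netCDF4.Dataset(path) as nc:
    for name, var in nc.variables.items():
        var.set_auto_mask(False)   # return raw arrays rather than masked arrays
        data = var[:]
        if np.issubdtype(data.dtype, np.floating) and np.isnan(data).any():
            print(f"{name}: {np.isnan(data).mean():.1%} NaN")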

JessicaMeixner-NOAA commented 2 days ago

PR https://github.com/NOAA-EMC/global-workflow/pull/2681 was not tested on Hera. I'm not sure why it wasn't (I know stmp was an issue), but that PR changes a lot for WCDA, and I think it could be the cause of the WCDA failures we are seeing; perhaps some logic clean-up at the end, or oversights in the non-CI testing, meant this was not caught. Based on some other threads, it also seems that https://github.com/NOAA-EMC/global-workflow/pull/2719 is possibly causing issues for tests not related to WCDA.

aerorahul commented 2 days ago

> PR #2681 was not tested on Hera. I'm not sure why it wasn't (I know stmp was an issue), but that PR changes a lot for WCDA, and I think it could be the cause of the WCDA failures we are seeing; perhaps some logic clean-up at the end, or oversights in the non-CI testing, meant this was not caught. Based on some other threads, it also seems that #2719 is possibly causing issues for tests not related to WCDA.

@JessicaMeixner-NOAA Thanks. I think the aggressive clean-up from #2719 is likely the root cause. I have left comments in #2719 and #2700 to test that.
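If the clean-up is indeed racing the copy of the snow analysis tiles into COMROOT, one way to rule out a zero-length destination is to stage each copy through a temporary name and rename it into place atomically. This is only a sketch of the idea, not a description of what wxflow's file_utils currently does:

import os
import shutil

def atomic_copy(src, dst):
    """Copy src to dst via a temporary name so dst is never seen half-written."""
    tmp = dst + ".partial"
    shutil.copyfile(src, tmp)   # write the complete contents under a temp name
    os.replace(tmp, dst)        # atomic rename on the same filesystem

# Example (paths illustrative):
# atomic_copy("20211220.180000.sfc_data.tile3.nc",
#             "/path/to/COMROOT/analysis/snow/20211220.180000.sfc_data.tile3.nc")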