Open TerrenceMcGuinness-NOAA opened 3 days ago
Experiment C96_atmaerosnowDA FAILED on Hera with error logs:
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/logs/2021122018/gdassfcanl.log
Follow link here to view the contents of the above file(s): (link)
Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atmaerosnowDA_7e868a54
Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48mx500_3DVarAOWCDA_7e868a54
C48mx500_3DVarAOWCDA failure
The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts with
21: (abort_ice)ABORTED:
21: (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.
This PR uses an older gdas.cd hash. PR #2700 uses a newer gdas.cd hash. The C48mx500_3DVarAOWCDA test fails with both hashes in the same manner. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.
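For anyone who wants to locate the NaN entries quickly, here is a minimal sketch that lists every line of a log containing a standalone NaN token. The function name `lines_with_nan` is hypothetical, and the sketch makes no assumption about the exact layout of the Tsfcn diagnostics; it only matches the token itself.

```python
import re

# Match "NaN" as a whole word, case-insensitively, so that words such as
# "nanometer" are not reported.
NAN_TOKEN = re.compile(r"\bnan\b", re.IGNORECASE)


def lines_with_nan(lines):
    """Return (line_number, text) for each line holding a standalone NaN token.

    `lines` is any iterable of strings, for example an open log file.
    Line numbers are 1-based.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if NAN_TOKEN.search(line)
    ]
```

To use it, open the forecast log with `open(path)` and pass the file handle directly; the path is whatever log you are inspecting.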
C96_atmaerosnowDA failure
The C96_atmaerosnowDA failure in this PR differs from PR #2700 and #2729. The 20211220 18Z gdassfcanl fails in this PR with the error message:
2: FATAL ERROR: OPENING FILE: ./fnbgsi.003: NetCDF: Unknown file format
2: STOP.
2: Abort(999) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 2
Local file fnbgsi.003 is a copy of 20211220.180000.sfc_data.tile3.nc from /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow. The source file is zero length:
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow:
total used in directory 109600 available 71308833872
drwxrwsr-x 2 Terry.McGuinness global 4096 Jun 28 01:33 .
drwxr-sr-x 5 Terry.McGuinness global 4096 Jun 28 01:33 ..
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile1.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile2.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile3.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile4.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile5.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile6.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
-rw-r--r-- 1 Terry.McGuinness global 0 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc
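A quick way to spot this kind of problem without reading the listing by eye is to scan the directory for zero-length files. The following is a minimal sketch, not part of global-workflow; the function name `find_empty_files` and the default `*.nc` pattern are assumptions chosen for illustration.

```python
from pathlib import Path


def find_empty_files(directory, pattern="*.nc"):
    """Return the sorted names of zero-length files in `directory`.

    Only regular files whose names match `pattern` are considered.
    """
    return sorted(
        path.name
        for path in Path(directory).glob(pattern)
        if path.is_file() and path.stat().st_size == 0
    )
```

Pointing it at the analysis/snow directory above should report only the tile3 file, going by the sizes shown in the listing.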
According to gdassnowanl.log, 20211220.180000.sfc_data.tile3.nc was copied into COMROOT from the run directory:
./gdassnowanl.log:2024-06-28 01:33:46,896 - INFO - file_utils : Copied /scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl/20211220.180000.sfc_data.tile3.nc to /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18//analysis/snow/20211220.180000.sfc_data.tile3.nc
File 20211220.180000.sfc_data.tile3.nc in the run directory is non-zero length:
Hera(hfe04):/scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl$ ls -l 20211220.180000.sfc_data*
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc
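Since the run-directory copy is intact but the COMROOT copy is empty, comparing sizes across the two directories after the copy step would flag the mismatch right away. Below is a minimal sketch of such a check. It is not how `file_utils` works; the function name `find_size_mismatches` and its arguments are hypothetical.

```python
from pathlib import Path


def find_size_mismatches(src_dir, dst_dir, pattern="*.nc"):
    """Compare sizes of files in `src_dir` with their copies in `dst_dir`.

    Returns a list of (name, src_size, dst_size) tuples for every source
    file whose copy is missing or has a different size. `dst_size` is
    None when the copy does not exist.
    """
    mismatches = []
    for src in sorted(Path(src_dir).glob(pattern)):
        if not src.is_file():
            continue
        dst = Path(dst_dir) / src.name
        src_size = src.stat().st_size
        dst_size = dst.stat().st_size if dst.is_file() else None
        if dst_size != src_size:
            mismatches.append((src.name, src_size, dst_size))
    return mismatches
```

Here `src_dir` would be the gdassnowanl_18/anl run directory, `dst_dir` the analysis/snow directory under COMROOT, and a pattern such as `20211220.180000.sfc_data*` would restrict the check to the 18Z tiles.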
I am not familiar with snow DA. Tagging @jiaruidong2017 . Jiarui, what are your thoughts on this failure?
Thanks @RussTreadon-NOAA for digging into this. I actually don't have any idea why this happened, and I haven't encountered this issue in my previous tests. Rerunning this CI test may help find the cause. @CoryMartin-NOAA do you have any thoughts on this?
Thank you @jiaruidong2017 for your reply. Do you routinely run C96_atmaerosnowDA as part of your development? If not, how do / how frequently do you test JEDI snow DA in g-w?
@RussTreadon-NOAA I actually don't run the C96_atmaerosnowDA CI test for my development work; instead, I run my own JEDI snow DA test. I have run my tests four times over the past two weeks.
@jiaruidong2017 , to help with debugging, when did you make these runs, on which machine, and do you still have the log files online?
@RussTreadon-NOAA You can find the log files for my three tests at:
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory04/logs/ (Today)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory03/logs/ (June 26)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory02/logs/ (June 15)
@JessicaMeixner-NOAA just checked, the ocean and seaice increments are all nans.
PR #2681 was not tested on Hera. I'm not sure why (I know stmp was an issue, but this PR changes a lot for WCDA). I think it could be the cause of the WCDA failures we are seeing; the problem may have gone unnoticed because of some logic clean-up at the end or oversights in non-CI testing. It also seems, based on some other threads, that #2719 may be causing issues for tests not related to WCDA.
@JessicaMeixner-NOAA Thanks. I think it is the aggressive clean-up from #2719 that is likely the root cause. I have left a comment for it in #2719 and #2700 to test that.
Description
This is a CI self-test with KEEPDATA=YES to save off RUNDIRS and capture the disk costs of running CI tests.