gfs.v16.3.0 retrospective parallel started from 2021101518z on Cactus. So far, we have about 3-4 weeks of results. The overall forecast skill shows degradation in the NH. The DA team investigated possible causes and solutions. The run configured and maintained by @lgannoaa has been very helpful in letting the DA team spot a couple of issues in the gfsda.v16.3.0 package.
(1) An initialization problem was found for satellite bias correction coefficients for sensors whose coefficients are initialized from zero. The quasi-mode initialization procedure was skipped due to a bug merged from the GSI develop branch into gfs.v16.3.0.
The issue and diagnostics are documented in GSI Issue #438. The bug fix is provided in GSI PR #439 and has been merged into gfsda.v16.3.0.
A short gfs.v16.3.0 parallel test (v163t) was performed to verify the bug fix.
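For context, here is a minimal conceptual sketch in Python of why the skipped quasi-mode branch matters; the names and the toy update rule are hypothetical and this is not GSI code. A sensor whose bias-correction coefficients start from zero has no history, so it needs extra spin-up passes before its radiances can be bias-corrected meaningfully.

```python
# Conceptual sketch only -- NOT the GSI implementation. Names and the update
# rule are hypothetical; they just illustrate why skipping the quasi-mode
# check leaves zero-initialized coefficients effectively uncorrected.
import numpy as np

def needs_quasi_mode_init(coefs, tol=0.0):
    """A sensor whose bias-correction coefficients are all (near) zero has no
    history and should be spun up before its radiances are used actively."""
    return bool(np.all(np.abs(coefs) <= tol))

def update_bias_coefficients(coefs, predictors, omf, n_passes=1, gain=0.1):
    """Toy relaxation of the coefficients toward the O-F departures."""
    for _ in range(n_passes):
        residual = omf - predictors @ coefs
        coefs = coefs + gain * predictors.T @ residual / len(omf)
    return coefs

# The bug class described above: the quasi-mode branch was skipped, so a
# zero-initialized sensor only ever got the single regular pass.
coefs = np.zeros(3)                      # new sensor: all coefficients zero
predictors = np.random.randn(100, 3)     # toy predictor matrix
omf = predictors @ np.array([1.0, -0.5, 0.2]) + 0.05 * np.random.randn(100)

n_passes = 5 if needs_quasi_mode_init(coefs) else 1   # quasi-mode spin-up
coefs = update_bias_coefficients(coefs, predictors, omf, n_passes=n_passes)
print(coefs)
```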
(2) Increasing NSST biases and RMS of O-F (no bias correction) are observed in the time series of AVHRR MetOp-B channel 3 and the window channels from hyperspectral sensors (IASI, CrIS). The foundation temperature bias and RMS compared to the operational GFS and OSTIA increase with time. It was found that the NSST increment file from the GSI was not being passed into global_cycle properly.
The issue and diagnostics are documented in detail in GSI Issue #449.
The bug fix is documented in GSI PR #448
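As a rough illustration of the fix for (2), the sketch below shows the kind of staging step that has to happen between the analysis and the surface cycling: the NSST increment written by the GSI must be placed where global_cycle expects it, and its absence should be treated as an error rather than silently ignored. The paths and the default file name here are assumptions for illustration, not the actual global-workflow script logic.

```python
# Minimal sketch (hypothetical paths/filenames, not the actual global-workflow
# scripts) of staging the NSST (Tf) increment for global_cycle: if the file is
# missing or empty, surface cycling would silently proceed without it.
import os
import shutil

def stage_nsst_increment(gsi_run_dir, cycle_run_dir, fname="dtfanl.nc"):
    """Copy the NSST increment from the analysis directory into the
    global_cycle working directory; fail loudly if it is missing or empty."""
    src = os.path.join(gsi_run_dir, fname)
    if not os.path.isfile(src) or os.path.getsize(src) == 0:
        raise FileNotFoundError(f"NSST increment {src} missing or empty; "
                                "global_cycle would run without it")
    dst = os.path.join(cycle_run_dir, fname)
    shutil.copy2(src, dst)
    return dst

# Usage (paths are illustrative only):
# stage_nsst_increment("/path/to/gdas_analysis", "/path/to/global_cycle_run")
```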
A short gfs.v16.3.0 real-time parallel (starting from 2022061918z; v163ctl) with the bug fixes from (1) and (2) is currently running on Dogwood to verify them.
We will keep this running for a few days....
Here is the link to the Verification page: https://www.emc.ncep.noaa.gov/gc_wmb/eliu/v163ctl/
We should stop the retrospective parallel on Cactus and re-run it with the bug fixes.
NCO announced that Cactus will become the dev machine on the evening of Aug 4th. The retro will start with CDATE=2021101518.
The retro started on the evening of Aug 4th.
The retro was paused at CDATE=2021101900 on the morning of Aug 5th due to HPSS transfer slowness, which caused high COM usage. On the evening of Aug 5th the transfer speed remained slow, so the parallel remains paused.
Tag: @emilyhcliu @dtkleist @junwang-noaa. @emilyhcliu and @dtkleist decided today to modify this parallel to write restart files and archive them to HPSS every 7 days. This change is now in place (a sketch of the frequency rule follows below).
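For readers following along, here is a small sketch of the cycle-frequency rule just described; the helper is hypothetical and is not the ecflow/global-workflow configuration itself. Restart files are written and archived to HPSS only on cycles that land every N days from the first cycle of the retro.

```python
# Hypothetical helper illustrating the "archive restarts every N days" rule.
from datetime import datetime

def archive_restarts_this_cycle(cdate, first_cdate="2021101518", every_n_days=7):
    """Return True if this cycle (YYYYMMDDHH) lands on the archive interval."""
    cyc = datetime.strptime(cdate, "%Y%m%d%H")
    start = datetime.strptime(first_cdate, "%Y%m%d%H")
    elapsed = cyc - start
    return elapsed.total_seconds() % (every_n_days * 86400) == 0

print(archive_restarts_this_cycle("2021102218"))  # 7 days after start -> True
print(archive_restarts_this_cycle("2021101918"))  # mid-interval -> False
```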
Cactus had multiple system issues on the evening of Aug 5th, including job submission problems, missing jobs, zero-size files, and disappearing archive jobs. Multiple reruns and cleanups were performed. Resumed at CDATE=2021101900.
A Cactus file system issue caused the para check job to fail. Example message: mkdir: cannot create directory '/retro1-v16-ecf2021101818check': Permission denied. A Cactus HPSS transfer system issue caused multiple archive jobs to fail. Example error message: Cannot send after transport endpoint shutdown. The ecen and efcs jobs became zombie jobs. Archive jobs continued to fail after several attempts to recover the parallel; therefore, this parallel is paused at CDATE=2021101906 for the remainder of the weekend.
This parallel resumed on the morning of Aug 8th. Cactus archive jobs continued to be impacted by the system issue "Cannot send after transport endpoint shutdown". Helpdesk ticket sent: Ticket#2022080810000045. NCO fixed the system issue and the parallel is now resumed. However, due to the system issue some files were already cleaned out of PTMP, causing incomplete archive jobs for CDATE=2021101518 to 2021101718, 2021101800, 2021101806, 2021101906, 2021102012, and 2021102018. @emilyhcliu agreed in a meeting on Aug 8th to continue the parallel as is.
8/9: increased the eupd job wall clock by 10 minutes because it failed multiple times due to hitting the wall clock limit.
Transfer speeds remained slow overnight on 8/9; by the morning of 8/10 PTMP reached its critical limit because archive jobs could not finish. The parallel is paused at CDATE=2021102512 until the transfer jobs finish. Tag: @emilyhcliu @dtkleist @junwang-noaa
Parallel resumed at CDATE=2021102518 for one cycle. It will be paused at CDATE=2021102600 in preparation for WAFS testing. Starting with CDATE=2021102518, post is using the new tag upp_v8.2.0 (02086a8) and WAFS is using tag gfs_wafs.v6.3.1 (da909f).
Parallel paused at CDATE=2021102806 due to system errors in archive jobs and high PTMP usage (disk quota exceeded on the group PTMP).
Reran a few zombie archive jobs to keep the parallel going while PTMP cleanup continues. At 10:00 EST 8/13 the current CDATE=2021103106. Tested the WAFS GCIP job on CDATE=2021103100; it failed. Email has been sent to the developer.
PTMP filled up last night and the parallel was paused for a few hours. It resumed at CDATE=2021110406.
WAFS testing is now complete. The code manager checked the output and log and found no issues.
The GEMPAK and AWIPS downstream code managers checked the output and logs of a 00Z test and found no issues.
The BUFR sounding code manager checked the output and log of a 00Z test and found no issues.
Parallel paused for a few hours due to a transfer job system issue. After rerunning 34 jobs, the parallel is now resumed at CDATE=2021110612.
Emergency failover of production to Cactus. This parallel is now paused in preparation for running in white space. Cactus is now the production machine, effective immediately at CDATE=2021110618.
This parallel is resumed.
A zombie gfs fcst job was found. Rerun using restart RERUN_RESTART/20211111.060000.coupler.res (a restaging sketch follows below).
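The sketch below shows one way such a rerun can be restaged: every restart file carrying the saved timestamp (e.g. the 20211111.060000.coupler.res mentioned above) is copied from RERUN_RESTART back into the forecast RESTART directory before the job is resubmitted. The directory layout and helper are assumptions, not the workflow's actual scripts.

```python
# Sketch (directory layout assumed) of restaging a saved restart set by its
# timestamp before resubmitting the zombie gfs fcst job.
import glob
import os
import shutil

def stage_rerun_restarts(rerun_dir, restart_dir, stamp):
    """Copy every restart file carrying the given timestamp (YYYYMMDD.HHMMSS)
    from RERUN_RESTART into the forecast RESTART directory."""
    files = sorted(glob.glob(os.path.join(rerun_dir, f"{stamp}.*")))
    if not files:
        raise FileNotFoundError(f"no restart files for {stamp} in {rerun_dir}")
    os.makedirs(restart_dir, exist_ok=True)
    return [shutil.copy2(f, restart_dir) for f in files]

# Usage (paths illustrative):
# stage_rerun_restarts(".../RERUN_RESTART", ".../RESTART", "20211111.060000")
```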
This parallel is paused due to a production switch. Archive job rerun is in progress.
The gfs_wave_post_bndpntbll job has continued to hit the wall clock limit since late August 17th, impacting all 4 cycles in PDY=20211107. Debugging is in progress.
@lgannoaa I will look into these jobs, but these jobs are known to be highly reactive to file system issues and in general have longer run times for us versus for NCO. I'm looking to see if there are any red flags, but likely the wall clock time just needs to be extended, and these jobs should re-run to completion within the longer wall clock time.
@emilyhcliu @JessicaMeixner-NOAA May I know who is looking at the output of this job at this time? I ask because it looks to me like the outputs of this job, gfswave.t00z.ibpcbull_tar and gfswave.t00z.ibpbull_tar, are not being archived to HPSS. Can this job be turned off for this parallel?
This parallel is on pause because this job continues to fail for all cycles.
@lgannoaa
Since the failed post job is a known problem in WAVES and its outputs are not used in the following cycles, let's skip these jobs and move the parallel forward.
@emilyhcliu The parallel is now resumed at CDATE=2021110812 with the gfs_wave_post_bndpntbll jobs turned off for all four cycles.
Sounds like a good plan @emilyhcliu
Parallel resumed at CDATE=2021112100 after the PTMP-full issue was resolved.
The gfs_wave_post_bndpntbll jobs for CDATE=2021112212 and 2021112218 were turned back on for a test requested by the helpdesk to help debug the job failure issue. Both jobs completed in 40 minutes. These jobs are now resumed in this parallel.
This parallel will be paused at the end of CDATE=2021112918 to perform a CRTM-related update.
See comment in issue #952 for explanation.
Based on the 43-day retrospective cycling run, retro1-v16-ecf, the RMS and bias against the OSTIA foundation temperature analysis have been generated. See the figure, in which, besides retro1-v16-ecf (ECF), 6 other analyses are included as well: OPR is the operational GFS, CMC is the CMC Tf analysis, and C06, C07 and C08 are experiments done with the operational GFS with the NSST update package. The figure covers the global domain. In terms of RMS, retro1-v16-ecf (ECF) is significantly closer to OSTIA (comparable with the experiments done before).
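For reference, the two statistics plotted are the bias (mean of experiment minus OSTIA) and the RMS of the same differences over a region. The sketch below shows a minimal way to compute them; the array names, toy data, and unweighted averaging are assumptions for illustration, not the actual verification code.

```python
# Minimal sketch of the bias and RMS of foundation temperature (Tf) relative
# to a reference analysis such as OSTIA, over a (possibly masked) region.
import numpy as np

def tf_bias_rms(tf_exp, tf_ref, mask=None):
    """Return (bias, rms) of tf_exp - tf_ref over unmasked points."""
    diff = tf_exp - tf_ref
    if mask is not None:
        diff = diff[mask]
    return float(np.mean(diff)), float(np.sqrt(np.mean(diff ** 2)))

# Toy example: an experiment 0.1 K warmer than the reference field
ref = 290.0 + np.random.randn(180, 360)
bias, rms = tf_bias_rms(ref + 0.1, ref)
print(f"bias={bias:.3f} K  rms={rms:.3f} K")
```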
This parallel is paused at the end of CDATE=2021120518 due to transfer job slowness.
@XuLi-NOAA Thanks for your NSST diagnostics. The results are similar to the test runs you did in June. Great!
On August 27, 2022, the gfs.v16.3.0 package used for the parallels was updated with the following changes. These changes do not impact assimilation results.
The AVHRR and VIIRS entries were added to the RadMon utility in the following files:
modified: util/Radiance_Monitor/nwprod/gdas_radmon/fix/gdas_radmon_satype.txt
modified: util/Radiance_Monitor/nwprod/gdas_radmon/fix/gdas_radmon_scaninfo.txt
These changes will go into gfsda.v16.3.0 along with Russ's Bugzilla fixes. NCO is aware and expecting the re-tag of gfsda.v16.3.0.
The transfer speed improved over the weekend of 8/26-8/28, so the parallel was resumed. It is at CDATE=2021121106 as of 8:00a EST on 8/29.
Many archive jobs failed with the system issue "Connection timed out". Rerun in progress.
Zombie jobs and system errors caused a few jobs to fail. Rerun in progress.
33 archive jobs failed over the night of Aug 30th due to the QOS priority of production jobs. Rerun in progress.
It looks like the HPSS speed improvement is solid on WCOSS2 now. Modified this parallel to write restart files to HPSS every day. This change is now in place effective CDATE=2021121600.
We still see impacts during the night when production transfer jobs take higher priority. Some of our transfer jobs get cancelled by the HPSS system due to slow transfer speeds. The HPSS helpdesk acknowledged the ticket. Therefore, the failed transfer job issue is here (on Cactus) to stay.
@XuLi-NOAA Looks like the SH performs better than the NH. These plots should be posted in the issue for the real-time parallel.
It has been moved to #952 .
Reran 35 archive jobs due to previously known system issues. Reran 32 archive jobs due to previously known system issues.
Management requested running a full cycle with the library updates in GFSv16.3.0. In preparation, the following modification is planned:
As of the morning of Sep 7th, the full cycle test is complete. The one exception is the GEMPAK job, which does not have canned data.
Management has decided to update only the bufr_ver module, to 11.7.0; all other libraries remain the same as before this full cycle run. Therefore, on Sep 7th HOMEgfs was updated with this change and rebuilt. The current parallel is resumed at CDATE=2022010406.
Management has made a decision on updating the GSI and model packages. The GSI package is ready and the model package is still pending. This parallel is paused at CDATE=2022010700 to check out and build the GSI package.
During the switches between the full library updates, the bufr_ver-only change, and the GSI update, the CRTM version update was left out. The old crtm 2.3.0 is now updated to crtm 2.4.0, and GSI has been rebuilt with crtm 2.4.0. This parallel is being rerun from 2022010600.
For the retrospective run, we will rewind 14 days and restart on the 2022010600 cycle.
With Lin's revised and improved global-workflow with ecflow and the better HPSS transfer rate, it is not a setback to rewind the parallel run. The most important thing is that we caught the issue, fixed it, and moved forward.
There was an emergency production switch on the morning of 9/21. There were 15 job failures (archive jobs, METplus jobs, and regular jobs) due to the switch. Debug/rerun/recovery is in progress. The impacted jobs are in CDATE=2022020100, 2022020106, and 2022020112.
The ecen job for 2022020112 failed. Debugging traced the failure to the previous cycle (2022020106), whose job output was corrupted by the production switch. Therefore, this parallel is now rewound two cycles and rerun from 2022020106.
The rerun from 2022020106 resolved the issue.
NCO executed a production switch on 9/22. Cactus is now back to being the dev machine. This parallel will resume at CDATE=2022020206.
Update on the NSST foundation temperature analysis performance monitoring in the GFSv16.3 retrospective run (retro1-v16-ecf). This is an extension of the figure reported 28 days ago, and 5 more areas are included this time: Global, N.Pole, N.Mid, Tropics, S.Mid, S.Pole. From the figures, we can see that RMS has improved across the whole period (about three and a half months). However, there is a worry: the bias is getting worse. In the global figure, the bias improved at the beginning (about 10 days) and then became even colder than operational. From the smaller-area figures, we can see the issue occurs mainly in the Tropics and S.Mid areas. The NSST package had been tested, but never over this long a time period. At the least, this is an alert.
Description
This issue documents the GFS v16.3 retro parallel for implementation. Reference: #776. @emilyhcliu is the implementation POC.
The configuration for this parallel is:
First full cycle starting CDATE (retro): 2021101518
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: retro1-v16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf
METplus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf/fits
Verification web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/retro1-v16-ecf (updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/retro1-v16-ecf
FIT2OBS: /lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5 df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)
obsproc: /lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2 83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)
prepobs /lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1 5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)
HOMEMET /apps/ops/para/libs/intel/19.1.3.304/met/9.1.3
METplus /apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1
verif_global /lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd 1aabae3aa (HEAD, tag: verif_global_v2.9.4)