NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

GFS v16.3 retro parallel for implementation #951

Closed lgannoaa closed 2 years ago

lgannoaa commented 2 years ago

Description

This issue documents the GFS v16.3 retro parallel for implementation. Reference: #776. @emilyhcliu is the implementation POC.

The configuration for this parallel is:

First full cycle starting CDATE (retro): 2021101518
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: retro1-v16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf
METplus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf/fits
Verification web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/retro1-v16-ecf (updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/retro1-v16-ecf
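For reference, a minimal sketch of how these settings might be expressed as shell variables in an EXPDIR-style config; PSLOT/SDATE/HOMEgfs/EXPDIR follow common global-workflow naming, while ROTDIR/ARCDIR/ATARDIR are illustrative assumptions mapped to the paths listed above.

```bash
# Sketch only: the experiment paths above expressed as shell variables.
export PSLOT="retro1-v16-ecf"
export SDATE="2021101518"                                   # first full-cycle CDATE
export HOMEgfs="/lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0"
export EXPDIR="${HOMEgfs}/parm/config"
export ROTDIR="/lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3"        # COM
export ARCDIR="/lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf"         # on-line archive
export ATARDIR="/NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/retro1-v16-ecf"  # HPSS archive
```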

FIT2OBS: /lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5 df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)

obsproc: /lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2 83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)

prepobs: /lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1 5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)

HOMEMET: /apps/ops/para/libs/intel/19.1.3.304/met/9.1.3

METplus: /apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1

verif_global: /lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd 1aabae3aa (HEAD, tag: verif_global_v2.9.4)

emilyhcliu commented 2 years ago

Status Update from DA - issues, diagnostics, solution and moving forward

Background

The gfs.v16.3.0 retrospective parallel started from 2021101518z on Cactus. So far, we have about 3-4 weeks of results. The overall forecast skill shows degradation in the NH. The DA team investigated possible causes and solutions. The run configured and maintained by @lgannoaa has been very helpful for the DA team in spotting a couple of issues in the gfsda.v16.3.0 package.

Issues, diagnostics, bug fixes, and tests

(1) An initialization problem with satellite bias correction coefficients was found for sensors whose coefficients are initialized from zero. The quasi-mode initialization procedure was skipped due to a bug merged from GSI develop into gfs.v16.3.0.

The issue and diagnostics are documented in GSI issue #438. The bug fix is provided in GSI PR #439 and has been merged into gfsda.v16.3.0.

A short gfs.v16.3.0 parallel test (v163t) was performed to verify the bug fix.

(2) Increasing NSST biases and RMS of O-F (no bias) are observed in the time series of AVHRR MetOp-B channel 3 and the window channels from the hyperspectral sensors (IASI, CrIS). Foundation temperature bias and RMS compared to the operational GFS and OSTIA increase with time. It was found that the NSST increment file from the GSI was not being passed into the global cycle properly.

The issue and diagnostics in detail are documented in GSI Issue #449

The bug fix is documented in GSI PR #448
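As an illustration of the handoff involved (not the actual fix in GSI PR #448), here is a minimal sketch that checks the NSST foundation-temperature file written by the GSI is present and non-empty in COM before global_cycle runs; the dtfanl file name and COM layout below are assumptions for illustration.

```bash
# Sketch only: sanity-check that the GSI's NSST output is visible to global_cycle.
# File name (dtfanl) and directory layout are assumptions, not taken from the fix.
CDATE=2021101518
PDY=${CDATE:0:8}; cyc=${CDATE:8:2}
COM="/lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3"
nsst_file="${COM}/gdas.${PDY}/${cyc}/atmos/gdas.t${cyc}z.dtfanl.nc"
if [[ ! -s "${nsst_file}" ]]; then
  echo "FATAL: NSST file missing or empty; global_cycle would not see the update: ${nsst_file}" >&2
  exit 1
fi
```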

Test

A short gfs.v16.3.0 real-time parallel (starting from 2022061918z; v163ctl) with the bug fixes from (1) and (2) is currently running on Dogwood to verify the bug fixes.

We will keep this running for a few days....

Here is the link to the Verification page: https://www.emc.ncep.noaa.gov/gc_wmb/eliu/v163ctl/

We should stop the retrospective parallel on Cactus and re-run it with the bug fixes.

lgannoaa commented 2 years ago

NCO announced that Cactus will become the dev machine on the evening of Aug 4th. The retro will start with CDATE=2021101518.

lgannoaa commented 2 years ago

The retro started on the evening of Aug. 4th.

lgannoaa commented 2 years ago

The retro was paused at CDATE=2021101900 on the morning of Aug. 5th due to HPSS transfer slowness, which caused high COM usage. As of the evening of Aug. 5th, the transfer speed remained slow and the parallel remained paused.

lgannoaa commented 2 years ago

Tag: @emilyhcliu @dtkleist @junwang-noaa. @emilyhcliu and @dtkleist decided today to modify this parallel to write restart files and archive them to HPSS every 7 days. This change is now in place.
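A minimal sketch of where such a change might live, assuming the archive cadence is controlled in EXPDIR/config.arch; the variable names and values below are assumptions, not the exact settings used in this parallel.

```bash
# Sketch only (assumed variable names): archive restart/warm-start files to HPSS
# on a 7-day cadence to reduce pressure on the transfer queue.
# In ${EXPDIR}/config.arch:
export ARCH_CYC=00            # cycle whose restarts are archived
export ARCH_WARMICFREQ=7      # days between warm-start (restart) archives
export ARCH_FCSTICFREQ=7      # days between forecast-only IC archives
```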

lgannoaa commented 2 years ago

Cactus had multiple system issues on the evening of Aug 5th, including job submission problems, missing jobs, zero-size files, and archive jobs disappearing. Multiple reruns and cleanups were performed. Resumed at CDATE=2021101900.

lgannoaa commented 2 years ago

Cactus had a file system issue that caused the para check job to fail. Example message: mkdir: cannot create directory '/retro1-v16-ecf2021101818check': Permission denied. Cactus also had an HPSS transfer system issue that caused multiple archive jobs to fail. Example error message: Cannot send after transport endpoint shutdown. The ecen and efcs jobs became zombie jobs, and archive jobs continued to fail after several attempts to recover the parallel. Therefore, this parallel is paused at CDATE=2021101906 for the remainder of the weekend.

lgannoaa commented 2 years ago

This parallel was resumed on the morning of Aug. 8th. Cactus archive jobs continued to be impacted by the system issue "Cannot send after transport endpoint shutdown". A helpdesk ticket was sent (Ticket#2022080810000045) and NCO fixed the system issue. The parallel is now resumed. However, due to the system issue, some files had already been cleaned out of PTMP, resulting in incomplete archive jobs for CDATE=2021101518 to 2021101718, 2021101800, 2021101806, 2021101906, 2021102012, and 2021102018. @emilyhcliu agreed in a meeting on Aug 8th to continue the parallel as is.

lgannoaa commented 2 years ago

8/9: increased the eupd job wall clock by 10 minutes because it failed multiple times due to hitting the wall-clock limit.
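For context, a hedged sketch of what this change implies on WCOSS2 with PBS: raising the wall-clock request in the eupd job card. The job name, queue, file location, and original 00:30:00 value are assumptions.

```bash
# Sketch only: bump the eupd wall clock by 10 minutes in its PBS job card.
#PBS -N retro1_v16_ecf_eupd
#PBS -q dev                      # queue name assumed
#PBS -l walltime=00:40:00        # was 00:30:00 (assumed); raised after repeated wall-clock failures
```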

lgannoaa commented 2 years ago

The transfer speed remained slow overnight on 8/9; by the morning of 8/10, PTMP reached its critical limit because archive jobs could not finish. The parallel is paused at CDATE=2021102512 until the transfer jobs finish. Tag: @emilyhcliu @dtkleist @junwang-noaa

lgannoaa commented 2 years ago

The parallel resumed at CDATE=2021102518 for one cycle. It will be paused at CDATE=2021102600 in preparation for WAFS testing. Starting with CDATE=2021102518, post is using the updated tag upp_v8.2.0 (02086a8) and WAFS is using tag gfs_wafs.v6.3.1 (da909f).

lgannoaa commented 2 years ago

The parallel was paused at CDATE=2021102806 due to system errors in archive jobs and high PTMP usage (disk quota exceeded on group PTMP).

lgannoaa commented 2 years ago

Reran a few zombie archive jobs to keep the parallel going and allow PTMP cleanup to continue. At 10:00 EST 8/13 the current CDATE=2021103106. Tested the WAFS GCIP job on CDATE=2021103100; it failed. An email has been sent to the developer.

lgannoaa commented 2 years ago

PTMP filled up last night and the parallel was paused for a few hours. It has resumed at CDATE=2021110406.

lgannoaa commented 2 years ago

WAFS testing is now complete. The code manager checked the output and logs and found no issues.

lgannoaa commented 2 years ago

The GEMPAK and AWIPS downstream code manager checked the output and logs from a 00Z test and found no issues.

lgannoaa commented 2 years ago

The BUFR sounding code manager checked the output and logs from a 00Z test and found no issues.

lgannoaa commented 2 years ago

The parallel was paused for a few hours due to a transfer job system issue. After rerunning 34 jobs, the parallel has now resumed at CDATE=2021110612.

lgannoaa commented 2 years ago

Emergency failover of production to Cactus. This parallel is now paused in preparation to run on white space. Cactus is now the production machine, effective immediately at CDATE=2021110618.

lgannoaa commented 2 years ago

This parallel is resumed.

lgannoaa commented 2 years ago

A zombie job was found with the gfs fcst. Rerunning using restart RERUN_RESTART/20211111.060000.coupler.res.
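A minimal sketch of this recovery pattern, assuming the forecast is restarted by staging the saved mid-forecast restart set back into the directory the model reads restarts from before resubmitting the gfs fcst job; the cycle directory and layout below are hypothetical.

```bash
# Sketch only: restage the saved restart set (including 20211111.060000.coupler.res)
# from RERUN_RESTART into RESTART, then resubmit the gfs fcst job.
# The gfs.20211106/18 cycle directory is an assumption for illustration.
memdir="/lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3/gfs.20211106/18/atmos"
mkdir -p "${memdir}/RESTART"
cp -p "${memdir}/RERUN_RESTART/20211111.060000."* "${memdir}/RESTART/"
```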

lgannoaa commented 2 years ago

This parallel is paused due to a production switch. Archive job reruns are in progress.

lgannoaa commented 2 years ago

The gfs_wave_post_bndpntbll job has continued to hit the wall-clock limit since late August 17th, impacting all 4 cycles in PDY=20211107. Debugging is in progress.

JessicaMeixner-NOAA commented 2 years ago

The gfs_wave_post_bndpntbll job has continued to hit the wall-clock limit since late August 17th, impacting all 4 cycles in PDY=20211107. Debugging is in progress.

@lgannoaa I will look into these jobs, but they are known to be highly sensitive to file system issues and in general have longer run times for us than for NCO. I'm looking to see if there are any red flags, but likely the wall-clock time just needs to be extended, and these jobs should re-run to completion within the longer limit.

@emilyhcliu @JessicaMeixner-NOAA May I know who is looking at the output of this job at this time? I ask because it looks to me like the outputs of this job, gfswave.t00z.ibpcbull_tar and gfswave.t00z.ibpbull_tar, are not being archived to HPSS. Can this job be turned off for this parallel?

This parallel is paused because this job continues to fail for all cycles.

emilyhcliu commented 2 years ago

@lgannoaa
Since the failed post job is a known problem in WAVES and its outputs are not used in the following cycles, let's skip these jobs and move the parallel forward.

@emilyhcliu The parallel has now resumed at CDATE=2021110812 with the gfs_wave_post_bndpntbll jobs turned off for all four cycles.
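A minimal sketch of how such jobs might be skipped, assuming the parallel is driven by an ecflow suite: marking each cycle's bndpntbll task complete lets the suite proceed without running it. The suite/family node paths below are hypothetical; `ecflow_client --force=complete` is a standard ecflow command.

```bash
# Sketch only: force the wave post bndpntbll task to complete for each cycle so
# downstream triggers are satisfied without running the job. Node paths are hypothetical.
for cyc in 00 06 12 18; do
  ecflow_client --force=complete \
    /retro1-v16-ecf/${cyc}/gfs/wave/post/jgfs_wave_post_bndpntbll
done
```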

JessicaMeixner-NOAA commented 2 years ago

Sounds like a good plan @emilyhcliu

lgannoaa commented 2 years ago

The parallel resumed at CDATE=2021112100 after the PTMP full issue was resolved.

lgannoaa commented 2 years ago

The gfs wave post bndpntbll jobs for CDATE=2021112212 and 2021112218 were turned on for a test requested by the helpdesk to help debug the job failure issue. Both jobs completed in 40 minutes. These jobs have now resumed running in this parallel.

lgannoaa commented 2 years ago

This parallel will be paused at the end of CDATE=2021112918 to perform a CRTM-related update.

emilyhcliu commented 2 years ago

This parallel will be paused at the end of CDATE=2021112918 to perform a CRTM-related update.

See comment in issue #952 for explanation.

XuLi-NOAA commented 2 years ago

[Figure: RMS_BIAS_to_ostia_retro_v163_7exps_2021101600_2021112718]
Based on the 43-day retrospective cycling run (retro1-v16-ecf), the RMS and bias against the OSTIA foundation temperature analysis have been generated. See the figure, in which, besides retro1-v16-ecf (ECF), six other analyses are included as well. OPR is the operational GFS, CMC is the CMC Tf analysis, and C06, C07, and C08 are experiments done with the operational GFS with the NSST update package. The figure covers the global area. We can see that, in terms of RMS, retro1-v16-ecf (ECF) is significantly closer to OSTIA (comparable with the experiments done before).

lgannoaa commented 2 years ago

This parallel is paused at the end of CDATE=2021120518 due to transfer job slowness.

emilyhcliu commented 2 years ago


@XuLi-NOAA Thanks for your NSST diagnostics. The results are similar to the test runs you did in June. Great!

emilyhcliu commented 2 years ago

On August 27, 2022, the gfs.v16.3.0 package used for the parallels was updated with the following changes. These changes do not impact assimilation results.

The AVHRR and VIIRS entries were added to the RadMon utility in the following files:
modified: util/Radiance_Monitor/nwprod/gdas_radmon/fix/gdas_radmon_satype.txt
modified: util/Radiance_Monitor/nwprod/gdas_radmon/fix/gdas_radmon_scaninfo.txt

These changes will go into gfsda.v16.3.0 along with Russ's bugzilla fixes. NCO is aware and is expecting the re-tag of gfsda.v16.3.0.

lgannoaa commented 2 years ago

The transfer speed improved over the weekend of 8/26-8/28 and the parallel resumed running. It is at CDATE=2021121106 as of 8:00a EST 8/29.

lgannoaa commented 2 years ago

Many archive jobs failed with a system issue: "Connection timed out". Reruns are in progress.

lgannoaa commented 2 years ago

Zombie jobs and system errors caused a few jobs to fail. Reruns are in progress.

lgannoaa commented 2 years ago

There were 33 archive jobs that failed overnight on Aug 30th due to the QOS priority of production jobs. Reruns are in progress.

lgannoaa commented 2 years ago

It looks like the HPSS speed improvement on WCOSS2 is solid now. Modified this parallel to write restart files to HPSS every day. This change is now in place effective CDATE=2021121600.

lgannoaa commented 2 years ago

We still see impacts during the night when production transfer jobs take higher priority. Some of our transfer jobs get cancelled by the HPSS system due to slow transfer speed. The HPSS helpdesk responded with an acknowledgement of the ticket. Therefore, the issue with failed transfer jobs is here (on Cactus) to stay.

emilyhcliu commented 2 years ago

@XuLi-NOAA It looks like the SH performs better than the NH. These plots should be posted in the issue for the real-time parallel.

XuLi-NOAA commented 2 years ago

@XuLi-NOAA It looks like the SH performs better than the NH. These plots should be posted in the issue for the real-time parallel.

It has been moved to #952 .

lgannoaa commented 2 years ago

Reran 35 archive jobs due to previously known system issues. Reran another 32 archive jobs due to previously known system issues.

lgannoaa commented 2 years ago

Management requested running a full cycle with the library updates for GFSv16.3.0. In preparation, the following modifications were planned:

As of the morning of Sep. 7th, the full cycle test is complete. One exception is the gempak job, which does not have canned data.
Management has decided to only update the bufr_ver module to 11.7.0. All other libraries remain the same as prior to this full cycle run. Therefore, on Sep. 7th, HOMEgfs was updated with this change and rebuilt. The current parallel has resumed at CDATE=2022010406.
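A minimal sketch of the kind of change this describes, assuming library versions are kept in the package's versions/*.ver files and the build is driven from sorc/build_all.sh; the file and script names are typical of the v16-era workflow layout but should be treated as assumptions here.

```bash
# Sketch only: bump only the BUFR library version and rebuild HOMEgfs so the
# executables link against bufr/11.7.0. File/script names are assumptions.
cd "${HOMEgfs}"
sed -i 's/^export bufr_ver=.*/export bufr_ver=11.7.0/' versions/*.ver
cd sorc
./build_all.sh 2>&1 | tee build_all.log
```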

lgannoaa commented 2 years ago

Management has decided to update the GSI and model packages. The GSI package is ready; the model package is still pending. This parallel is paused at CDATE=2022010700 to check out and build the GSI package.

lgannoaa commented 2 years ago

Due to the back-and-forth between using the full library updates, the bufr_ver-only update, and the GSI update, the CRTM version update was left out. The old CRTM 2.3.0 has now been updated to CRTM 2.4.0, and the GSI has been rebuilt with CRTM 2.4.0. This parallel is being rerun from 2022010600.
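A minimal sketch under the same assumptions as the earlier version-bump example: point the version files at CRTM 2.4.0 and rebuild the GSI so it links against crtm/2.4.0. The version-file layout and build script name are assumptions.

```bash
# Sketch only: update the CRTM version and rebuild the GSI against crtm/2.4.0.
cd "${HOMEgfs}"
sed -i 's/^export crtm_ver=.*/export crtm_ver=2.4.0/' versions/*.ver
cd sorc
./build_gsi.sh 2>&1 | tee build_gsi.log    # build script name assumed
```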

emilyhcliu commented 2 years ago

For the retrospective run, we will rewind 14 days and restart on the 2022010600 cycle.
With Lin's revised and improved global-workflow with ecflow, and the better HPSS transfer rate, rewinding the parallel run is not a setback. The most important thing is that we caught the issue, fixed it, and are moving forward.

lgannoaa commented 2 years ago

There was an emergency production switch on the morning of 9/21. There were 15 archive job, METplus job, and regular job failures due to the switch. Debug/rerun/recovery is in progress. Impacted jobs are in CDATE=2022020100, 2022020106, 2022020112.

The ecen job for 2022020112 failed. The debug effort traced it back and found that the previous cycle's (2022020106) job output was corrupted due to the production switch. Therefore, this parallel has been rewound two cycles and rerun from 2022020106.

The rerun from 2022020106 resolved the issue.

lgannoaa commented 2 years ago

NCO executed a production switch on 9/22. Cactus is now back to being the dev machine. This parallel will resume at CDATE=2022020206.

XuLi-NOAA commented 2 years ago

[Figures: RMS_BIAS_to_ostia_retro_v163_4exps_2021101600_2022013118 — Global, N. Pole, N. Mid, Tropics, S. Mid, and S. Pole panels]
Update on the NSST foundation temperature analysis performance monitoring in the GFSv16.3 retrospective run (retro1-v16-ecf). This is an extension of the figure reported 28 days ago, and five more areas are included this time in addition to Global: N. Pole, N. Mid, Tropics, S. Mid, and S. Pole. From the figures, we can see that the RMS has been improved across the whole period (about three and a half months). However, there is a worry: the bias is getting worse. In the global figure, the bias was improved at the beginning (about 10 days), then became even colder than the operational. From the smaller-area figures, we can see the issue occurs mainly in the Tropics and S. Mid areas. The NSST package had been tested, but never over this long a time period. At least, this is an alert.