NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

Merge master into release/gfsda.v16.x #237

Closed: RussTreadon-NOAA closed this issue 2 years ago

RussTreadon-NOAA commented 2 years ago

Branch release/gfsda.v16.x does not contain several important updates that are in the master. This issue is opened to document the merge of the authoritative master into the authoritative release/gfsda.v16.x.

RussTreadon-NOAA commented 2 years ago

The merge of the master into release/gfsda.v16.x will be done in @RussTreadon-NOAA's forked copy of release/gfsda.v16.x. At the start of this work, the authoritative release/gfsda.v16.x and the forked copy are both at 3898ab4. The authoritative master is at 2f28fbf.

RussTreadon-NOAA commented 2 years ago

Comments regarding merge of master into release/gfsda.v16.x

git merge of the master at 2f28fbf into the forked release/gfsda.v16.x at 3898ab4 yielded 42 conflicts.

After resolving the 42 conflicts, the working copy of the merged gfsda.v16.x, the authoritative gfsda.v16.x, and the authoritative master were built on Venus. The 2021101900 gdas case was run with each executable via a standalone rungsi script.

The J table at the start of the first outer loop shows that the merged gfsda.v16.x produces J terms comparable to those of the authoritative gfsda.v16.x. Listed below is the total J Global for each executable for this case

The first 13 digits of J Global are identical. Differences in the last few digits of penalty terms were observed when hpc-stack was merged into the master. The authoritative gfsda.v16.x is not built with hpc-stack. The merged gfsda.v16.x is built with hpc-stack.
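A statement like "the first 13 digits are identical" can be checked mechanically. The snippet below is a minimal sketch, not part of the GSI tooling: the function name is hypothetical, and the two values in the usage note are made-up placeholders, not the J Global values from this case. It assumes the penalties are copied as text in `E` exponent notation (a Fortran `D` exponent would need to be replaced with `E` first).

```python
from decimal import Decimal


def matching_leading_digits(a: str, b: str) -> int:
    """Count the leading significant digits shared by two printed numbers.

    Hypothetical helper for comparing penalty values copied from two
    GSI stdout files. Inputs are strings so no precision is lost.
    """
    da, db = Decimal(a), Decimal(b)
    # Different sign or different order of magnitude: nothing in common.
    if da.is_signed() != db.is_signed() or da.adjusted() != db.adjusted():
        return 0
    count = 0
    for x, y in zip(da.as_tuple().digits, db.as_tuple().digits):
        if x != y:
            break
        count += 1
    return count
```

For example, `matching_leading_digits("8.123456789012345E+05", "8.123456789012987E+05")` returns 13 for these placeholder values. Comparing the printed strings rather than floats avoids any rounding introduced by parsing to double precision.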

The J table at the start of the first outer loop shows that the merged gfsda.v16.x produces J terms identical to those from the master except for ozone and radiances.

Haixia explained the differences in the ozone penalty. The authoritative gfsda.v16.x contains an additional qc check for ompstc8:

 !       Check scan position errors in ompstc8
         if(obstype == "ompstc8") then
           if(data(ifovn,i) .eq. 1 .or. data(ifovn,i) .eq. 2 .or. &
              data(ifovn,i) .eq. 3 .or. data(ifovn,i) .eq. 4 .or. &
              data(ifovn,i) .eq. 35) then
             if(abs(data(ilate,i)) > 50.)then
               luse(i) = .false.
             endif
           endif
         endif

This qc is absent from the master. A check of fort.206 confirms that the merged gfsda.v16.x and master ozone penalties only differ for ompstc8 when using the same global_ozinfo.txt.
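For readers less familiar with the Fortran, the ompstc8 check quoted above can be restated as a small sketch. This is only an illustration of the logic as read from that snippet, with hypothetical names (`FLAGGED_SCAN_POSITIONS`, `ompstc8_use_flag`); the Fortran in release/gfsda.v16.x remains the authoritative version.

```python
# Scan positions singled out by the quoted Fortran check.
FLAGGED_SCAN_POSITIONS = {1, 2, 3, 4, 35}
# Latitude threshold (degrees) used in the quoted check.
LAT_LIMIT_DEG = 50.0


def ompstc8_use_flag(obstype, scan_position, latitude_deg, luse=True):
    """Return the use flag after applying the ompstc8 scan position check.

    An ompstc8 observation at a flagged scan position and poleward of
    50 degrees is marked as not used. All other observations keep the
    incoming flag unchanged.
    """
    if (
        obstype == "ompstc8"
        and scan_position in FLAGGED_SCAN_POSITIONS
        and abs(latitude_deg) > LAT_LIMIT_DEG
    ):
        return False
    return luse
```

Under this reading, `ompstc8_use_flag("ompstc8", 35, -62.0)` gives `False`, while the same position at latitude 10.0 keeps the flag. Observations rejected this way do not contribute to the ozone penalty, which is consistent with the ompstc8-only difference reported in fort.206.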

Differences in the radiances are due to different Rcov files being used by the master and the merged gfsda.v16.x. The merged gfsda.v16.x uses the Rcov files documented in #233. These Rcov files are associated with release/gfsda.v16.1.5, with changes noted in #233. The master Rcov files pre-date release/gfsda.v16.1.5. Comparison of the two fort.207 files shows them to be identical apart from metop-b_iasi and n20_cris-fsr when using the same global_satinfo.txt.

One item of note is that the authoritative master contains optimizations not present in the authoritative gfsda.v16.x. The merged gfsda.v16.x contains these optimizations. The wall time reduction from the optimizations is sizeable.

Given the above results, the working copy of the merged gfsda.v16.x will be committed to @RussTreadon-NOAA's forked copy of release/gfsda.v16.x.

RussTreadon-NOAA commented 2 years ago

Regression testing

Build authoritative master at 2f28fbf and forked release/gfsda.v16.x at 81ff90c on WCOSS_D (Mars). Run standard suite of regression tests with results shown below:

[emc.glopara@m71a3 build]$ ctest -j 19
Test project /gpfs/dell2/emc/modeling/noscrub/emc.glopara/git/gsi/master/build
      Start  1: global_T62
      Start  2: global_T62_ozonly
      Start  3: global_4dvar_T62
      Start  4: global_4denvar_T126
      Start  5: global_fv3_4denvar_T126
      Start  6: global_fv3_4denvar_C192
      Start  7: global_lanczos_T62
      Start  8: arw_netcdf
      Start  9: arw_binary
      Start 10: nmm_binary
      Start 11: nmm_netcdf
      Start 12: nmmb_nems_4denvar
      Start 13: hwrf_nmm_d2
      Start 14: hwrf_nmm_d3
      Start 15: rtma
      Start 16: global_enkf_T62
      Start 17: netcdf_fv3_regional
      Start 18: global_C96_fv3aero
      Start 19: global_C96_fv3aerorad
 1/19 Test  #8: arw_netcdf .......................   Passed  244.28 sec
 2/19 Test  #2: global_T62_ozonly ................   Passed  364.72 sec
 3/19 Test #18: global_C96_fv3aero ...............   Passed  366.80 sec
 4/19 Test #17: netcdf_fv3_regional ..............   Passed  484.43 sec
 5/19 Test #11: nmm_netcdf .......................   Passed  484.54 sec
 6/19 Test  #9: arw_binary .......................   Passed  484.64 sec
 7/19 Test #16: global_enkf_T62 ..................   Passed  727.60 sec
 8/19 Test #13: hwrf_nmm_d2 ......................   Passed  849.94 sec
 9/19 Test #10: nmm_binary .......................   Passed  853.87 sec
10/19 Test #14: hwrf_nmm_d3 ......................   Passed  854.93 sec
11/19 Test  #3: global_4dvar_T62 .................   Passed  1204.33 sec
12/19 Test #15: rtma .............................   Passed  1451.49 sec
13/19 Test #12: nmmb_nems_4denvar ................   Passed  1477.54 sec
14/19 Test  #7: global_lanczos_T62 ...............   Passed  1924.02 sec
15/19 Test  #4: global_4denvar_T126 ..............   Passed  2284.32 sec
16/19 Test  #1: global_T62 .......................   Passed  3244.07 sec
17/19 Test  #5: global_fv3_4denvar_T126 ..........***Failed  3365.40 sec
18/19 Test  #6: global_fv3_4denvar_C192 ..........   Passed  3524.82 sec
19/19 Test #19: global_C96_fv3aerorad ............***Failed  4206.32 sec

89% tests passed, 2 tests failed out of 19

Total Test time (real) = 4206.38 sec

The following tests FAILED:
          5 - global_fv3_4denvar_T126 (Failed)
         19 - global_C96_fv3aerorad (Failed)
Errors while running CTest

Check regression_results.txt for failed tests.
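When the ctest log is long, the failed tests can be pulled out with a short script. The following is a minimal sketch that assumes result lines have the layout shown in the output above; the function names are hypothetical, and statuses made of more than one word (not seen in this log) would not match the pattern.

```python
import re

# Matches result lines such as
# " 1/19 Test  #8: arw_netcdf .......   Passed  244.28 sec"
RESULT_LINE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#(\d+):\s+(\S+)\s+\.+\s*(?:\*\*\*)?(\w+)\s+([\d.]+)\s+sec"
)


def parse_ctest_results(text):
    """Return (test number, name, status, seconds) for each result line."""
    results = []
    for line in text.splitlines():
        m = RESULT_LINE.match(line)
        if m:
            results.append(
                (int(m.group(1)), m.group(2), m.group(3), float(m.group(4)))
            )
    return results


def failed_tests(results):
    """Names of tests whose status is anything other than Passed."""
    return [name for _, name, status, _ in results if status != "Passed"]


# Four lines copied from the ctest output above.
SAMPLE = """\
 1/19 Test  #8: arw_netcdf .......................   Passed  244.28 sec
17/19 Test  #5: global_fv3_4denvar_T126 ..........***Failed  3365.40 sec
18/19 Test  #6: global_fv3_4denvar_C192 ..........   Passed  3524.82 sec
19/19 Test #19: global_C96_fv3aerorad ............***Failed  4206.32 sec
"""

if __name__ == "__main__":
    for name in failed_tests(parse_ctest_results(SAMPLE)):
        print(name)
```

Run on the sample lines, this lists global_fv3_4denvar_T126 and global_C96_fv3aerorad, the same two tests that ctest reports as failed.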

global_C96_fv3aerorad failed because the job wall time exceeded the specified limit of 1200 seconds:

The runtime for global_C96_fv3aerorad_loproc_updat is 1277.273419 seconds.  This has exceeded maximum allowable operational time of 1200 seconds,
resulting in Failure of max-time in the regression test.

The loproc_contrl (master) test also exceeded the 1200 second limit, with a wall time of 1278.910497 seconds. This is not a fatal failure.

global_fv3_4denvar_T126 failed with non-reproducible results between the two global_gsi.x executables:

The results between the two runs are nonreproducible,
thus the regression test has Failed on cost for global_fv3_4denvar_T126_loproc_updat and global_fv3_4denvar_T126_loproc_contrl analyses.

The differences are real and explainable.

release/gfsda.v16.x includes the correlated error changes described in issue #233. This set of changes is not in the authoritative master. Comparison of the contrl and updat initial penalties shows all penalty terms to be identical for the 17 printed digits except radiances. A diff of the contrl and updat fort.207 files shows differences limited to metop-b iasi and n20 cris-fsr, the satellite/sensors for which correlated error is applied.
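The kind of check described above, confirming that a diff of two diagnostic listings only touches certain sensors, can be sketched as follows. This is a hypothetical helper, not an existing GSI utility. It assumes each line of the listing carries the sensor name as plain text; the sample lines in the usage note are invented for illustration and do not reproduce the real fort.207 layout.

```python
import difflib


def changed_lines(contrl, updat):
    """Return the lines that appear in only one of the two listings."""
    changed = []
    for line in difflib.ndiff(contrl, updat):
        # ndiff prefixes: "- " only in contrl, "+ " only in updat,
        # "  " common to both, "? " intraline hints (ignored here).
        if line.startswith(("- ", "+ ")):
            changed.append(line[2:])
    return changed


def labels_in_changes(changed, labels):
    """Return the sorted labels that occur in at least one changed line."""
    return sorted(
        {label for label in labels for line in changed if label in line}
    )
```

For instance, with `contrl = ["rad n20 atms 1.0", "rad metop-b iasi 2.0"]` and `updat = ["rad n20 atms 1.0", "rad metop-b iasi 2.5"]`, calling `labels_in_changes(changed_lines(contrl, updat), ["atms", "iasi", "cris-fsr"])` returns `["iasi"]`. Any label outside the expected set would indicate a difference not explained by the correlated error changes.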

It should be noted that the loproc and hiproc runs for the contrl are reproducible. The same is true for the loproc and hiproc runs of the updat.