NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

add safeguard to thompson_reff #779

Closed RussTreadon-NOAA closed 3 months ago

RussTreadon-NOAA commented 3 months ago

Description

This PR adds safeguards to subroutine thompson_reff to ensure that the ice and rain number concentrations, ni and nr respectively, are greater than zero. With this additional check, the global_4denvar ctest runs to completion using the debug gsi.x.

An additional change is to remove an extraneous debug print identified by @wx20jjung.

Resolves #777

How Has This Been Tested?

Build the debug gsi.x and run the global_4denvar ctest. The test runs to completion.

RussTreadon-NOAA commented 3 months ago

@azadeh-gh and @emilyhcliu : I understand that you are testing the proposed changes to ensure minimal impact on the analysis. If you find that the changes in this PR are insufficient or need revision we can either abandon this PR or I can add your changes to this PR.

RussTreadon-NOAA commented 3 months ago

WCOSS2 ctests

Install RussTreadon-NOAA/feature/thompson_reff at 408917ec on Cactus. Install develop at e82365d9. Run ctests with the following results.

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/thompson/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  849.08 sec
2/6 Test #6: global_enkf ......................   Passed  886.57 sec
3/6 Test #2: rtma .............................   Passed  993.37 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1351.57 sec
5/6 Test #5: hafs_3denvar_hybens ..............   Passed  1352.26 sec
6/6 Test #1: global_4denvar ...................***Failed  1707.83 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 1707.91 sec

The following tests FAILED:
          1 - global_4denvar (Failed)

The global_4denvar failure is expected.

The results (penalty) between the two runs are nonreproducible,
thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

The change to crtm_interface.f90 in feature/thompson_reff alters the effective radius calculation for cloud ice and rain. This change is not in the contrl (develop). Given the change in the effective radius, the updat and contrl gsi.x generate different analyses.

emilyhcliu commented 3 months ago

@RussTreadon-NOAA The safeguards you added are entirely reasonable. The existing code only checked qx > 0 before the calculation, but for Thompson, checks on nr and ni should be added as well.

With the safeguards added, the global_4denvar failure due to non-reproducible results is expected. The overall impact of the safeguards should be small.

RussTreadon-NOAA commented 3 months ago

Thank you @emilyhcliu for the review and approval.

RussTreadon-NOAA commented 3 months ago

WCOSS2 debug ctests

Repeat the above WCOSS2 ctests on Cactus, but compile feature/thompson_reff and develop in debug mode. Run the global_4denvar ctest with the following results.

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/thompson/build
    Start 1: global_4denvar
1/1 Test #1: global_4denvar ...................***Failed  23576.70 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 23576.80 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
Errors while running CTest

The failure is due to the contrl (develop) debug gsi.x aborting with the following traceback

Image              PC                Routine            Line        Source
gsi.x              0000000007F31F4B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014C75BA848C0  Unknown               Unknown  Unknown
libimf.so          000014C75BB8AAAF  __libm_log_l9         Unknown  Unknown
gsi.x              00000000008853DC  crtm_interface_mp        2773  crtm_interface.f90
gsi.x              000000000078BEBD  crtm_interface_mp        1881  crtm_interface.f90
gsi.x              0000000005612D45  rad_setup_mp_setu         919  setuprad.f90
gsi.x              000000000400CE99  gsi_radoper_mp_se         100  gsi_radOper.F90
gsi.x              0000000002673C76  setuprhsall_              492  setuprhsall.f90
gsi.x              0000000003F6C9F2  glbsoi_                   323  glbsoi.f90
gsi.x              00000000010A56D0  gsisub_                   200  gsisub.F90
gsi.x              000000000042CBB5  gsimod_mp_gsimain        2431  gsimod.F90
gsi.x              0000000000413B3B  MAIN__                    633  gsimain.f90

Line 2773 of crtm_interface.f90 is the lam_i line mentioned in issue #777

        if (qx > qmin) then
           lam_i=exp(1.0_r_kind / 3.0_r_kind * log((am_i*ni(k) *gamma(mu_i + 3.0_r_kind + 1.0_r_kind))/(qx*gamma(mu_i+1.0_r_kind))))
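
Reading the expression on that line as a formula may help show why a zero number concentration is fatal. This is only a transcription of the code above, not something taken from the Thompson scheme documentation. The symbols N_i, q_x, a_{m,i} and \mu_i are stand-ins for ni(k), qx, am_i and mu_i.

```latex
\lambda_i
  = \exp\!\left[\tfrac{1}{3}\,
      \ln\!\left(\frac{a_{m,i}\,N_i\,\Gamma(\mu_i + 4)}{q_x\,\Gamma(\mu_i + 1)}\right)\right]
  = \left(\frac{a_{m,i}\,N_i\,\Gamma(\mu_i + 4)}{q_x\,\Gamma(\mu_i + 1)}\right)^{1/3},
  \qquad N_i > 0 .
```

The qx > qmin test keeps the denominator positive (assuming qmin is non-negative), but nothing on this line constrains N_i. With ni(k) = 0 the argument of the logarithm is exactly zero, so log is evaluated at 0, which is consistent with the __libm_log_l9 frame at the top of the traceback.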

In contrast, the updat debug gsi.x ran to completion for both the loproc and hiproc configurations.

russ.treadon@clogin02:/lfs/h2/emc/ptmp/russ.treadon/thompson/tmpreg_global_4denvar> grep wall */stdout
global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 5336.354999
global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 11028.922376

The feature/thompson_reff crtm_interface.f90 ensures the cloud ice and rain number concentrations, ni and nr respectively, are greater than zero before entering the lam_i and lam_r blocks.

emilyhcliu commented 3 months ago

@RussTreadon-NOAA @azadeh-gh would like to add some comments here.

RussTreadon-NOAA commented 3 months ago

Thank you, @emilyhcliu, for the heads up. @azadeh-gh please feel free to add comments here. I do not plan on merging this PR into develop until Monday, 8/12/2024.

azadeh-gh commented 3 months ago

@RussTreadon-NOAA Thank you, Russ. I found a minimum threshold of 1.0e-6_r_kind for ni and nr in subroutine calc_effectRad in ccpp-physics. I think it would be better to change the threshold from 0 to 1.0e-6_r_kind to be consistent with the model physics.

RussTreadon-NOAA commented 3 months ago

@azadeh-gh , your suggestion has been committed to feature/thompson_reff. Done at 9a3a90d. If the modification is satisfactory, please approve this PR.

azadeh-gh commented 3 months ago

@RussTreadon-NOAA Thank you!

RussTreadon-NOAA commented 3 months ago

Thank you @azadeh-gh for the quick action. As a final check I will rerun the global_4denvar ctest using the optimized and debug gsi.x on Cactus to ensure the previous ctest results remain valid. I still hope to merge this PR into develop on Monday, 8/12/2024.

RussTreadon-NOAA commented 3 months ago

WCOSS2 tests

Build RussTreadon-NOAA:feature/thompson_reff at 9a3a90d and develop at e82365d on Cactus.

The optimized build yields the following ctest results

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/thompson/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  728.11 sec
2/6 Test #6: global_enkf ......................   Passed  850.39 sec
3/6 Test #2: rtma .............................   Passed  968.95 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1152.72 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1213.02 sec
6/6 Test #1: global_4denvar ...................***Failed  1683.10 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 1683.12 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
Errors while running CTest

The global_4denvar failure is due to non-reproducible results.

The results (penalty) between the two runs are nonreproducible,
thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

Different analysis results are expected. This PR adds safeguards to the effective radius calculation in crtm_interface.f90 which screen out points with cloud ice and rain number concentrations less than the ccpp-physics minimum of 1.0e-6. This change is not in develop.

Rebuild gsi.x in debug mode and run global_4denvar ctest. The feature/thompson_reff debug gsi.x ran to completion in the loproc and hiproc configurations.

russ.treadon@clogin07:/lfs/h2/emc/ptmp/russ.treadon/thompson_debug/tmpreg_global_4denvar> grep wall */stdout
global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 5414.495874
global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 10779.418185

The develop debug gsi.x aborted on line 2773 of crtm_interface.f90.

Image              PC                Routine            Line        Source
gsi.x              0000000007F31F4B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014DE64D8B8C0  Unknown               Unknown  Unknown
libimf.so          000014DE64E91AAF  __libm_log_l9         Unknown  Unknown
gsi.x              00000000008853DC  crtm_interface_mp        2773  crtm_interface.f90
gsi.x              000000000078BEBD  crtm_interface_mp        1881  crtm_interface.f90
gsi.x              0000000005612D45  rad_setup_mp_setu         919  setuprad.f90
gsi.x              000000000400CE99  gsi_radoper_mp_se         100  gsi_radOper.F90
gsi.x              0000000002673C76  setuprhsall_              492  setuprhsall.f90
gsi.x              0000000003F6C9F2  glbsoi_                   323  glbsoi.f90
gsi.x              00000000010A56D0  gsisub_                   200  gsisub.F90
gsi.x              000000000042CBB5  gsimod_mp_gsimain        2431  gsimod.F90
gsi.x              0000000000413B3B  MAIN__                    633  gsimain.f90
gsi.x              0000000000413992  Unknown               Unknown  Unknown
libc-2.31.so       000014DE64A6324D  __libc_start_main     Unknown  Unknown
gsi.x              00000000004138AA  Unknown               Unknown  Unknown
nid001356.cactus.wcoss2.ncep.noaa.gov: rank 46 died from signal 6 and dumped core

The cloud ice number concentration can be 0.0. This results in log(0), an invalid operation in the develop debug gsi.x. This PR resolves this problem via the additional safeguards added to crtm_interface.f90.
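
For readers who want to see the shape of the safeguard without opening crtm_interface.f90, the following is a minimal standalone sketch and not the code committed in this PR. It assumes the guard is a simple comparison of ni(k) against the 1.0e-6 threshold alongside the existing mixing-ratio test. The constants rho_i, mu_i and qmin, the three-level test column, and the "skipped" branch are all hypothetical values chosen for illustration.

```fortran
program guarded_lam_i
  ! Illustrative sketch only; not the GSI source.
  use iso_fortran_env, only: real64
  implicit none

  integer, parameter :: r_kind = real64

  ! Hypothetical constants for illustration; check crtm_interface.f90 for the real ones.
  real(r_kind), parameter :: pi    = 3.141592653589793_r_kind
  real(r_kind), parameter :: rho_i = 890.0_r_kind           ! assumed cloud ice density
  real(r_kind), parameter :: am_i  = pi * rho_i / 6.0_r_kind
  real(r_kind), parameter :: mu_i  = 0.0_r_kind
  real(r_kind), parameter :: qmin  = 1.0e-12_r_kind         ! assumed mixing-ratio floor
  real(r_kind), parameter :: nmin  = 1.0e-6_r_kind          ! threshold discussed in this PR

  integer, parameter :: nsig = 3
  real(r_kind) :: qi(nsig), ni(nsig), arg, lam_i
  integer :: k

  ! Made-up column: level 2 has cloud ice mass but zero number concentration,
  ! the case that leads to log(0) when only the mixing ratio is checked.
  qi = [1.0e-5_r_kind, 1.0e-5_r_kind, 0.0_r_kind]
  ni = [1.0e4_r_kind,  0.0_r_kind,    0.0_r_kind]

  do k = 1, nsig
     if (qi(k) > qmin .and. ni(k) > nmin) then
        ! Both numerator and denominator are positive here, so log is safe.
        arg   = (am_i * ni(k) * gamma(mu_i + 4.0_r_kind)) / &
                (qi(k) * gamma(mu_i + 1.0_r_kind))
        lam_i = exp(log(arg) / 3.0_r_kind)
        print '(a,i0,a,es12.5)', 'level ', k, ': lam_i = ', lam_i
     else
        ! Hypothetical fallback: this sketch simply skips the level.
        print '(a,i0,a)', 'level ', k, ': skipped (qi or ni below threshold)'
     end if
  end do
end program guarded_lam_i
```

With the number-concentration test removed from the if statement, level 2 of this column would reach log with an argument of zero, which is the situation the develop debug gsi.x hit.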