NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0
66 stars 151 forks source link

fix for cris-fsr memory corruption (issue #755) #767

Closed jswhit closed 4 months ago

jswhit commented 4 months ago

one line change from @jderber-NOAA to fix memory corruption issue in correlated_obsmod.F90 for cris-fsr data found in reanalysis scout runs. Detailed description found in issue #755.

Resolves #755

Checklist

RussTreadon-NOAA commented 4 months ago

@ADCollard, @emilyhcliu , @xincjin-NOAA , or @azadeh-gh : If any of you have time, would you please review this simple one line change from @jderber-NOAA ? John fixed a bug in correlated_obsmod.F90 for cris-fsr data.

If none of you have time to review the change, any suggestions as to who we can ask?

emilyhcliu commented 4 months ago

It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.

https://github.com/NOAA-EMC/GSI/blob/529bb796bea0e490f186729cd168a91c034bb12d/src/gsi/correlated_obsmod.F90#L391-L425

RussTreadon-NOAA commented 4 months ago

WCOSS2 ctests Build jswhit:cris-fsr-fix at 7c63e068e on Dogwood. Run ctests against develop at 529bb796. All tests pass

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr767/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  726.99 sec
2/6 Test #6: global_enkf ......................   Passed  850.40 sec
3/6 Test #2: rtma .............................   Passed  977.23 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1155.09 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1274.74 sec
6/6 Test #1: global_4denvar ...................   Passed  1682.85 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1682.86 sec
xincjin-NOAA commented 4 months ago

It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.

https://github.com/NOAA-EMC/GSI/blob/529bb796bea0e490f186729cd168a91c034bb12d/src/gsi/correlated_obsmod.F90#L391-L425

Does this mean that the nch_active should be equal to coun?

jderber-NOAA commented 4 months ago

Emily, That is correct😁.  And the nch_active is less than the number of active channels from the satinfo file.  This is only true for the npp fsr data.  This is also the only case where the code will fail and the change is necessary.JohnSent from my Verizon, Samsung Galaxy smartphone -------- Original message --------From: emilyhcliu @.> Date: 7/8/24 11:03 AM (GMT-05:00) To: NOAA-EMC/GSI @.> Cc: jderber-NOAA @.>, Mention @.> Subject: Re: [NOAA-EMC/GSI] fix for cris-fsr memory corruption (issue #755) (PR #767) It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file. https://github.com/NOAA-EMC/GSI/blob/529bb796bea0e490f186729cd168a91c034bb12d/src/gsi/correlated_obsmod.F90#L391-L425

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

emilyhcliu commented 4 months ago

It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file. https://github.com/NOAA-EMC/GSI/blob/529bb796bea0e490f186729cd168a91c034bb12d/src/gsi/correlated_obsmod.F90#L391-L425

Does this mean that the nch_active should be equal to coun?

I think so!

jderber-NOAA commented 4 months ago

I think with this change coun and nch_active do not have to be equal. If coun < nch_active the code will delete those parts of the correlation matrix that use channels not in coun. If nch_active < coun then the code will add the additional channels found in coun into the covariance matrix using the variances (no correlated error) in the satinfo file (not positive where the variances come from).

The original code was set up to do this, just the one array was too small (when coun > nch_active) and created the overwriting of the memory.

RussTreadon-NOAA commented 4 months ago

@jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop.

jderber-NOAA commented 4 months ago

I am.Sent from my Verizon, Samsung Galaxy smartphone -------- Original message --------From: RussTreadon-NOAA @.> Date: 7/9/24 8:42 AM (GMT-05:00) To: NOAA-EMC/GSI @.> Cc: jderber-NOAA @.>, Mention @.> Subject: Re: [NOAA-EMC/GSI] fix for cris-fsr memory corruption (issue #755) (PR #767) @jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

emilyhcliu commented 4 months ago

@jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop.

@RussTreadon-NOAA Yes!!

RussTreadon-NOAA commented 4 months ago

Hera & Hercules ctests Install jswhit:cris-fsr-fix at 7c63e06 on Hera and Hercules. Run ctests with following results

Hera

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  614.03 sec
2/6 Test #6: global_enkf ......................   Passed  814.13 sec
3/6 Test #2: rtma .............................   Passed  972.84 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1107.29 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1222.67 sec
6/6 Test #1: global_4denvar ...................***Failed  2046.18 sec

50% tests passed, 3 tests failed out of 6

Total Test time (real) = 2046.22 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)

The rrfs_3denvar_rdasens failure is due to

The runtime for rrfs_3denvar_rdasens_loproc_updat is 157.287149 seconds.  This has exceeded maximum allowable threshold ti\
me of 129.578203 seconds, resulting in Failure time-thresh of the regression test.

A check of the gsi.x wall times confirms that the updat wall times exceed the contrl

rrfs_3denvar_rdasens_hiproc_contrl/stdout:The total amount of wall time                        = 86.790942
rrfs_3denvar_rdasens_hiproc_updat/stdout:The total amount of wall time                        = 108.434967
rrfs_3denvar_rdasens_loproc_contrl/stdout:The total amount of wall time                        = 86.385469
rrfs_3denvar_rdasens_loproc_updat/stdout:The total amount of wall time                        = 157.287149

Correlate observation error for radiances is not active in the rrfs_3denvar_rdasens test. The correction of the array allocation bug is the only difference between updat and contrl. Thus, the above timing difference is not due to the bug fix since this code is not exercised.

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 314.056492 seconds.  This has exceeded maximum allowable threshold time of 312.580009 seconds, resulting in Failure time-thresh of the regression test.

This is not a fatal fail.

The global_4denvar failed due to

The runtime for global_4denvar_hiproc_updat is 483.159129 seconds.  This has exceeded maximum allowable threshold time of 400.916883 seconds, resulting in Failure of timethresh2 the regression test.

This is not a fatal fail.

A rerun of the failed tests resulted in Passed for each test

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 1: global_4denvar
1/1 Test #1: global_4denvar ...................   Passed  1986.62 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1986.64 sec

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 3: rrfs_3denvar_rdasens
1/1 Test #3: rrfs_3denvar_rdasens .............   Passed  496.82 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 497.23 sec

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
    Start 4: hafs_4denvar_glbens
1/1 Test #4: hafs_4denvar_glbens ..............   Passed  1287.73 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1287.74 sec

Hercules

Test project /work/noaa/da/rtreadon/git/gsi/pr767/build/regression
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  486.98 sec
2/6 Test #6: global_enkf ......................***Failed  729.88 sec
3/6 Test #2: rtma .............................***Failed  965.90 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1153.19 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed  1218.99 sec
6/6 Test #1: global_4denvar ...................   Passed  1681.87 sec

33% tests passed, 4 tests failed out of 6

Total Test time (real) = 1681.88 sec

The following tests FAILED:
          2 - rtma (Failed)
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
          6 - global_enkf (Failed)

The rrfs_3denvar_rdasens failure is due to

The runtime for rrfs_3denvar_rdasens_hiproc_updat is 105.944369 seconds.  This has exceeded maximum allowable threshold time of 104.912751 seconds, resulting in Failure of timethresh2 the regression test.

This is not a fatal fail.

The global_enkf failure is due to

The runtime for global_enkf_loproc_updat is 160.504755 seconds.  This has exceeded maximum allowable threshold time of 157.186910 seconds, resulting in Failure timethresh of the regression test.

This is not a fatal fail.

The rtma failure is due to

The runtime for rtma_loproc_updat is 215.266764 seconds.  This has exceeded maximum allowable threshold time of 211.326822 seconds, resulting in Failure time-thresh of the regression test.

This is not a fatal fail.

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 352.022764 seconds.  This has exceeded maximum allowable threshold time of 292.013010 seconds, resulting in Failure time-thresh of the regression test.

This is not a fatal fail.

A rerun of the ctests on Hercules using /work2 as the run directory yields Passed for two of the four failed tests

Test project /work/noaa/da/rtreadon/git/gsi/pr767/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #6: global_enkf ......................   Passed  751.87 sec
2/6 Test #2: rtma .............................   Passed  966.78 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  1098.22 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed  1102.22 sec
5/6 Test #1: global_4denvar ...................   Passed  2222.37 sec

hafs_4denvar_glbens fails due to a timing and reproducibility checks. This is interesting. The previous run of this test in /work only failed due to timing checks. The analysis results were identical. GSI issue #697 discusses non-reproducible results from regional ctests on Hercules. Today's test indicates that this problem remains.

rrfs_3denvar_rdasens did not finish due to the intermittent gsi.x hang reported in GSI issue #766.