Closed jswhit closed 4 months ago
@ADCollard, @emilyhcliu , @xincjin-NOAA , or @azadeh-gh : If any of you have time, would you please review this simple one line change from @jderber-NOAA ? John fixed a bug in correlated_obsmod.F90
for cris-fsr data.
If none of you have time to review the change, any suggestions as to who we can ask?
It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.
WCOSS2 ctests
Build jswhit:cris-fsr-fix
at 7c63e068e on Dogwood. Run ctests against develop
at 529bb796. All tests pass
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr767/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 726.99 sec
2/6 Test #6: global_enkf ...................... Passed 850.40 sec
3/6 Test #2: rtma ............................. Passed 977.23 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1155.09 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1274.74 sec
6/6 Test #1: global_4denvar ................... Passed 1682.85 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1682.86 sec
It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file.
Does this mean that the nch_active should be equal to coun?
Emily, That is correct😁. And the nch_active is less than the number of active channels from the satinfo file. This is only true for the npp fsr data. This is also the only case where the code will fail and the change is necessary.JohnSent from my Verizon, Samsung Galaxy smartphone -------- Original message --------From: emilyhcliu @.> Date: 7/8/24 11:03 AM (GMT-05:00) To: NOAA-EMC/GSI @.> Cc: jderber-NOAA @.>, Mention @.> Subject: Re: [NOAA-EMC/GSI] fix for cris-fsr memory corruption (issue #755) (PR #767) It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file. https://github.com/NOAA-EMC/GSI/blob/529bb796bea0e490f186729cd168a91c034bb12d/src/gsi/correlated_obsmod.F90#L391-L425
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
It seems that the number of active channels (nch_active) stored in the correlated obs file is not consistent with the number of active channels (counts) from the sat info file. https://github.com/NOAA-EMC/GSI/blob/529bb796bea0e490f186729cd168a91c034bb12d/src/gsi/correlated_obsmod.F90#L391-L425
Does this mean that the nch_active should be equal to coun?
I think so!
I think with this change coun and nch_active do not have to be equal. If coun < nch_active the code will delete those parts of the correlation matrix that use channels not in coun. If nch_active < coun then the code will add the additional channels found in coun into the covariance matrix using the variances (no correlated error) in the satinfo file (not positive where the variances come from).
The original code was set up to do this, just the one array was too small (when coun > nch_active) and created the overwriting of the memory.
@jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop
.
I am.Sent from my Verizon, Samsung Galaxy smartphone -------- Original message --------From: RussTreadon-NOAA @.> Date: 7/9/24 8:42 AM (GMT-05:00) To: NOAA-EMC/GSI @.> Cc: jderber-NOAA @.>, Mention @.> Subject: Re: [NOAA-EMC/GSI] fix for cris-fsr memory corruption (issue #755) (PR #767) @jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into develop.
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
@jderber-NOAA and @emilyhcliu : Are we good with the changes in this PR? If "yes", I will work with the GSI Review Committee to merge this PR into
develop
.
@RussTreadon-NOAA Yes!!
Hera & Hercules ctests
Install jswhit:cris-fsr-fix
at 7c63e06
on Hera and Hercules. Run ctests with following results
Hera
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed 614.03 sec
2/6 Test #6: global_enkf ...................... Passed 814.13 sec
3/6 Test #2: rtma ............................. Passed 972.84 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1107.29 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1222.67 sec
6/6 Test #1: global_4denvar ...................***Failed 2046.18 sec
50% tests passed, 3 tests failed out of 6
Total Test time (real) = 2046.22 sec
The following tests FAILED:
1 - global_4denvar (Failed)
3 - rrfs_3denvar_rdasens (Failed)
4 - hafs_4denvar_glbens (Failed)
The rrfs_3denvar_rdasens
failure is due to
The runtime for rrfs_3denvar_rdasens_loproc_updat is 157.287149 seconds. This has exceeded maximum allowable threshold ti\
me of 129.578203 seconds, resulting in Failure time-thresh of the regression test.
A check of the gsi.x
wall times confirms that the updat wall times exceed the contrl
rrfs_3denvar_rdasens_hiproc_contrl/stdout:The total amount of wall time = 86.790942
rrfs_3denvar_rdasens_hiproc_updat/stdout:The total amount of wall time = 108.434967
rrfs_3denvar_rdasens_loproc_contrl/stdout:The total amount of wall time = 86.385469
rrfs_3denvar_rdasens_loproc_updat/stdout:The total amount of wall time = 157.287149
Correlate observation error for radiances is not active in the rrfs_3denvar_rdasens
test. The correction of the array allocation bug is the only difference between updat and contrl. Thus, the above timing difference is not due to the bug fix since this code is not exercised.
The hafs_4denvar_glbens
failure is due to
The runtime for hafs_4denvar_glbens_loproc_updat is 314.056492 seconds. This has exceeded maximum allowable threshold time of 312.580009 seconds, resulting in Failure time-thresh of the regression test.
This is not a fatal fail.
The global_4denvar
failed due to
The runtime for global_4denvar_hiproc_updat is 483.159129 seconds. This has exceeded maximum allowable threshold time of 400.916883 seconds, resulting in Failure of timethresh2 the regression test.
This is not a fatal fail.
A rerun of the failed tests resulted in Passed for each test
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
Start 1: global_4denvar
1/1 Test #1: global_4denvar ................... Passed 1986.62 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1986.64 sec
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
Start 3: rrfs_3denvar_rdasens
1/1 Test #3: rrfs_3denvar_rdasens ............. Passed 496.82 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 497.23 sec
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr767/build
Start 4: hafs_4denvar_glbens
1/1 Test #4: hafs_4denvar_glbens .............. Passed 1287.73 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1287.74 sec
Hercules
Test project /work/noaa/da/rtreadon/git/gsi/pr767/build/regression
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed 486.98 sec
2/6 Test #6: global_enkf ......................***Failed 729.88 sec
3/6 Test #2: rtma .............................***Failed 965.90 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1153.19 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1218.99 sec
6/6 Test #1: global_4denvar ................... Passed 1681.87 sec
33% tests passed, 4 tests failed out of 6
Total Test time (real) = 1681.88 sec
The following tests FAILED:
2 - rtma (Failed)
3 - rrfs_3denvar_rdasens (Failed)
4 - hafs_4denvar_glbens (Failed)
6 - global_enkf (Failed)
The rrfs_3denvar_rdasens
failure is due to
The runtime for rrfs_3denvar_rdasens_hiproc_updat is 105.944369 seconds. This has exceeded maximum allowable threshold time of 104.912751 seconds, resulting in Failure of timethresh2 the regression test.
This is not a fatal fail.
The global_enkf
failure is due to
The runtime for global_enkf_loproc_updat is 160.504755 seconds. This has exceeded maximum allowable threshold time of 157.186910 seconds, resulting in Failure timethresh of the regression test.
This is not a fatal fail.
The rtma
failure is due to
The runtime for rtma_loproc_updat is 215.266764 seconds. This has exceeded maximum allowable threshold time of 211.326822 seconds, resulting in Failure time-thresh of the regression test.
This is not a fatal fail.
The hafs_4denvar_glbens
failure is due to
The runtime for hafs_4denvar_glbens_loproc_updat is 352.022764 seconds. This has exceeded maximum allowable threshold time of 292.013010 seconds, resulting in Failure time-thresh of the regression test.
This is not a fatal fail.
A rerun of the ctests on Hercules using /work2
as the run directory yields Passed for two of the four failed tests
Test project /work/noaa/da/rtreadon/git/gsi/pr767/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #6: global_enkf ...................... Passed 751.87 sec
2/6 Test #2: rtma ............................. Passed 966.78 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1098.22 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed 1102.22 sec
5/6 Test #1: global_4denvar ................... Passed 2222.37 sec
hafs_4denvar_glbens
fails due to a timing and reproducibility checks. This is interesting. The previous run of this test in /work
only failed due to timing checks. The analysis results were identical. GSI issue #697 discusses non-reproducible results from regional ctests on Hercules. Today's test indicates that this problem remains.
rrfs_3denvar_rdasens
did not finish due to the intermittent gsi.x
hang reported in GSI issue #766.
one line change from @jderber-NOAA to fix memory corruption issue in correlated_obsmod.F90 for cris-fsr data found in reanalysis scout runs. Detailed description found in issue #755.
Resolves #755
Checklist