NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0
66 stars 150 forks source link

Iasi debug fix #790

Closed wx20jjung closed 2 months ago

wx20jjung commented 2 months ago

Description

When the GSI is built in debug mode, the code failed when the read_iasi routine called combine_radobs. In tracking down this problem several other minor problems were discovered and fixed.

Resolves #789

The first group of commits are some write format changes I missed in the output of the runtime directory. I also found where the maximum channel number was set to 3000. I changed this array to be dynamic and allocated for each instrument.

The second group of commits resolves 2 errors I found while trying to fix the debug issue. The first was an error some of the IASI temperature values were "NAN"s. I traced this back to some of the cscale values (a scaling factor for IASI radiances, in the BUFR file), were missing values. The missing cscale values are associated with the shortwave side of the water vapor region and the shortwave channels. None of these channels are currently used. These were only found in the direct broadcast data. I did not find an instance where the operational data had missing values. The second error was an integer overflow problem with the variable nread being passed into combine radobs. Nread is used in most satellite instrument reads but the main failure was from read_iasing (to be added later). Nread is a counter and is ultimately number_of_profiles number_of_channels. This number exceeded the memory for an integer and was ultimately a negative number. In combineradobs, nread is compared to the total number of elements of the thinning box or itxmax to the number of elements of the task thinning box. In this case, the task thinning box was negative. I added logic to the various read routines to pass the number of profiles kept on each task. This makes nread and itxmax consistent. This change caused the total counts to be different for some instruments. Here is an example from the gsistat file for IASI. previous: o-g 01 rad metop-b iasi 331026696 6148957 690460 0.24889E+06 0.24889E+06 0.36047 0.36047 new: o-g 01 rad metop-b iasi 44632896 6148957 690460 0.24889E+06 0.24889E+06 0.36047 0.36047 The total counts after thinning and total counts used are identical along with the other statistics.

The actual failure of read_iasi in debug mode came down to putting an if() cycle in a different place. Read_iasi is now consistent with read_cris.

Type of change

How Has This Been Tested?

The main testing and cycling experiments were conducted on S4 at C192 resolution. After a single cycle spinup I setup a "control" and "experiment" and run both independently changing the GSIEXEC variable to point to the gsi master branch (control) and the IASI_debug_fix branch (experiment). The control and experiment were run through 4 cycle. Static tests were also conducted on Jet.

The ctests on Hera all passed: [Jim.Jung@hfe10 build]$ ctest -j 6 Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/save/ctests/update/build Start 1: global_4denvar Start 6: global_enkf Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 492.94 sec 2/6 Test #2: rtma ............................. Passed 970.72 sec 3/6 Test #5: hafs_3denvar_hybens .............. Passed 1043.42 sec 4/6 Test #6: global_enkf ...................... Passed 1118.55 sec 5/6 Test #4: hafs_4denvar_glbens .............. Passed 1159.69 sec 6/6 Test #1: global_4denvar ................... Passed 1909.17 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1909.21 sec

The rrfs ctest timed out on jet, all others passed. Test project /lfs5/HFIP/hfv3gfs/Jim.Jung/noscrub/ctests/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #6: global_enkf ...................... Passed 1779.50 sec 2/6 Test #2: rtma ............................. Passed 1938.22 sec 3/6 Test #1: global_4denvar ................... Passed 1991.97 sec 4/6 Test #5: hafs_3denvar_hybens .............. Passed 2000.38 sec 5/6 Test #4: hafs_4denvar_glbens .............. Passed 2310.42 sec

The rrfs test also failed on hercules [jjung@hercules-login-3 build]$ ctest Test project /work/noaa/nesdis-rdo1/jjung/noscrub/ctests/update/build Start 1: global_4denvar 1/6 Test #1: global_4denvar ................... Passed 1741.09 sec Start 2: rtma 2/6 Test #2: rtma ............................. Passed 965.23 sec Start 3: rrfs_3denvar_rdasens 3/6 Test #3: rrfs_3denvar_rdasens .............***Failed 484.58 sec Start 4: hafs_4denvar_glbens 4/6 Test #4: hafs_4denvar_glbens .............. Passed 1161.19 sec Start 5: hafs_3denvar_hybens 5/6 Test #5: hafs_3denvar_hybens .............. Passed 1094.42 sec Start 6: global_enkf 6/6 Test #6: global_enkf ...................... Passed 847.14 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 6293.65 sec

The following tests FAILED: 3 - rrfs_3denvar_rdasens (Failed) The memory for rrfs_3denvar_rdasens_loproc_updat is 1106724 KBs. This has exceeded maximum allowable memory of 1098631 KBs, resulting in Failure memthresh of the regression test.

Checklist

Please add @ADCollard, @DavidHuber-NOAA and @InnocentSouopgui-NOAA as reviewers.

wx20jjung commented 2 months ago

@DavidHuber-NOAA I have been struggling with this. I do not know what the true intent of nread is. Was it developed for 2D (horizontal) observations or 3D data like the radiances? I interpret the code as being for the former. My argument is based on the allocate statement for nloc and icrit a few lines down. Both of these variables use itxmax, which is the total counts of a 2D spatial (horizontal) grid. To be consistent, I feel ncounts1 (and nread) should also be from a 2D spatial grid and not multiplied by the number of channels.

An alternative would be to add another argument to combine_radobs specifically for ncounts and ncounts1, which would all be long integers from and to the various read routines.

DavidHuber-NOAA commented 2 months ago

@wx20jjung I agree with you on the intent; that it was developed for 2D observations and so it is unlikely to every overflow with your changes. That said, I think I like the alternative approach you proposed just to be safe.

wx20jjung commented 2 months ago

As per a conversation with @DavidHuber-NOAA, I've separated the total channel counts read (nread) from the thinning box (itxmax). Now, if the channel counts read becomes larger than the integer size, it will only affect the write statements. I've re-run the ctests on hera.

Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests/update/build Start 2: rtma Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 1: global_4denvar Start 6: global_enkf Start 3: rrfs_3denvar_rdasens 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 496.10 sec 2/6 Test #2: rtma ............................. Passed 972.37 sec 3/6 Test #5: hafs_3denvar_hybens .............. Passed 1106.76 sec 4/6 Test #6: global_enkf ...................... Passed 1159.74 sec 5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1345.07 sec 6/6 Test #1: global_4denvar ................... Passed 1969.79 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 1969.84 sec

The hafs_4denvar_glbens test failed the time limit test. I restarted the test and it passed.

ctest --rerun-failed --output-on-failure Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests/update/build Start 4: hafs_4denvar_glbens 1/1 Test #4: hafs_4denvar_glbens .............. Passed 1348.48 sec 100% tests passed, 0 tests failed out of 1 Total Test time (real) = 1348.51 sec

I will run the ctests on other machines when I get feedback from the rest of the reviewers.

RussTreadon-NOAA commented 2 months ago

WCOSS2 (Dogwood) ctests

Install wx20jjung:IASI_debug_fix at a41a78aa0 on Dogwood. Use develop at 9f44c879 as the control. ctests results are as follows

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr790/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  729.88 sec
2/6 Test #6: global_enkf ......................   Passed  972.64 sec
3/6 Test #5: hafs_3denvar_hybens ..............   Passed  1153.84 sec
4/6 Test #4: hafs_4denvar_glbens ..............   Passed  1334.73 sec
5/6 Test #1: global_4denvar ...................   Passed  1864.83 sec
6/6 Test #2: rtma .............................   Passed  1869.75 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1869.76 sec
wx20jjung commented 2 months ago

I've made the various changes, tested everything both in the current branch (IASI_debug_fix) and in my IASI-NG branch in a single cycle test. Both results are as expected. The various satellite parameters in the gsistat (counts, bias, std dev, etc) are identical with (iasi_debug_fix) and without (develop) these changes. I found no more '****' fields in any of the output.

wx20jjung commented 2 months ago

ctests on herculese completed as expected [jjung@hercules-login-4 build]$ ctest -j 6 Test project /work/noaa/nesdis-rdo1/jjung/noscrub/ctests/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 613.04 sec 2/6 Test #6: global_enkf ...................... Passed 785.35 sec 3/6 Test #2: rtma ............................. Passed 1084.97 sec 4/6 Test #5: hafs_3denvar_hybens .............. Passed 1152.51 sec 5/6 Test #4: hafs_4denvar_glbens .............. Passed 1278.23 sec 6/6 Test #1: global_4denvar ................... Passed 1741.15 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1741.16 sec

Results from jet had 2 problems: Test project /lfs5/HFIP/hfv3gfs/Jim.Jung/noscrub/ctests/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #6: global_enkf ...................... Passed 1225.07 sec 2/6 Test #2: rtma ............................. Passed 1391.81 sec 3/6 Test #5: hafs_3denvar_hybens .............. Passed 1637.28 sec 4/6 Test #4: hafs_4denvar_glbens ..............***Failed 1697.58 sec 5/6 Test #1: global_4denvar ................... Passed 2410.05 sec

The rrfs tests timed out and did not complete. The runtime for hafs_4denvar_glbens_loproc_updat is 335.324241 seconds. This has exceeded maximum allowable threshold time of 328.320275 seconds, resulting in Failure time-thresh of the regression test.

My hera tests have been in the queue for the past 12 hours. I removed them.

RussTreadon-NOAA commented 2 months ago

WCOSS2 (Cactus) ctests Install wx20jjung:IASI_debug_fix at 74947da50 on Dogwood. Run ctests with following results:

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr790/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  734.97 sec
2/6 Test #6: global_enkf ......................   Passed  864.32 sec
3/6 Test #2: rtma .............................   Passed  971.09 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1215.19 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1334.75 sec
6/6 Test #1: global_4denvar ...................   Passed  1683.76 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1683.77 sec