Closed wx20jjung closed 2 months ago
@DavidHuber-NOAA I have been struggling with this. I do not know what the true intent of nread is. Was it developed for 2D (horizontal) observations or 3D data like the radiances? I interpret the code as being for the former. My argument is based on the allocate statement for nloc and icrit a few lines down. Both of these variables use itxmax, which is the total counts of a 2D spatial (horizontal) grid. To be consistent, I feel ncounts1 (and nread) should also be from a 2D spatial grid and not multiplied by the number of channels.
An alternative would be to add another argument to combine_radobs specifically for ncounts and ncounts1, which would all be long integers from and to the various read routines.
@wx20jjung I agree with you on the intent: it was developed for 2D observations, so it is unlikely to ever overflow with your changes. That said, I think I like the alternative approach you proposed, just to be safe.
As per a conversation with @DavidHuber-NOAA, I've separated the total channel counts read (nread) from the thinning box size (itxmax). Now, if the channel count read exceeds the range of an integer, it will only affect the write statements. I've re-run the ctests on hera.
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests/update/build
Start 2: rtma
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 1: global_4denvar
Start 6: global_enkf
Start 3: rrfs_3denvar_rdasens
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 496.10 sec
2/6 Test #2: rtma ............................. Passed 972.37 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1106.76 sec
4/6 Test #6: global_enkf ...................... Passed 1159.74 sec
5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1345.07 sec
6/6 Test #1: global_4denvar ................... Passed 1969.79 sec
83% tests passed, 1 tests failed out of 6
Total Test time (real) = 1969.84 sec
The hafs_4denvar_glbens test failed the time limit test. I restarted the test and it passed.
ctest --rerun-failed --output-on-failure
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests/update/build
Start 4: hafs_4denvar_glbens
1/1 Test #4: hafs_4denvar_glbens .............. Passed 1348.48 sec
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 1348.51 sec
I will run the ctests on other machines when I get feedback from the rest of the reviewers.
WCOSS2 (Dogwood) ctests
Install wx20jjung:IASI_debug_fix at a41a78aa0 on Dogwood. Use develop at 9f44c879 as the control. The ctest results are as follows:
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr790/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 729.88 sec
2/6 Test #6: global_enkf ...................... Passed 972.64 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1153.84 sec
4/6 Test #4: hafs_4denvar_glbens .............. Passed 1334.73 sec
5/6 Test #1: global_4denvar ................... Passed 1864.83 sec
6/6 Test #2: rtma ............................. Passed 1869.75 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1869.76 sec
I've made the various changes and tested everything in a single-cycle test, both in the current branch (IASI_debug_fix) and in my IASI-NG branch. Both results are as expected. The various satellite parameters in the gsistat file (counts, bias, standard deviation, etc.) are identical with (IASI_debug_fix) and without (develop) these changes. I found no more '****' fields in any of the output.
The ctests on Hercules completed as expected:
[jjung@hercules-login-4 build]$ ctest -j 6
Test project /work/noaa/nesdis-rdo1/jjung/noscrub/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 613.04 sec
2/6 Test #6: global_enkf ...................... Passed 785.35 sec
3/6 Test #2: rtma ............................. Passed 1084.97 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1152.51 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1278.23 sec
6/6 Test #1: global_4denvar ................... Passed 1741.15 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1741.16 sec
Results from jet had 2 problems:
Test project /lfs5/HFIP/hfv3gfs/Jim.Jung/noscrub/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #6: global_enkf ...................... Passed 1225.07 sec
2/6 Test #2: rtma ............................. Passed 1391.81 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1637.28 sec
4/6 Test #4: hafs_4denvar_glbens ..............***Failed 1697.58 sec
5/6 Test #1: global_4denvar ................... Passed 2410.05 sec
The rrfs tests timed out and did not complete. The runtime for hafs_4denvar_glbens_loproc_updat is 335.324241 seconds. This has exceeded maximum allowable threshold time of 328.320275 seconds, resulting in Failure time-thresh of the regression test.
My hera tests have been in the queue for the past 12 hours. I removed them.
WCOSS2 (Cactus) ctests
Install wx20jjung:IASI_debug_fix at 74947da50 on Dogwood. Run ctests with the following results:
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr790/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 734.97 sec
2/6 Test #6: global_enkf ...................... Passed 864.32 sec
3/6 Test #2: rtma ............................. Passed 971.09 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1215.19 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1334.75 sec
6/6 Test #1: global_4denvar ................... Passed 1683.76 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1683.77 sec
Description
When the GSI is built in debug mode, the code fails when the read_iasi routine calls combine_radobs. In tracking down this problem, several other minor problems were discovered and fixed.
Resolves #789
The first group of commits are some write format changes I missed in the output of the runtime directory. I also found where the maximum channel number was set to 3000. I changed this array to be dynamic and allocated for each instrument.
The second group of commits resolves two errors I found while trying to fix the debug issue.
The first error was that some of the IASI temperature values were NaNs. I traced this back to missing values in some of the cscale entries (cscale is a scaling factor for IASI radiances in the BUFR file). The missing cscale values are associated with the shortwave side of the water vapor region and the shortwave channels. None of these channels are currently used. The missing values were only found in the direct broadcast data; I did not find an instance where the operational data had missing values.
The second error was an integer overflow in the variable nread, which is passed into combine_radobs. Nread is used in most satellite instrument read routines, but the main failure was from read_iasing (to be added later). Nread is a counter and is ultimately number_of_profiles * number_of_channels. This number exceeded the maximum value an integer can hold and ended up as a negative number. In combine_radobs, nread is compared with itxmax, the total number of elements of the thinning box, to set the number of elements of the task thinning box. In this case, the size of the task thinning box was negative. I added logic to the various read routines to pass the number of profiles kept on each task. This makes nread and itxmax consistent.
This change caused the total counts to be different for some instruments. Here is an example from the gsistat file for IASI.
previous: o-g 01 rad metop-b iasi 331026696 6148957 690460 0.24889E+06 0.24889E+06 0.36047 0.36047
new: o-g 01 rad metop-b iasi 44632896 6148957 690460 0.24889E+06 0.24889E+06 0.36047 0.36047
The total counts after thinning and total counts used are identical, along with the other statistics.
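To make the overflow concrete, here is a minimal Python sketch of what happens when a profiles-times-channels count is held in a 32-bit signed integer that wraps around, which is the typical (though not guaranteed) behavior of a default Fortran integer. The helper `wrap_int32` and the profile and channel counts are hypothetical values chosen only for illustration; they are not taken from the GSI code or from any real data file.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def wrap_int32(value):
    """Return the value a wrapping 32-bit signed integer would hold."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN

# Hypothetical sizes, chosen only so that the product exceeds INT32_MAX.
n_profiles = 200_000
n_channels = 16_921

exact = n_profiles * n_channels   # Python integers never overflow
wrapped = wrap_int32(exact)       # what a wrapping 32-bit counter would show

print(exact)                   # 3384200000
print(wrapped)                 # -910767296
print(wrap_int32(n_profiles))  # 200000, the profile count alone still fits
```

Counting profiles rather than profiles times channels keeps the value on the same scale as a 2D thinning grid, which is why it stays well inside the 32-bit range in this sketch.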
The actual failure of read_iasi in debug mode came down to putting an if() cycle in a different place. Read_iasi is now consistent with read_cris.
Type of change
How Has This Been Tested?
The main testing and cycling experiments were conducted on S4 at C192 resolution. After a single-cycle spinup, I set up a "control" and an "experiment" and ran both independently, changing the GSIEXEC variable to point to the gsi master branch (control) and the IASI_debug_fix branch (experiment). The control and experiment were run through 4 cycles. Static tests were also conducted on Jet.
The ctests on Hera all passed:
[Jim.Jung@hfe10 build]$ ctest -j 6
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/save/ctests/update/build
Start 1: global_4denvar
Start 6: global_enkf
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 492.94 sec
2/6 Test #2: rtma ............................. Passed 970.72 sec
3/6 Test #5: hafs_3denvar_hybens .............. Passed 1043.42 sec
4/6 Test #6: global_enkf ...................... Passed 1118.55 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1159.69 sec
6/6 Test #1: global_4denvar ................... Passed 1909.17 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1909.21 sec
The rrfs ctest timed out on jet; all others passed.
Test project /lfs5/HFIP/hfv3gfs/Jim.Jung/noscrub/ctests/update/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #6: global_enkf ...................... Passed 1779.50 sec
2/6 Test #2: rtma ............................. Passed 1938.22 sec
3/6 Test #1: global_4denvar ................... Passed 1991.97 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 2000.38 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 2310.42 sec
The rrfs test also failed on hercules:
[jjung@hercules-login-3 build]$ ctest
Test project /work/noaa/nesdis-rdo1/jjung/noscrub/ctests/update/build
Start 1: global_4denvar
1/6 Test #1: global_4denvar ................... Passed 1741.09 sec
Start 2: rtma
2/6 Test #2: rtma ............................. Passed 965.23 sec
Start 3: rrfs_3denvar_rdasens
3/6 Test #3: rrfs_3denvar_rdasens .............***Failed 484.58 sec
Start 4: hafs_4denvar_glbens
4/6 Test #4: hafs_4denvar_glbens .............. Passed 1161.19 sec
Start 5: hafs_3denvar_hybens
5/6 Test #5: hafs_3denvar_hybens .............. Passed 1094.42 sec
Start 6: global_enkf
6/6 Test #6: global_enkf ...................... Passed 847.14 sec
83% tests passed, 1 tests failed out of 6
Total Test time (real) = 6293.65 sec
The following tests FAILED:
3 - rrfs_3denvar_rdasens (Failed)
The memory for rrfs_3denvar_rdasens_loproc_updat is 1106724 KBs. This has exceeded maximum allowable memory of 1098631 KBs, resulting in Failure memthresh of the regression test.
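For a sense of how narrow these threshold failures are, the short Python sketch below computes the margin by which a measured value exceeds its limit. The function `threshold_excess` is a hypothetical helper written for this note and is not part of the regression test scripts; the inputs are the memory figures reported above and the hafs_4denvar_glbens runtime figures reported for jet elsewhere in this thread.

```python
def threshold_excess(measured, limit):
    """Return (absolute excess, excess as a percentage of the limit)."""
    excess = measured - limit
    return excess, 100.0 * excess / limit

# Memory threshold failure for rrfs_3denvar_rdasens_loproc_updat (KB).
mem_excess, mem_pct = threshold_excess(1106724, 1098631)
print(f"memory: over by {mem_excess} KB ({mem_pct:.2f}%)")

# Time threshold failure for hafs_4denvar_glbens_loproc_updat (seconds).
time_excess, time_pct = threshold_excess(335.324241, 328.320275)
print(f"time: over by {time_excess:.2f} s ({time_pct:.2f}%)")
```

Both values exceed their limits by only a small fraction (under 1% for memory and roughly 2% for time).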
Checklist
Please add @ADCollard, @DavidHuber-NOAA and @InnocentSouopgui-NOAA as reviewers.