NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0
66 stars 150 forks source link

CADS code seg faults in debug mode #775

Closed RussTreadon-NOAA closed 3 months ago

RussTreadon-NOAA commented 4 months ago

@TingLei-daprediction found that gsi.x aborts in read_iasi.f90 and read_cris.f90 with the error message forrtl: error (73): floating divide by zero

Addition of prints to the code reveals that the radiance for certain elements of the AVHRR cluster can occasionally be 0.0. CRTM routine crtm_planck_temperature converts radiances to brightness temperatures. The radiance value is a divisor in this calculation. Hence the divide by zero error.

Debug runs of gsi.x also found instances in which the cluster size for each of the 7 elements in the cluster is 0.0. The image size summed across all 7 elements is used as a divisor in both the IASI and CrIS read routines. This, too, triggers a divide by zero error.

Logic has been added to a working copy of GSI develop at a5e2a43 to address these situations. With this logic in place the debug gsi.x runs to completion with CADS active.

This issue is opened to document the problem and its resolution.

TingLei-NOAA commented 4 months ago

@RussTreadon-NOAA Thanks a lot for all your efforts/help on sorting out /resolving those problems. A supplement to the behavior of global_4densvar_loproc test with debug mode GSI . When turn off cris_cads and iasi_cades and removed a few bufr obs files in gsiparm.anl as shown below : ("<" and ">" are for the original and changed versions.

64c64
<    tcp_width=70.0,tcp_ermax=7.35,cris_cads=.false.,iasi_cads=.false.,
---
>    tcp_width=70.0,tcp_ermax=7.35,cris_cads=.true.,iasi_cads=.true.,
122a123
>    iasibufr       iasi        metop-a     iasi_metop-a        0.0     1     1
138a140,142
>    crisbufr       cris        npp         cris_npp            0.0     1     0
>    crisfsbufr     cris-fsr    npp         cris-fsr_npp        0.0     1     0
>    crisfsbufr     cris-fsr    n20         cris-fsr_n20        0.0     1     0

The GSI in debug mode would become idle after write-out as

nid001959.cactus.wcoss2.ncep.noaa.gov 46:  READ_SATWND,nread,ndata,nreal,nodata=     2889058     1227413          26
nid001959.cactus.wcoss2.ncep.noaa.gov 46:      2454826

When the job was killed due to time limit reached, the error message would show :

....
noaa.gov 5: libmpi_intel.so.1  00001482CE1DC231  PMPI_Allreduce        Unknown  Unknown
nid001742.cactus.wcoss2.ncep.noaa.gov 5: libmpifort_intel.  00001482D42E8856  mpi_allreduce_        Unknown  Unknown
nid001742.cactus.wcoss2.ncep.noaa.gov 5: gsi.x              0000000001C5D5C2  read_obsmod_mp_re        1978  read_obs.F90
nid001742.cactus.wcoss2.ncep.noaa.gov 5: gsi.x              0000000001763909  observermod_mp_se         329  observer.F90
....
RussTreadon-NOAA commented 4 months ago

The following changes made to src/gsi/read_cris,f90 and src/gsi/read_iasi.f90 allow global_4denvar to run to completion using a gsi.x built in debug mode

read_cris.f90

@@ -176,7 +176,7 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
   real(r_kind),allocatable,dimension(:,:):: data_all
   real(r_kind) cdist,disterr,disterrmax,dlon00,dlat00,r01

-  logical          :: outside,iuse,assim,valid,clear
+  logical          :: outside,iuse,assim,valid,clear,valid_cluster
   logical          :: cris,quiet

   integer(i_kind)  :: ifov, ifor, iscn, instr, ioff, ilat, ilon, sensorindex_cris
@@ -847,7 +847,14 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&

            if ( cris_cads ) then
              call ufbseq(lnbufr,imager_info,83,7,iret,'CRISCS')
-             if ( iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
+
+             if ( sum(imager_info(3,:)) > zero ) then
+                valid_cluster = .true.
+             else
+                valid_cluster = .false.
+             endif
+             
+             if ( valid_cluster .and. iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
                   imager_info(3,1) >= zero .and. imager_coeff ) then   ! if imager cluster info exists
                imager_mean = zero
                imager_std_dev = zero
@@ -880,7 +887,11 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
                  iexponent = -(nint(imager_info(77,i)) -11)                        ! channel 15 radiance std dev for each cluster.
                  imager_info(78,i) =  imager_info(78,i) * imager_conversion(1) * (ten ** iexponent) 

-                 call crtm_planck_temperature(sensorindex_imager,4,imager_info(76,i),data_all(maxinfo+7+j,itx))
+                 if ( imager_info(76,i) > zero .and. imager_info(76,i) < 99999._r_kind ) then                     
+                    call crtm_planck_temperature(sensorindex_imager,4,imager_info(76,i),data_all(maxinfo+7+j,itx))
+                 else
+                    data_all(maxinfo+7+j,itx) = tbmin
+                 endif
                  data_all(maxinfo+7+j,itx) = max(data_all(maxinfo+7+j,itx),zero)

                  iexponent = -(nint(imager_info(80,i)) -11)                        ! channel 16 radiance for each cluster
@@ -890,7 +901,11 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
                  iexponent = -(nint(imager_info(82,i)) -11)                        ! channel 16 radiance std dev for each cluster.
                  imager_info(83,i) =  imager_info(83,i) * imager_conversion(2) * (ten ** iexponent)

-                 call crtm_planck_temperature(sensorindex_imager,5,imager_info(81,i),data_all(maxinfo+14+j,itx))
+                 if ( imager_info(81,i) > zero .and. imager_info(81,i) < 99999._r_kind ) then
+                    call crtm_planck_temperature(sensorindex_imager,5,imager_info(81,i),data_all(maxinfo+14+j,itx))
+                 else
+                    data_all(maxinfo+14+j,itx) = tbmin
+                 endif
                  data_all(maxinfo+14+j,itx) = max(data_all(maxinfo+14+j,itx),zero)

read_iasi.f90

@@ -206,7 +206,7 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
   real(r_kind),allocatable,dimension(:,:):: data_all
   real(r_kind) cdist,disterr,disterrmax,dlon00,dlat00

-  logical          :: outside,iuse,assim,valid
+  logical          :: outside,iuse,assim,valid,valid_cluster
   logical          :: iasi,quiet,cloud_info

   integer(i_kind)  :: ifov, instr, iscn, ioff, sensorindex_iasi
@@ -812,7 +812,16 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&

            if ( iasi_cads ) then
              call ufbseq(lnbufr,imager_info,33,7,iret,'IASIL1CS')
-             if (iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
+
+             if ( sum(imager_info(3,:)) > zero ) then
+                valid_cluster = .true.
+             else
+                valid_cluster = .false.
+                write(6,100) imager_info(3,:)
+100             format('invalid cluster: imager_info(3,:)= ',7(g13.6,1x))
+             endif
+             
+             if ( valid_cluster .and. iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
                   imager_info(3,1) >= zero .and. imager_coeff ) then   ! if imager cluster info exists
                imager_mean = zero
                imager_std_dev = zero
@@ -843,7 +852,16 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
                  iexponent = -(nint(imager_info(27,i))-5 )                        ! channel 4 radiance std dev for each cluster.
                  imager_info(28,i) =  imager_info(28,i) * (ten ** iexponent)

+                 if (imager_info(26,i) == zero) then
+                    write(6,101) j,i,imager_info(26,i)
+101                 format('j,i= ',2(i6,1x),' imager_info(26,i)= ',g13.6)
+                 endif
+                 
+                 if ( imager_info(26,i) > zero .and. imager_info(26,i) < 99999._r_kind ) then  ! radiance bounds
                  call crtm_planck_temperature(sensorindex_imager,2,imager_info(26,i),data_all(maxinfo+7+j,itx))
+                 else
+                    data_all(maxinfo+7+j,itx) = tbmin
+                 endif
                  data_all(maxinfo+7+j,itx) = max(data_all(maxinfo+7+j,itx),zero)

                  iexponent = -(nint(imager_info(30,i))-5 )                        ! channel 5 radiance for each cluster
@@ -852,7 +870,16 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
                  iexponent = -(nint(imager_info(32,i))-5 )                        ! channel 5 radiance std dev for each cluser.
                  imager_info(33,i) =  imager_info(33,i) * (ten ** iexponent)

-                 call crtm_planck_temperature(sensorindex_imager,3,imager_info(31,i),data_all(maxinfo+14+j,itx))
+                 if (imager_info(31,i) == zero) then
+                    write(6,102) j,i,imager_info(31,i)
+102                 format('j,i= ',2(i6,1x),' imager_info(31,i)= ',g13.6)
+                 endif
+                 
+                 if ( imager_info(31,i) > zero .and. imager_info(31,i) < 99999._r_kind ) then  ! radiance bounds
+                    call crtm_planck_temperature(sensorindex_imager,3,imager_info(31,i),data_all(maxinfo+14+j,itx))
+                 else
+                    data_all(maxinfo+14+j,itx) = tbmin
+                 endif
                  data_all(maxinfo+14+j,itx) = max(data_all(maxinfo+14+j,itx),zero)

                end do imager_cluster_info

Obviously the write statements are for information only. If the above changes are an acceptable way to handle the error condition, the write statements should be removed.

wx20jjung commented 4 months ago

The changes proposed work fine. I am proposing a different solution if no comments are planned to be added. The current if statement:

        if ( sum(imager_info(3,:)) > zero ) then

could be added to the existing if statement a few lines down and the "imager_info(3,1) >= zero" could be removed.

          if (iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &

Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 558.89 sec 2/6 Test #2: rtma ............................. Passed 970.67 sec 3/6 Test #5: hafs_3denvar_hybens .............. Passed 1101.60 sec 4/6 Test #6: global_enkf ...................... Passed 1137.70 sec 5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1164.99 sec 6/6 Test #1: global_4denvar ................... Passed 1914.85 sec

The hafs_4denvar_glbens ran 0.6 seconds too slow. I suggest moving forward with either of these changes.

This error is concerning. Zero is a legitimate value for a cluster sizes. In the entries I looked at, the values should be set to the BUFR equivalent of missing. The radiance values should also be missing, not zero.

I will be running more tests to determine which data sets the zeros appear, NESDIS operations or a dbnet provider. This error should be fixed upstream.

RussTreadon-NOAA commented 4 months ago

Thank you @wx20jjung for this change. It is much cleaner than what I propose. I am testing your change on WCOSS.

RussTreadon-NOAA commented 4 months ago

WCOSS2 test Install format_changes on Cactus. Add the changes outlined above to read_cris.f90 and read_iasi.f90. Compile code in Release (optimized) mode. Run ctests with following results

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr773/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............   Passed  788.94 sec
2/6 Test #6: global_enkf ......................   Passed  853.15 sec
3/6 Test #2: rtma .............................   Passed  1216.24 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1224.52 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1224.52 sec
6/6 Test #1: global_4denvar ...................   Passed  1924.65 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 1924.76 sec

Recompile in debug mode. Only global_4denvar runs with CADS active. Running global_4denvar in debug mode generates the error

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
gsi.x              0000000007F0328B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014983E9878C0  Unknown               Unknown  Unknown
gsi.x              0000000007945A52  crtm_planck_funct         484  CRTM_Planck_Functions.f90
gsi.x              000000000521A6E3  read_iasi_                846  read_iasi.f90

A zero radiance is being passed to crtm_planck_temperature. I cross checked my modified read_cris.f90 and read_iasi.f90 with the files in /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/src/gsi. My modifications match Jim's.

The WCOSS2 failure indicates that we still need to check the radiance being passed into crtm_planck_temperature.

RussTreadon-NOAA commented 4 months ago

Add checks to ensure non-zero radiance is passed to crtm_planck_temperature in CADS sections of read_cris.f90 and read_iasi.f90. With these additional changes, the debug gsi.x gets past IASI and CrIS reads for the global_4denvar case.

wx20jjung commented 4 months ago

I've made several changes to both read_cris.f90 and read_iasi.f90 to wrap if (a positive radiance) then ... @RussTreadon-NOAA would you test these changes on wcoss before I progress further? The two files are in the same place you pulled them before on hera /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/src/gsi. My ctests on hera were successful.

Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 979.55 sec 2/6 Test #2: rtma ............................. Passed 1574.65 sec 3/6 Test #6: global_enkf ...................... Passed 1626.18 sec 4/6 Test #5: hafs_3denvar_hybens .............. Passed 1650.68 sec 5/6 Test #4: hafs_4denvar_glbens .............. Passed 1713.64 sec 6/6 Test #1: global_4denvar ................... Passed 2392.39 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 2392.47 sec

RussTreadon-NOAA commented 4 months ago

@wx20jjung , your updated read_cris.f90 and read_iasi.f90 have been copied to a working copy of format_changes on Cactus. The len for character variable jsatid was changed to len=* as per format_changes.

gsi.x and enkf.x were then built in debug mode. global_4denvar was run using the debug gsi.x. The _loprocupdat job is still running and is up to the first outer loop. The revised read_cris.f90 and read_iasi.f90 successfully read in and processed the data in debug mode.

Please merge your modified read_cris.f90 and read_iasi.f90 into your branch format_changes. The description for PR #773 can be updated to mention this issue and your modifications to read_cris.f90 and read_iasi.f90. There are some src/enkf changes which also need to be merged into format_changes

Once all changes have been committed to format_changes, I can build the updated format_changes on Cactus in debug and optimized modes and run ctests for both configurations.

RussTreadon-NOAA commented 3 months ago

Tagging @CatherineThomas-NOAA for awareness.