Closed RussTreadon-NOAA closed 3 months ago
@RussTreadon-NOAA Thanks a lot for all your efforts/help on sorting out /resolving those problems. A supplement to the behavior of global_4densvar_loproc test with debug mode GSI . When turn off cris_cads and iasi_cades and removed a few bufr obs files in gsiparm.anl as shown below : ("<" and ">" are for the original and changed versions.
64c64
< tcp_width=70.0,tcp_ermax=7.35,cris_cads=.false.,iasi_cads=.false.,
---
> tcp_width=70.0,tcp_ermax=7.35,cris_cads=.true.,iasi_cads=.true.,
122a123
> iasibufr iasi metop-a iasi_metop-a 0.0 1 1
138a140,142
> crisbufr cris npp cris_npp 0.0 1 0
> crisfsbufr cris-fsr npp cris-fsr_npp 0.0 1 0
> crisfsbufr cris-fsr n20 cris-fsr_n20 0.0 1 0
The GSI in debug mode would become idle after write-out as
nid001959.cactus.wcoss2.ncep.noaa.gov 46: READ_SATWND,nread,ndata,nreal,nodata= 2889058 1227413 26
nid001959.cactus.wcoss2.ncep.noaa.gov 46: 2454826
When the job was killed due to time limit reached, the error message would show :
....
noaa.gov 5: libmpi_intel.so.1 00001482CE1DC231 PMPI_Allreduce Unknown Unknown
nid001742.cactus.wcoss2.ncep.noaa.gov 5: libmpifort_intel. 00001482D42E8856 mpi_allreduce_ Unknown Unknown
nid001742.cactus.wcoss2.ncep.noaa.gov 5: gsi.x 0000000001C5D5C2 read_obsmod_mp_re 1978 read_obs.F90
nid001742.cactus.wcoss2.ncep.noaa.gov 5: gsi.x 0000000001763909 observermod_mp_se 329 observer.F90
....
The following changes made to src/gsi/read_cris,f90
and src/gsi/read_iasi.f90
allow global_4denvar to run to completion using a gsi.x
built in debug mode
read_cris.f90
@@ -176,7 +176,7 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
real(r_kind),allocatable,dimension(:,:):: data_all
real(r_kind) cdist,disterr,disterrmax,dlon00,dlat00,r01
- logical :: outside,iuse,assim,valid,clear
+ logical :: outside,iuse,assim,valid,clear,valid_cluster
logical :: cris,quiet
integer(i_kind) :: ifov, ifor, iscn, instr, ioff, ilat, ilon, sensorindex_cris
@@ -847,7 +847,14 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
if ( cris_cads ) then
call ufbseq(lnbufr,imager_info,83,7,iret,'CRISCS')
- if ( iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
+
+ if ( sum(imager_info(3,:)) > zero ) then
+ valid_cluster = .true.
+ else
+ valid_cluster = .false.
+ endif
+
+ if ( valid_cluster .and. iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
imager_info(3,1) >= zero .and. imager_coeff ) then ! if imager cluster info exists
imager_mean = zero
imager_std_dev = zero
@@ -880,7 +887,11 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
iexponent = -(nint(imager_info(77,i)) -11) ! channel 15 radiance std dev for each cluster.
imager_info(78,i) = imager_info(78,i) * imager_conversion(1) * (ten ** iexponent)
- call crtm_planck_temperature(sensorindex_imager,4,imager_info(76,i),data_all(maxinfo+7+j,itx))
+ if ( imager_info(76,i) > zero .and. imager_info(76,i) < 99999._r_kind ) then
+ call crtm_planck_temperature(sensorindex_imager,4,imager_info(76,i),data_all(maxinfo+7+j,itx))
+ else
+ data_all(maxinfo+7+j,itx) = tbmin
+ endif
data_all(maxinfo+7+j,itx) = max(data_all(maxinfo+7+j,itx),zero)
iexponent = -(nint(imager_info(80,i)) -11) ! channel 16 radiance for each cluster
@@ -890,7 +901,11 @@ subroutine read_cris(mype,val_cris,ithin,isfcalc,rmesh,jsatid,gstime,&
iexponent = -(nint(imager_info(82,i)) -11) ! channel 16 radiance std dev for each cluster.
imager_info(83,i) = imager_info(83,i) * imager_conversion(2) * (ten ** iexponent)
- call crtm_planck_temperature(sensorindex_imager,5,imager_info(81,i),data_all(maxinfo+14+j,itx))
+ if ( imager_info(81,i) > zero .and. imager_info(81,i) < 99999._r_kind ) then
+ call crtm_planck_temperature(sensorindex_imager,5,imager_info(81,i),data_all(maxinfo+14+j,itx))
+ else
+ data_all(maxinfo+14+j,itx) = tbmin
+ endif
data_all(maxinfo+14+j,itx) = max(data_all(maxinfo+14+j,itx),zero)
read_iasi.f90
@@ -206,7 +206,7 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
real(r_kind),allocatable,dimension(:,:):: data_all
real(r_kind) cdist,disterr,disterrmax,dlon00,dlat00
- logical :: outside,iuse,assim,valid
+ logical :: outside,iuse,assim,valid,valid_cluster
logical :: iasi,quiet,cloud_info
integer(i_kind) :: ifov, instr, iscn, ioff, sensorindex_iasi
@@ -812,7 +812,16 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
if ( iasi_cads ) then
call ufbseq(lnbufr,imager_info,33,7,iret,'IASIL1CS')
- if (iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
+
+ if ( sum(imager_info(3,:)) > zero ) then
+ valid_cluster = .true.
+ else
+ valid_cluster = .false.
+ write(6,100) imager_info(3,:)
+100 format('invalid cluster: imager_info(3,:)= ',7(g13.6,1x))
+ endif
+
+ if ( valid_cluster .and. iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
imager_info(3,1) >= zero .and. imager_coeff ) then ! if imager cluster info exists
imager_mean = zero
imager_std_dev = zero
@@ -843,7 +852,16 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
iexponent = -(nint(imager_info(27,i))-5 ) ! channel 4 radiance std dev for each cluster.
imager_info(28,i) = imager_info(28,i) * (ten ** iexponent)
+ if (imager_info(26,i) == zero) then
+ write(6,101) j,i,imager_info(26,i)
+101 format('j,i= ',2(i6,1x),' imager_info(26,i)= ',g13.6)
+ endif
+
+ if ( imager_info(26,i) > zero .and. imager_info(26,i) < 99999._r_kind ) then ! radiance bounds
call crtm_planck_temperature(sensorindex_imager,2,imager_info(26,i),data_all(maxinfo+7+j,itx))
+ else
+ data_all(maxinfo+7+j,itx) = tbmin
+ endif
data_all(maxinfo+7+j,itx) = max(data_all(maxinfo+7+j,itx),zero)
iexponent = -(nint(imager_info(30,i))-5 ) ! channel 5 radiance for each cluster
@@ -852,7 +870,16 @@ subroutine read_iasi(mype,val_iasi,ithin,isfcalc,rmesh,jsatid,gstime,&
iexponent = -(nint(imager_info(32,i))-5 ) ! channel 5 radiance std dev for each cluser.
imager_info(33,i) = imager_info(33,i) * (ten ** iexponent)
- call crtm_planck_temperature(sensorindex_imager,3,imager_info(31,i),data_all(maxinfo+14+j,itx))
+ if (imager_info(31,i) == zero) then
+ write(6,102) j,i,imager_info(31,i)
+102 format('j,i= ',2(i6,1x),' imager_info(31,i)= ',g13.6)
+ endif
+
+ if ( imager_info(31,i) > zero .and. imager_info(31,i) < 99999._r_kind ) then ! radiance bounds
+ call crtm_planck_temperature(sensorindex_imager,3,imager_info(31,i),data_all(maxinfo+14+j,itx))
+ else
+ data_all(maxinfo+14+j,itx) = tbmin
+ endif
data_all(maxinfo+14+j,itx) = max(data_all(maxinfo+14+j,itx),zero)
end do imager_cluster_info
Obviously the write statements are for information only. If the above changes are an acceptable way to handle the error condition, the write statements should be removed.
The changes proposed work fine. I am proposing a different solution if no comments are planned to be added. The current if statement:
if ( sum(imager_info(3,:)) > zero ) then
could be added to the existing if statement a few lines down and the "imager_info(3,1) >= zero" could be removed.
if (iret == 7 .and. imager_info(3,1) <= 100.0_r_kind .and. &
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 558.89 sec 2/6 Test #2: rtma ............................. Passed 970.67 sec 3/6 Test #5: hafs_3denvar_hybens .............. Passed 1101.60 sec 4/6 Test #6: global_enkf ...................... Passed 1137.70 sec 5/6 Test #4: hafs_4denvar_glbens ..............***Failed 1164.99 sec 6/6 Test #1: global_4denvar ................... Passed 1914.85 sec
The hafs_4denvar_glbens ran 0.6 seconds too slow. I suggest moving forward with either of these changes.
This error is concerning. Zero is a legitimate value for a cluster sizes. In the entries I looked at, the values should be set to the BUFR equivalent of missing. The radiance values should also be missing, not zero.
I will be running more tests to determine which data sets the zeros appear, NESDIS operations or a dbnet provider. This error should be fixed upstream.
Thank you @wx20jjung for this change. It is much cleaner than what I propose. I am testing your change on WCOSS.
WCOSS2 test
Install format_changes
on Cactus. Add the changes outlined above to read_cris.f90
and read_iasi.f90
. Compile code in Release
(optimized) mode. Run ctests with following results
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr773/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_rdasens
Start 4: hafs_4denvar_glbens
Start 5: hafs_3denvar_hybens
Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 788.94 sec
2/6 Test #6: global_enkf ...................... Passed 853.15 sec
3/6 Test #2: rtma ............................. Passed 1216.24 sec
4/6 Test #5: hafs_3denvar_hybens .............. Passed 1224.52 sec
5/6 Test #4: hafs_4denvar_glbens .............. Passed 1224.52 sec
6/6 Test #1: global_4denvar ................... Passed 1924.65 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 1924.76 sec
Recompile in debug mode. Only global_4denvar
runs with CADS active. Running global_4denvar
in debug mode generates the error
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
gsi.x 0000000007F0328B Unknown Unknown Unknown
libpthread-2.31.s 000014983E9878C0 Unknown Unknown Unknown
gsi.x 0000000007945A52 crtm_planck_funct 484 CRTM_Planck_Functions.f90
gsi.x 000000000521A6E3 read_iasi_ 846 read_iasi.f90
A zero radiance is being passed to crtm_planck_temperature
. I cross checked my modified read_cris.f90
and read_iasi.f90
with the files in /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/src/gsi
. My modifications match Jim's.
The WCOSS2 failure indicates that we still need to check the radiance being passed into crtm_planck_temperature
.
Add checks to ensure non-zero radiance is passed to crtm_planck_temperature
in CADS sections of read_cris.f90
and read_iasi.f90
. With these additional changes, the debug gsi.x
gets past IASI and CrIS reads for the global_4denvar
case.
I've made several changes to both read_cris.f90 and read_iasi.f90 to wrap if (a positive radiance) then ... @RussTreadon-NOAA would you test these changes on wcoss before I progress further? The two files are in the same place you pulled them before on hera /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/src/gsi. My ctests on hera were successful.
Test project /scratch1/NCEPDEV/jcsda/Jim.Jung/scrub/ctests_russ/update/build Start 1: global_4denvar Start 2: rtma Start 3: rrfs_3denvar_rdasens Start 4: hafs_4denvar_glbens Start 5: hafs_3denvar_hybens Start 6: global_enkf 1/6 Test #3: rrfs_3denvar_rdasens ............. Passed 979.55 sec 2/6 Test #2: rtma ............................. Passed 1574.65 sec 3/6 Test #6: global_enkf ...................... Passed 1626.18 sec 4/6 Test #5: hafs_3denvar_hybens .............. Passed 1650.68 sec 5/6 Test #4: hafs_4denvar_glbens .............. Passed 1713.64 sec 6/6 Test #1: global_4denvar ................... Passed 2392.39 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 2392.47 sec
@wx20jjung , your updated read_cris.f90
and read_iasi.f90
have been copied to a working copy of format_changes
on Cactus. The len
for character variable jsatid
was changed to len=*
as per format_changes
.
gsi.x
and enkf.x
were then built in debug mode. global_4denvar
was run using the debug gsi.x
. The _loprocupdat job is still running and is up to the first outer loop. The revised read_cris.f90
and read_iasi.f90
successfully read in and processed the data in debug mode.
Please merge your modified read_cris.f90
and read_iasi.f90
into your branch format_changes
. The description for PR #773 can be updated to mention this issue and your modifications to read_cris.f90
and read_iasi.f90
. There are some src/enkf
changes which also need to be merged into format_changes
Once all changes have been committed to format_changes
, I can build the updated format_changes
on Cactus in debug and optimized modes and run ctests for both configurations.
Tagging @CatherineThomas-NOAA for awareness.
@TingLei-daprediction found that
gsi.x
aborts inread_iasi.f90
andread_cris.f90
with the error messageforrtl: error (73): floating divide by zero
Addition of prints to the code reveals that the radiance for certain elements of the AVHRR cluster can occasionally be
0.0
. CRTM routinecrtm_planck_temperature
converts radiances to brightness temperatures. The radiance value is a divisor in this calculation. Hence thedivide by zero
error.Debug runs of
gsi.x
also found instances in which the cluster size for each of the 7 elements in the cluster is0.0
. The image size summed across all 7 elements is used as a divisor in both the IASI and CrIS read routines. This, too, triggers adivide by zero
error.Logic has been added to a working copy of GSI
develop
at a5e2a43 to address these situations. With this logic in place the debuggsi.x
runs to completion with CADS active.This issue is opened to document the problem and its resolution.