NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

GSI built with debug mode failed in the test global_4denvar on wcoss2 #712

Closed. TingLei-NOAA closed this issue 8 months ago

TingLei-NOAA commented 8 months ago

On WCOSS2, when GSI is built in debug mode, GSI becomes idle and the job is eventually killed for exceeding the wall-clock limit, for example:

PBS: job killed: walltime 12607 exceeded limit 12600

The error message shows:

nid001408.cactus.wcoss2.ncep.noaa.gov 76: forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
gsi.x              0000000007B4769B  Unknown               Unknown  Unknown
libpthread-2.31.s  000014BE749408C0  Unknown               Unknown  Unknown
.......
libmpi_intel.so.1  000014BE70BA2394  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014BE6EF08231  PMPI_Allreduce        Unknown  Unknown
libmpifort_intel.  000014BE75014856  mpi_allreduce_        Unknown  Unknown
gsi.x              0000000000982CA8  m_gpsstats_mp_gen         311  genstats_gps.f90
gsi.x              0000000002600FF7  setuprhsall_              531  setuprhsall.f90

Line 311 in genstats_gps.f90 is

  call mpi_allreduce(toss_gps_sub,toss_gps,nprof_gps,mpi_rtype,mpi_max,&
       mpi_comm_world,ierror)

The reason GSI hangs at this point needs to be investigated. Added on Mar. 15, 2024: another issue was found. loproc_updat != hiproc_updat, loproc_contrl != hiproc_contrl, and hiproc_updat != hiproc_contrl; only loproc_contrl = loproc_updat.

RussTreadon-NOAA commented 8 months ago

WCOSS2 test

The following has been done on Cactus

  1. clone develop at fca6bea4
  2. change BUILD_TYPE in ush/build.sh from Release to Debug
  3. build gsi.x and enkf.x in debug mode
  4. increase wall clock limit to 3 hours for global_4denvar in regression/regression_param.sh
  5. execute ctest -R global_4denvar

global_4denvar_loproc_updat and global_4denvar_hiproc_updat ran to completion in debug mode. Neither job hung. The loproc job took 2835.414645 seconds to complete. The hiproc job took 1534.105594 seconds to complete.

Interestingly (and disturbingly), the initial gradients of the loproc and hiproc jobs differ in the 10th printed digit. The initial total penalties are identical for all 19 printed digits.

loproc

Initial cost function =  6.584168406980320578E+05
Initial gradient norm =  1.700751272515137998E+03
cost,grad,step,b,step? =   1   0  6.584168406980320578E+05  1.700751272515137998E+03  1.057429134927269976E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  6.547802752899429761E+05  2.113894264916150860E+03  1.927405067994416132E+00  1.302727127739893298E+00  good

hiproc

Initial cost function =  6.584168406980320578E+05
Initial gradient norm =  1.700751274435556979E+03
cost,grad,step,b,step? =   1   0  6.584168406980320578E+05  1.700751274435556979E+03  1.057429135803278131E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  6.547802752834755229E+05  2.113894268161601758E+03  1.927405066764274588E+00  1.302727128193786887E+00  good
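The digit counts quoted above can be checked mechanically. The following is a minimal sketch, not part of the GSI regression tools: the helper name `first_differing_digit` is hypothetical, and it assumes both values are printed in the same format with the same exponent, as in the log lines above.

```python
from typing import Optional


def first_differing_digit(a: str, b: str) -> Optional[int]:
    """Return the 1-based position of the first printed significant digit
    at which two values differ, or None if every printed digit matches.

    Assumes both strings look like '1.700751272515137998E+03' and share
    the same exponent (hypothetical helper, for illustration only).
    """
    mant_a, exp_a = a.strip().upper().split("E")
    mant_b, exp_b = b.strip().upper().split("E")
    if exp_a != exp_b:
        raise ValueError("exponents differ; compare the values numerically instead")
    digits_a = mant_a.lstrip("+-").replace(".", "")
    digits_b = mant_b.lstrip("+-").replace(".", "")
    for pos, (da, db) in enumerate(zip(digits_a, digits_b), start=1):
        if da != db:
            return pos
    return None


# Values copied from the loproc and hiproc logs above.
grad_lo = "1.700751272515137998E+03"
grad_hi = "1.700751274435556979E+03"
cost_lo = "6.584168406980320578E+05"
cost_hi = "6.584168406980320578E+05"

print(first_differing_digit(grad_lo, grad_hi))  # 10
print(first_differing_digit(cost_lo, cost_hi))  # None
```

Comparing the printed strings rather than parsed floats keeps all 19 printed digits, which a double-precision float would not preserve.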

Differences in WCOSS2 results were observed in PR #616 and #692. Refactoring code yielded reproducible results with respect to the control. Now loproc and hiproc debug runs demonstrate lack of reproducibility.

The WCOSS2 build uses hpc-stack with an older version of the intel compiler. The GSI builds on other platforms use spack-stack modules and newer intel compilers. Are we dealing with a compiler or module issue on WCOSS2? Would repeating the above test on other platforms yield non-reproducible loproc and hiproc results?

TingLei-NOAA commented 8 months ago

@RussTreadon-NOAA Thanks. I will dig further into the questions I still have and come back with an update. The failure of global_4denvar due to non-reproducible results (update vs contrl) is reported in https://github.com/NOAA-EMC/GSI/pull/679#issuecomment-1992332298.

RussTreadon-NOAA commented 8 months ago

@xincjin-NOAA 's PR #692 yields reproducible global_4denvar results on WCOSS2 (Cactus). PR #692 is now the head of develop.

RussTreadon-NOAA commented 8 months ago

PR #692 updates the global_4denvar case date to include gmi data (monitored, not assimilated). Repeating the above debug test on Cactus from the current f282a94 head of develop results in a floating invalid segmentation fault in read_ozone.f90.

RussTreadon-NOAA commented 8 months ago

GSI develop at f282a94 seg faults in read_ozone.f90 due to an inconsistency between the mnemonics GSI uses to read the GOME bufr dump file and the mnemonics actually encoded in the file.

The DOYR mnemonic was replaced by MNTH DAYS effective 20240131 18Z. The GOME reader in read_ozone.f90 needs to be updated accordingly. This was done in a working copy of develop on Cactus. After this change gsi.x ran to completion in debug mode in 6474.196832 seconds.
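For illustration only: if downstream logic still needs a day of year after the switch from DOYR to MNTH and DAYS, it can be derived from the year, month, and day. This is a minimal sketch under that assumption; it is not the actual change made to read_ozone.f90, and the function name `day_of_year` is hypothetical.

```python
from datetime import date


def day_of_year(year: int, month: int, day: int) -> int:
    """Derive the day of year from year, month, and day values
    (hypothetical helper; assumes a valid Gregorian calendar date)."""
    return date(year, month, day).timetuple().tm_yday


# 20240131 is the date on which the mnemonic change took effect.
print(day_of_year(2024, 1, 31))  # 31
```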

Issue #716 was opened to document the required changes to read_ozone.f90.

TingLei-NOAA commented 8 months ago

By adding extra debug compiler options ( -init=snan,arrays ), an apparent misuse of variables was found in read_nsstbufr.f90. To document this fix, a draft PR was created on my fork: https://github.com/TingLei-daprediction/GSI/pull/2.
After this fix was applied to both control and update in the global_4denvar test, all runs finished "in time". loproc_contrl and loproc_updat produce identical results, but hiproc_contrl and hiproc_updat give different results. The differences first appear in the initial gradient. @ADCollard @emilyhcliu @XuLi-NOAA Would you please confirm or correct the fix in read_nsstbufr.f90 shown under "changed files" in the above draft PR? BTW: I have not updated my GSI (either control or update) to the current HEAD of GSI, hopefully to keep things simpler.

TingLei-NOAA commented 8 months ago

A reduced version of global_4denvar was run with radiance obs removed. The obs setup in gsiparm.anl is as below

OBS_INPUT::
!  dfile          dtype       dplat       dsis                dval    dthin dsfcalc
   prepbufr       ps          null        ps                  0.0     0     0
   prepbufr       t           null        t                   0.0     0     0
   prepbufr_profl t           null        t                   0.0     0     0
   hdobbufr       t           null        t                   0.0     0     0
   prepbufr       q           null        q                   0.0     0     0
   prepbufr_profl q           null        q                   0.0     0     0
   hdobbufr       q           null        q                   0.0     0     0
   prepbufr       pw          null        pw                  0.0     0     0
   prepbufr       uv          null        uv                  0.0     0     0
   prepbufr_profl uv          null        uv                  0.0     0     0
   satwndbufr     uv          null        uv                  0.0     0     0
   hdobbufr       uv          null        uv                  0.0     0     0
   prepbufr       spd         null        spd                 0.0     0     0
   hdobbufr       spd         null        spd                 0.0     0     0
   prepbufr       dw          null        dw                  0.0     0     0
   radarbufr      rw          null        rw                  0.0     0     0

GSI showed similar behavior: loproc_contrl and loproc_updat give identical results, while the hiproc runs differ from the loproc runs and from each other. So the culprit does not seem to be specific to radiance observations.

TingLei-NOAA commented 8 months ago

Another "reduced" version of global_4denvar, in which only static B was used (namely, a 3DVar with FGAT), still showed the same behavior.

TingLei-NOAA commented 8 months ago

An interesting finding: when the global_4denvar test is run with a reduced setup as in the previous runs (only 2 global members used) and factqmin=factqmax=0 (namely, this constraint is turned off), the test does succeed (only with "Failure of max-time in the regression test"). So further digging on this issue could focus on the related code and steps.

TingLei-NOAA commented 8 months ago

Another update: with GSI built in debug mode, global_4denvar failed on Hera for the same reason as on WCOSS2, although GSI (both update and contrl) had not been updated to the current head of EMC GSI.

TingLei-NOAA commented 8 months ago

A modification within one OpenMP directive appears to have fixed the reproducibility issue observed between loproc and hiproc runs in the reduced version of global_4denvar (only 2 members and a maximum of 5 inner iteration steps). The changes can be reviewed at https://github.com/TingLei-daprediction/GSI/pull/2/files#diff-ff9860deeec140b2a1307734f3bf0ba00df64a66ae682aea121de529536926bf.
I will update the control and the PR to the current head of EMC GSI and see if GSI works as expected.

RussTreadon-NOAA commented 8 months ago

Great detective work! ii should clearly be declared private in threaded loop in subroutine intlimq (file intjcmod.f90).
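To illustrate why a scratch index that is not private to each thread can make results depend on thread timing, here is a toy replay of a fixed interleaving of two workers. It is a sketch only: it does not reproduce the code in intlimq, and the names `simulate`, `interleaved`, and `sequential` and the two-step loop body are hypothetical.

```python
def simulate(schedule, private):
    """Replay a fixed execution order for two workers.

    Worker w handles loop iteration i = w in two steps:
      step 0: ii = 10 * i      (compute a scratch index)
      step 1: result[i] = ii   (use the scratch index)
    schedule is a list of (worker, step) pairs. With private=False both
    workers write to one shared ii; with private=True each has its own.
    """
    shared_ii = None
    private_ii = {}
    result = {}
    for worker, step in schedule:
        i = worker
        if step == 0:
            if private:
                private_ii[worker] = 10 * i
            else:
                shared_ii = 10 * i
        else:
            result[i] = private_ii[worker] if private else shared_ii
    return result


# Worker 1 overwrites ii between worker 0's assignment and its use.
interleaved = [(0, 0), (1, 0), (0, 1), (1, 1)]
# Each worker finishes before the next one starts.
sequential = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(simulate(sequential, private=False))   # {0: 0, 1: 10}
print(simulate(interleaved, private=False))  # {0: 10, 1: 10}
print(simulate(interleaved, private=True))   # {0: 0, 1: 10}
```

With a shared ii the answer depends on the interleaving, which can change with the number of tasks and threads; with a private ii every interleaving gives the same result.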

TingLei-NOAA commented 8 months ago

An update on global_4denvar, with both update and control brought up to the current head of the EMC GSI develop branch. Both GSI executables were built in debug mode (for the control GSI, -init=snan was not used; otherwise the control run would fail because of the issue reported in read_nsstbufr.f90). As expected, loproc_contrl = loproc_updat and loproc_updat = hiproc_updat, but loproc_contrl != hiproc_contrl. @RussTreadon-NOAA Do you think I should open a separate PR for the changes mentioned in this issue for review, or could they go in the current PR https://github.com/NOAA-EMC/GSI/pull/698 ?

RussTreadon-NOAA commented 8 months ago

I suggest a separate PR for the intjcmod.f90 omp bug fix. PR #698 addresses a different problem.