Closed TingLei-NOAA closed 8 months ago
WCOSS2 test
The following has been done on Cactus:
develop at fca6bea4
BUILD_TYPE in ush/build.sh from Release to Debug
gsi.x and enkf.x in debug mode
global_4denvar in regression/regression_param.sh
ctest -R global_4denvar
global_4denvar_loproc_updat and global_4denvar_hiproc_updat ran to completion in debug mode. Neither job hangs. The loproc job took 2835.414645 seconds to complete. The hiproc job took 1534.105594 seconds to complete.
Interestingly (and disturbingly), the initial gradient norms of the loproc and hiproc jobs differ in the 10th printed digit. The initial total penalties are identical in all 19 printed digits.
loproc
Initial cost function = 6.584168406980320578E+05
Initial gradient norm = 1.700751272515137998E+03
cost,grad,step,b,step? = 1 0 6.584168406980320578E+05 1.700751272515137998E+03 1.057429134927269976E+00 0.000000000000000000E+00 good
cost,grad,step,b,step? = 1 1 6.547802752899429761E+05 2.113894264916150860E+03 1.927405067994416132E+00 1.302727127739893298E+00 good
hiproc
Initial cost function = 6.584168406980320578E+05
Initial gradient norm = 1.700751274435556979E+03
cost,grad,step,b,step? = 1 0 6.584168406980320578E+05 1.700751274435556979E+03 1.057429135803278131E+00 0.000000000000000000E+00 good
cost,grad,step,b,step? = 1 1 6.547802752834755229E+05 2.113894268161601758E+03 1.927405066764274588E+00 1.302727128193786887E+00 good
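The digit counts quoted above can be checked mechanically. The following is a minimal sketch that compares two printed values digit by digit; the helper names are hypothetical and the values are copied from the loproc and hiproc output above.

```python
from typing import Optional


def mantissa_digits(value: str) -> str:
    """Return the printed significant digits of a value such as 1.70E+03."""
    mantissa, _, _ = value.upper().partition("E")
    return mantissa.replace(".", "").lstrip("+-")


def first_differing_digit(a: str, b: str) -> Optional[int]:
    """Return the 1-based position of the first differing printed digit.

    Returns None when all printed digits agree. Assumes both values are
    printed with the same exponent and the same number of digits.
    """
    if a.upper().partition("E")[2] != b.upper().partition("E")[2]:
        raise ValueError("exponents differ; digit positions are not comparable")
    for pos, (x, y) in enumerate(zip(mantissa_digits(a), mantissa_digits(b)), start=1):
        if x != y:
            return pos
    return None


# Values copied from the log lines above.
lo_cost = "6.584168406980320578E+05"
hi_cost = "6.584168406980320578E+05"
lo_grad = "1.700751272515137998E+03"
hi_grad = "1.700751274435556979E+03"

print("cost:", first_differing_digit(lo_cost, hi_cost))
print("grad:", first_differing_digit(lo_grad, hi_grad))
```

The same comparison could be applied to any pair of cost,grad lines from two runs to find where they first diverge.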
Differences in WCOSS2 results were observed in PR #616 and #692. Refactoring code yielded reproducible results with respect to the control. Now loproc and hiproc debug runs demonstrate lack of reproducibility.
The WCOSS2 build uses hpc-stack with an older version of the Intel compiler. The GSI builds on other platforms use spack-stack modules and newer Intel compilers. Are we dealing with a compiler or module issue on WCOSS2? Would repeating the above test on other platforms yield non-reproducible loproc and hiproc results?
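As general background, floating-point addition is not associative, so a sum that is split differently across tasks can change in its low-order digits unless the code is written to be decomposition independent. The sketch below is only a generic illustration with made-up values and a hypothetical chunked_sum helper; it is not a diagnosis of the loproc/hiproc difference reported here.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point sum (explicit loop, no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, nchunks):
    """Sum values as nchunks partial sums, then add the partial sums.

    This mimics a reduction whose grouping depends on the task count.
    """
    n = len(values)
    bounds = [n * k // nchunks for k in range(nchunks + 1)]
    partials = [naive_sum(values[bounds[k]:bounds[k + 1]]) for k in range(nchunks)]
    return naive_sum(partials)


# Made-up values chosen so that rounding is visible.
values = [1.0e16, 1.0, -1.0e16, 1.0]

print("1 chunk :", chunked_sum(values, 1))
print("2 chunks:", chunked_sum(values, 2))
print("exact   :", math.fsum(values))
```

The two groupings give different answers for the same input, which is why a reproducibility test across task counts is sensitive to how reductions are ordered.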
@RussTreadon-NOAA Thanks. I will dig further into a few questions I still have and come back with an update. The global_4denvar failure due to non-reproducible results (updat vs contrl) is reported in https://github.com/NOAA-EMC/GSI/pull/679#issuecomment-1992332298.
@xincjin-NOAA's PR #692 yields reproducible global_4denvar results on WCOSS2 (Cactus). PR #692 is now the head of develop.
PR #692 updates the global_4denvar case date to include gmi data (monitored, not assimilated). Repeating the above debug test on Cactus from the current f282a94 head of develop results in a floating invalid segmentation fault in read_ozone.f90.
GSI develop at f282a94 seg faults in read_ozone.f90 due to an inconsistency between the mnemonics GSI uses to read the GOME bufr dump file and the mnemonics actually encoded in the file.
The DOYR mnemonic was replaced by MNTH DAYS effective 20240131 18Z. The GOME reader in read_ozone.f90 needs to be updated accordingly. This was done in a working copy of develop on Cactus. After this change gsi.x ran to completion in debug mode in 6474.196832 seconds.
Issue #716 was opened to document addition of the required changes to read_ozone.f90.
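To make the mnemonic change concrete, here is a minimal sketch of the date logic involved. It is not the read_ozone.f90 code: the helper names are hypothetical, the changeover is assumed to be inclusive at 18Z, and only the DOYR, MNTH and DAYS mnemonics named above are used.

```python
from datetime import datetime, timedelta, timezone

# Changeover quoted in this thread: 20240131 18Z (assumed inclusive).
CHANGEOVER = datetime(2024, 1, 31, 18, tzinfo=timezone.utc)


def gome_date_mnemonics(dump_time):
    """Return the day-related mnemonics expected in a GOME dump (hypothetical helper)."""
    if dump_time >= CHANGEOVER:
        return ("MNTH", "DAYS")
    return ("DOYR",)


def doyr_to_month_day(year, doyr):
    """Convert a day of year (1-based) to (month, day) for the given year."""
    date = datetime(year, 1, 1) + timedelta(days=doyr - 1)
    return date.month, date.day


before = datetime(2024, 1, 31, 12, tzinfo=timezone.utc)
after = datetime(2024, 2, 1, 0, tzinfo=timezone.utc)
print(gome_date_mnemonics(before), gome_date_mnemonics(after))
print(doyr_to_month_day(2024, 60))
```

A reader that must handle dumps from both sides of the changeover would need some form of this branch; a reader that only handles current data can simply switch to the new mnemonics.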
By adding extra debug compiler options (-init=snan,arrays), an apparent misuse of variables was found in read_nsstbufr.f90. To document this fix, a draft PR was created on my fork: https://github.com/TingLei-daprediction/GSI/pull/2.
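The idea behind NaN initialization can be sketched outside Fortran: storage is pre-filled with NaN so that any element used before it is assigned poisons the result instead of silently contributing leftover memory. The snippet below is an analogy only, with hypothetical helper names and a deliberately planted bug; Python NaNs are quiet, so the check is explicit rather than a trap raised by the runtime.

```python
import math


def allocate_poisoned(n):
    """Return a list of n floats pre-filled with NaN (stand-in for snan init)."""
    return [math.nan] * n


def find_unset(values):
    """Return the indices that still hold the NaN fill value."""
    return [i for i, v in enumerate(values) if math.isnan(v)]


work = allocate_poisoned(5)
for i in range(4):  # deliberate bug: the last element is never assigned
    work[i] = float(i)

total = 0.0
for v in work:
    total += v

print("total is NaN:", math.isnan(total))
print("unset indices:", find_unset(work))
```

With zero or garbage initialization the same bug would produce a plausible-looking total, which is why this kind of error can go unnoticed in a non-debug build.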
After this fix was applied to both control and update in the global_4denvar test, all runs finished "in time". loproc_contrl and loproc_updat produce identical results, but hiproc_contrl and hiproc_updat give different results. The differences first appear in the initial gradient.
@ADCollard @emilyhcliu @XuLi-NOAA Would you please confirm/correct the fix in read_nsstbufr.f90 shown under "changed files" in the above draft PR?
BTW: I haven't updated my GSI copies (both control and update) to the current HEAD of GSI, to hopefully keep things simpler.
A reduced version of global_4denvar was run with radiance obs removed. The obs setup in gsiparm.anl is as follows:
OBS_INPUT::
! dfile dtype dplat dsis dval dthin dsfcalc
prepbufr ps null ps 0.0 0 0
prepbufr t null t 0.0 0 0
prepbufr_profl t null t 0.0 0 0
hdobbufr t null t 0.0 0 0
prepbufr q null q 0.0 0 0
prepbufr_profl q null q 0.0 0 0
hdobbufr q null q 0.0 0 0
prepbufr pw null pw 0.0 0 0
prepbufr uv null uv 0.0 0 0
prepbufr_profl uv null uv 0.0 0 0
satwndbufr uv null uv 0.0 0 0
hdobbufr uv null uv 0.0 0 0
prepbufr spd null spd 0.0 0 0
hdobbufr spd null spd 0.0 0 0
prepbufr dw null dw 0.0 0 0
radarbufr rw null rw 0.0 0 0
Similar behavior was found: loproc_contrl and loproc_updat show identical results, while the hiproc runs differ from the loproc ones and from each other. So the culprit does not seem to be specific to radiance observations.
Another "reduced" version of global_4denvar, in which only static B was used (namely, a 3DVar with fgat), still showed the same behavior.
An interesting finding: when the global_4denvar test is run with a reduced setup as in the previous runs (only 2 global members were used) and factqmin=factqmax=0 (namely, this constraint is turned off), the test does succeed (only with "Failure of max-time in the regression test"). So further digging on this issue could focus on the related code/steps.
Another update: with GSI built in debug mode, global_4denvar failed on Hera for the same reason as on WCOSS2, though GSI (both updat and contrl) hasn't been updated to the current head of EMC GSI.
A modification within one OpenMP directive appears to have addressed the reproducibility issue observed between loproc and hiproc runs in the reduced version of global_4denvar (using only 2 members and a maximum of 5 inner iteration steps).
The changes can be reviewed at https://github.com/TingLei-daprediction/GSI/pull/2/files#diff-ff9860deeec140b2a1307734f3bf0ba00df64a66ae682aea121de529536926bf.
I will update the control and the PR to the current head of EMC GSI and see if GSI works as expected.
An update on global_4denvar, with update and control both updated to the current head of the EMC GSI develop branch. Both GSI executables are built in debug mode (but for the control GSI, -init=snan was not used; otherwise the control run would fail as reported for the issue in read_nsstbufr.f90). As expected, loproc_contrl = loproc_updat and loproc_updat = hiproc_updat, but loproc_contrl != hiproc_contrl. @RussTreadon-NOAA Do you think I should open a separate PR for review of the changes mentioned in this issue, or could they go in the current PR https://github.com/NOAA-EMC/GSI/pull/698 ?
I suggest a separate PR for the intjcmod.f90 omp bug fix. PR #698 addresses a different problem.
On wcoss2, when GSI is built with the debug mode, GSI would become idle and the job would finally be killed for , like ,
the error message would show:
The line 311 in genstat_gps.f90 is
The reason for GSI hanging at this point needs to be investigated. Added on Mar. 15, 2024: another issue was found, namely that loproc_updat != hiproc_updat, loproc_contrl != hiproc_contrl, and hiproc_updat != hiproc_contrl; only loproc_contrl = loproc_updat.