RussTreadon-NOAA closed 5 months ago
Repeat this test on Cactus.
test_gdasapp_atm_jjob_ens_run passes the reference check on Cactus with float relative tolerance=1e-06
OOPS_STATS Run end - Runtime: 415.42 sec, Memory: total: 10.94 Gb, per task: min = 1.41 Gb, max = 2.11 Gb
Run: Finishing oops::LocalEnsemblnid002305.cactus.wcoss2.ncep.noaa.gov 0: eDA<FV3JEDI, UFO and IODA observations> with status = 0
nid002305.cactus.wcoss2.ncep.noaa.gov 0: [TestReference] Comparison is done
OOPS Ending 2024-06-06 17:26:21 (UTC+0000)
Application 9072bd70-3674-4906-baf9-4a6f7343b9f6 resources: utime=2394s stime=47s maxrss=2064092KB inblock=1975532 oublock=2299120 minflt=22567728 majflt=268 nvcsw=44485 nivcsw=1010
2024-06-06 17:26:21,374 - INFO - atmens_analysis: END: pygfs.task.atmens_analysis.letkf
2024-06-06 17:26:21,375 - DEBUG - atmens_analysis: returning: None
+ 134467411.cbqs01.SC[21]: status=0
+
@DavidNew-NOAA, what do you think? Should we increase float relative tolerance to 1e-05 in order to get test_gdasapp_atm_jjob_ens_run to pass on all supported machines?
One thing which bothers me is why we need to increase the tolerance by an order of magnitude on Hercules. The var test passes on Hercules with 1e-06. 1e-06 works as the tolerance for the ens test on other supported machines. Hercules is the outlier for the ens test. Why?
@RussTreadon-NOAA I have float relative tolerance as 1e-03 and float absolute tolerance at 1e-05 for test_gdasapp_atm_jjob_ens_run and test_gdasapp_atm_jjob_var_run. Could you clarify?
Thank you @DavidNew-NOAA for your question. This prompted me to look more closely at our jcb files.
parm/jcb-algorithms/local_ensemble_da.yaml.j2 contains
float relative tolerance: {{test_float_relative_tolerance | default(1.0e-6, true)}}
float absolute tolerance: {{test_float_absolute_tolerance | default(0.0, true) }}
integer tolerance: {{test_integer_tolerance | default(0, true) }}
test/atm/global-workflow/jcb-prototype_lgetkf.yaml.j2 contains
# Testing things
# --------------
test_reference_filename: {{ HOMEgfs }}/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
test_output_filename: ./lgetkf.out
float_relative_tolerance: 1.0e-3
float_absolute_tolerance: 1.0e-5
Note that the float keywords above do not include the test_ prefix. Thus the ens_init job winds up using the default values of 1e-06 and 0.0 when creating the input yaml for the ens_run job.
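The fallback described above can be sketched in a few lines of Python. This is only an illustration and does not use Jinja or jcb: jinja_default is a hypothetical helper that mimics the behavior of Jinja's default(value, true) filter (the fallback is used when the variable is undefined or falsy), and the two dictionaries are assumed stand-ins for the keys set in the prototype yaml.

```python
def jinja_default(context, name, fallback):
    """Hypothetical stand-in for Jinja's `name | default(fallback, true)`."""
    value = context.get(name)
    # With boolean=true, Jinja also replaces falsy values, not just undefined ones
    return value if value else fallback


def render_tolerances(context):
    """Resolve the three tolerance entries the way the algorithm template does."""
    return {
        "float relative tolerance": jinja_default(
            context, "test_float_relative_tolerance", 1.0e-6),
        "float absolute tolerance": jinja_default(
            context, "test_float_absolute_tolerance", 0.0),
        "integer tolerance": jinja_default(
            context, "test_integer_tolerance", 0),
    }


# Keys without the test_ prefix: the template never sees them
unprefixed = {"float_relative_tolerance": 1.0e-3,
              "float_absolute_tolerance": 1.0e-5}

# Keys with the test_ prefix: the template picks them up
prefixed = {"test_float_relative_tolerance": 1.0e-3,
            "test_float_absolute_tolerance": 1.0e-5}

print(render_tolerances(unprefixed))
print(render_tolerances(prefixed))
```

With the unprefixed keys the lookup silently falls back to the template defaults, which is why the mismatch produced no error.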
I added the prefix test_ to the float_ keywords in jcb-prototype_lgetkf.yaml.j2 and reran test_gdasapp_atm_jjob_ens_init. Now I see the desired values in enkfgdas.t18z.atmens.yaml
test:
reference filename: /work2/noaa/da/rtreadon/git/global-workflow/pr2641_hercules/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
test output filename: ./lgetkf.out
float relative tolerance: 0.001
float absolute tolerance: 1e-05
integer tolerance: 0
Which way was your intention? Do we want users to override default tolerances via keywords starting with test_ or drop test_ and set the float_ keywords?
@RussTreadon-NOAA Ah, yes, nice catch. They should match, so we can change the jcb prototypes for the jjob test to be test_float_relative_tolerance and test_float_absolute_tolerance
Resolved by #1154
test_gdasapp_atm_jjob_ens_run using GDASApp develop at 825f19c (update JEDI hashes) fails on Hercules. This test passes on Hera and Orion. The Hercules failure is due to the reference test after lgetkf runs.
The input yaml ends with
Increasing float relative tolerance to 1e-05 allows the reference check to pass. 1e-06 works on Orion and Hera. Test test_gdasapp_atm_jjob_ens_run does not yet run on WCOSS2. It is possible that a larger float relative tolerance is needed on WCOSS2.
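For readers unfamiliar with how a relative tolerance decides pass or fail, here is a minimal Python sketch. It assumes a common form of the check (the difference must be within either the absolute tolerance or the relative tolerance scaled by the reference value); the actual OOPS TestReference comparison may differ in detail. The function within_tolerance and the sample numbers are hypothetical and are not taken from the Hercules output.

```python
def within_tolerance(value, reference, rel_tol, abs_tol=0.0):
    """Assumed check: pass if the difference satisfies either tolerance."""
    diff = abs(value - reference)
    return diff <= abs_tol or diff <= rel_tol * abs(reference)


# Illustrative numbers only: a relative difference of about 5e-06
reference = 1.0
value = 1.000005

print(within_tolerance(value, reference, rel_tol=1e-06))  # fails the tighter check
print(within_tolerance(value, reference, rel_tol=1e-05))  # passes the looser check
```

A difference of this size would fail at 1e-06 and pass at 1e-05, which matches the pattern reported for Hercules, though it says nothing about why Hercules differs from the other machines.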