NOAA-EMC / GDASApp

Global Data Assimilation System Application
GNU Lesser General Public License v2.1

increase tolerance for lgetkf.x reference check #1148

Closed: RussTreadon-NOAA closed this issue 4 weeks ago

RussTreadon-NOAA commented 4 weeks ago

test_gdasapp_atm_jjob_ens_run using GDASApp develop at 825f19c (update JEDI hashes) fails on Hercules.  This test passes on Hera and Orion.

The Hercules failure is due to the reference test after lgetkf runs.

0: OOPS_STATS Run end                                  - Runtime:    456.38 sec,  Memory: total:    22.59 Gb, per task: min =     3.36 Gb, max =     4.06 Gb
0: Run: Finishing oops::LocalEnsembleDA<FV3JEDI, UFO and IODA observations> with status = 0
0: terminate called after throwing an instance of 'oops::TestReferenceFloatMismatchError'
0:   what():  Test reference Float mismatch @ Line:149
0: Test Val : 3.9113397703941590e-04
0: Ref  Val : 3.9113337546012496e-04
0: Delta    : 6.0157929093924978e-10
0: Relative tolerance: 3.9113367624977039e-10
0: Absolute tolerance: 0.0000000000000000e+00
0: Test Line: 'cloud_liquid_ice                             | Min:+0.0000000000000000e+00 Max:+3.9113397703941590e-04 RMS:+1.0484802023479406e-05'
0: Ref Line : 'cloud_liquid_ice                             | Min:+0.0000000000000000e+00 Max:+3.9113337546012496e-04 RMS:+1.0484801913924773e-05'
srun: error: hercules-07-15: task 0: Aborted (core dumped)

The input yaml ends with

test:
  reference filename: /work2/noaa/da/rtreadon/git/global-workflow/pr2641_hercules/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
  test output filename: ./lgetkf.out
  float relative tolerance: 1e-06
  float absolute tolerance: 0.0
  integer tolerance: 0

Increasing float relative tolerance to 1e-05 allows the reference check to pass.
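As a sanity check on the numbers in the log, the sketch below recomputes the comparison. It assumes the check scales the configured relative tolerance by the mean of the test and reference values; this is inferred from the printed "Relative tolerance" line (3.9113367624977039e-10 is 1e-06 times that mean), not taken from the OOPS source. The function name `passes` is purely illustrative.

```python
# Values copied from the Hercules log above.
test_val = 3.9113397703941590e-04
ref_val = 3.9113337546012496e-04

delta = abs(test_val - ref_val)
mean_val = 0.5 * (test_val + ref_val)

# Relative difference between the two values, independent of any tolerance.
rel_diff = delta / mean_val


def passes(rel_tol):
    """Assumed rule (inferred from the log): delta must not exceed rel_tol * mean."""
    return delta <= rel_tol * mean_val


print(f"delta               = {delta:.6e}")
print(f"relative difference = {rel_diff:.3e}")
for tol in (1e-06, 1e-05):
    print(f"rel_tol={tol:g}: threshold={tol * mean_val:.6e} pass={passes(tol)}")
```

Under this assumption the relative difference is roughly 1.5e-06, which is just above 1e-06 and well below 1e-05, consistent with the failure at 1e-06 and the pass at 1e-05.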

1e-06 works on Orion and Hera. Test test_gdasapp_atm_jjob_ens_run does not yet run on WCOSS2. It is possible that a larger float relative tolerance is needed on WCOSS2.

RussTreadon-NOAA commented 4 weeks ago

I repeated this test on Cactus. test_gdasapp_atm_jjob_ens_run passes the reference check on Cactus with float relative tolerance=1e-06.

OOPS_STATS Run end                                  - Runtime:    415.42 sec,  Memory: total:    10.94 Gb, per task: min =     1.41 Gb, max =     2.11 Gb
Run: Finishing oops::LocalEnsemblnid002305.cactus.wcoss2.ncep.noaa.gov 0: eDA<FV3JEDI, UFO and IODA observations> with status = 0
nid002305.cactus.wcoss2.ncep.noaa.gov 0: [TestReference] Comparison is done
OOPS Ending   2024-06-06 17:26:21 (UTC+0000)
Application 9072bd70-3674-4906-baf9-4a6f7343b9f6 resources: utime=2394s stime=47s maxrss=2064092KB inblock=1975532 oublock=2299120 minflt=22567728 majflt=268 nvcsw=44485 nivcsw=1010
2024-06-06 17:26:21,374 - INFO     - atmens_analysis:   END: pygfs.task.atmens_analysis.letkf
2024-06-06 17:26:21,375 - DEBUG    - atmens_analysis:  returning: None
+ 134467411.cbqs01.SC[21]: status=0
+
RussTreadon-NOAA commented 4 weeks ago

@DavidNew-NOAA , what do you think? Should we increase float relative tolerance to 1e-05 in order to get test_gdasapp_atm_jjob_ens_run to pass on all supported machines?

One thing which bothers me is why we need to increase the tolerance by an order of magnitude on Hercules. The var test passes on Hercules with 1e-06. 1e-06 works as the tolerance for the ens test on other supported machines. Hercules is the outlier for the ens test. Why?

DavidNew-NOAA commented 4 weeks ago

@RussTreadon-NOAA I have float relative tolerance as 1e-03 and float absolute tolerance as 1e-05 for test_gdasapp_atm_jjob_ens_run and test_gdasapp_atm_jjob_var_run. Could you clarify?

RussTreadon-NOAA commented 4 weeks ago

Thank you @DavidNew-NOAA for your question. This prompted me to look more closely at our jcb files.

parm/jcb-algorithms/local_ensemble_da.yaml.j2 contains

  float relative tolerance: {{test_float_relative_tolerance | default(1.0e-6, true)}}
  float absolute tolerance: {{test_float_absolute_tolerance | default(0.0, true) }}
  integer tolerance: {{test_integer_tolerance | default(0, true) }}

test/atm/global-workflow/jcb-prototype_lgetkf.yaml.j2 contains

# Testing things
# --------------
test_reference_filename: {{ HOMEgfs }}/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
test_output_filename: ./lgetkf.out
float_relative_tolerance: 1.0e-3
float_absolute_tolerance: 1.0e-5

Note that the float keywords above do not include the test_ prefix, so they never match the names the template looks up. Thus the ens_init job winds up using the default values of 1e-06 and 0.0 when creating the input yaml for the ens_run job.
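The mismatch can be illustrated with a minimal plain-Python stand-in for the template lookup. This is a sketch only: it does not call Jinja2 or jcb, and `resolve` and `unprefixed_keys` are hypothetical helpers. It assumes `default(value, true)` behaves as "use the default when the name is missing or falsy".

```python
# Names the template looks up, with the defaults from local_ensemble_da.yaml.j2.
TEMPLATE_DEFAULTS = {
    "test_float_relative_tolerance": 1.0e-6,
    "test_float_absolute_tolerance": 0.0,
    "test_integer_tolerance": 0,
}


def resolve(context):
    """Stand-in for `{{ name | default(value, true) }}` applied to each name."""
    return {
        name: (context.get(name) or default)
        for name, default in TEMPLATE_DEFAULTS.items()
    }


def unprefixed_keys(context, prefix="test_"):
    """Hypothetical check: keys that would match a template name if prefixed."""
    return [key for key in context if prefix + key in TEMPLATE_DEFAULTS]


# Keys as written in jcb-prototype_lgetkf.yaml.j2 (no test_ prefix).
before = {"float_relative_tolerance": 1.0e-3, "float_absolute_tolerance": 1.0e-5}
# The same values with the test_ prefix added.
after = {"test_" + key: value for key, value in before.items()}

print(resolve(before))          # falls back to the template defaults
print(unprefixed_keys(before))  # flags the two silently ignored keys
print(resolve(after))           # picks up 1.0e-3 and 1.0e-5
```

A check along the lines of `unprefixed_keys` could flag this kind of silently ignored override, since an unmatched key produces no error on its own.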

I added the prefix test_ to the float_ keywords in jcb-prototype_lgetkf.yaml.j2 and reran test_gdasapp_atm_jjob_ens_init. Now I see the desired values in enkfgdas.t18z.atmens.yaml

test:
  reference filename: /work2/noaa/da/rtreadon/git/global-workflow/pr2641_hercules/sorc/gdas.cd/test/atm/global-workflow/lgetkf.ref
  test output filename: ./lgetkf.out
  float relative tolerance: 0.001
  float absolute tolerance: 1e-05
  integer tolerance: 0

Which way did you intend? Do we want users to override the default tolerances via keywords starting with test_, or should we drop test_ from the template and set the float_ keywords?

DavidNew-NOAA commented 4 weeks ago

@RussTreadon-NOAA Ah, yes, nice catch. They should match, so we can change the jcb prototypes for the jjob test to use test_float_relative_tolerance and test_float_absolute_tolerance.

RussTreadon-NOAA commented 4 weeks ago

Resolved by #1154