ACCESS-NRI / model-config-tests

Tests for checking model configurations
Apache License 2.0

Create framework for extracting, verifying and updating model reproducibility and performance #83

Open aidanheerdegen opened 2 years ago

aidanheerdegen commented 2 years ago

As a model component developer I want a tool to verify model reproducibility. As a model developer I want the same tool to work for all the components of the model.

As a release team member I want to be able to use the same tool when developing CI tests for a number of different models.

As a user of models I want to be confident that model updates will not change the answers of my experiments, unless this has been specifically documented.

aidanheerdegen commented 2 years ago

Existing tools

MartinDix commented 2 years ago

UM rose stem testing checks for bitwise reproducibility by comparing output files to Known Good Output (KGO) files on disk (e.g. in /g/data/access/KGO/standard_jobs) using mule-cumf (compares UM fieldsfiles).

Also compares Helmholtz solver statistics from model log files to those in KGO file.

***************************************************************
*    Linear solve for Helmholtz problem                       *
* Outer Inner Iterations   InitialError       FinalError      *
*    2     1        8      0.122268E+00      0.991602E-04     *
*    2     2        2      0.123264E-02      0.985385E-04     *
***************************************************************

These statistics are effectively a global checksum; even at this limited precision, any differences in a run show up within a few hours. Values could be stored in a database somewhere rather than extracted from the KGO.

There are also comparisons of execution time, extracted from model log lines such as `PE 0 Elapsed Wallclock Time: 51.48`, but in practice elapsed time for short test runs on gadi is too variable for this to be of much use.
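A check along these lines could be sketched as a small log parser. The regexes below are only illustrative, inferred from the log excerpts shown above; the real rose stem scripts may extract these values differently.

```python
import re

# Illustrative patterns, inferred from the Helmholtz solver table and
# wallclock line shown above (not the actual rose stem extraction code).
SOLVER_RE = re.compile(
    r"\*\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.E+-]+)\s+([\d.E+-]+)\s+\*")
WALLCLOCK_RE = re.compile(r"PE\s+0\s+Elapsed Wallclock Time:\s+([\d.]+)")

def parse_solver_stats(log_text):
    """Extract (outer, inner, iterations, initial_error, final_error) rows."""
    rows = []
    for m in SOLVER_RE.finditer(log_text):
        outer, inner, iters = (int(g) for g in m.groups()[:3])
        init_err, final_err = (float(g) for g in m.groups()[3:])
        rows.append((outer, inner, iters, init_err, final_err))
    return rows

def parse_wallclock(log_text):
    """Return the PE 0 elapsed wallclock time in seconds, or None."""
    m = WALLCLOCK_RE.search(log_text)
    return float(m.group(1)) if m else None
```

Comparing the parsed error values against stored KGO values (rather than re-reading them from KGO log files each time) would then be a simple equality check per row.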

JULES rose stem uses bitwise comparison of netCDF KGO files with nccmp.

LFRic rose stem uses a simple checksum (global sum of X²) of selected fields calculated during the run.
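A checksum in that spirit is trivial to sketch in stdlib Python (this is an illustration of the idea, not LFRic's actual implementation):

```python
def field_checksum(values):
    """Global sum of squares over a field, in the spirit of the
    LFRic-style check described above.

    The reduction order is fixed (simple left-to-right accumulation):
    floating-point addition is not associative, so a reproducible
    checksum must pin down the order of summation.
    """
    total = 0.0
    for v in values:
        total += v * v
    return total
```

In a parallel model the analogous concern is fixing the order of the global reduction across ranks, otherwise the checksum itself is not bitwise reproducible.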

aidanheerdegen commented 2 years ago

Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.

As for extracting performance information, it is likely this would be limited to runs of sufficient length to give useful data.

Do the UM, JULES/CABLE and LFRic have internal timing infrastructure that can be configured? For example, MOM (via FMS) has user-configurable clocks for recording timings with varying levels of granularity. In this way MOM5 (and other FMS models) can give useful timing information from short runs by isolating timings from different sections of the program.

MartinDix commented 2 years ago

The UM has internal timers which give inclusive timing at a high level, e.g. radiation. See the end of /g/data/access/KGO/standard_jobs/ifort20/gadi_intel_um_safe_n48_ga_amip_exp_2day/vn13.0/pe_output/atmos.fort6.pe0. Essentially the same as the MOM timers.

It also has an interface to the DrHook library from ECMWF which gives subroutine level timing via start and end calls in each routine. This can have a significant overhead and isn't used routinely. See /g/data/access/KGO/standard_jobs/ifort20/gadi_intel_um_drhook_safe_n48/vn13.0/drhook.prof.1.

With both of these, I've sometimes added timers around small blocks of code when optimising.

JULES has the DrHook interface but no separate timers. In UM timer output JULES will be included in the boundary layer.

LFRic has something similar to the UM timers.

aidanheerdegen commented 2 years ago

Of course there are also the tools Nic Hannah developed for testing ACCESS-OM2

https://github.com/COSIMA/access-om2/tree/master/test

and MOM5

https://github.com/mom-ocean/MOM5/tree/master/test

The MOM5 tests are run regularly on Jenkins, first MOM5_run

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_run/activity

which runs

module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
module load pbs && \
cd ${WORKSPACE}/test && \
nosetests --with-xunit -s test_run_setup.py && \
qsub qsub_tests.sh && \
nosetests --with-xunit -s test_run_outputs.py 

then repro MOM5_bit_reproducibility

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_bit_reproducibility/activity

module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
cd ${WORKSPACE}/test && \
nosetests -s --with-xunit test_bit_reproducibility.py

Similarly for ACCESS-OM2

The reproducibility test is run every week

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/ACCESS-OM2%2Freproducibility/activity

module use /g/data/hh5/public/modules && module load conda/analysis3-unstable && python -m pytest -s test/test_bit_reproducibility.py

Some other jobs have been set up to allow testing PRs, by hand editing the configuration and setting off the job

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/ACCESS-OM2%2Freproducibility_pull_request/activity

https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_PR/activity

dougiesquire commented 1 year ago

I've started trying to think about what this framework might look like. I've included some thoughts below, including a first draft of a potential design. I expect there are issues, but hopefully this will at least be helpful to form some discussion around.

General requirements/constraints

Reproducibility testing scope

→ Run reasonably rarely (e.g. only on test model configurations, not all model runs), with KGOs updated even more rarely

Performance testing scope

→ Some use cases run frequently (e.g. after every model run), keep track of and compare performance stats through time

Framework design (to get discussion started)

The simplest design is one that performs tests on output from models that have already been built and run. All the framework includes is a set of classes (one per model) that specify where/how to parse model-specific output files, and a test suite for comparing outputs to a database of benchmarks (KGOs, baseline performance metrics). This test suite could be run as a “postprocess” step using the preferred run tool for that model. This means tests are triggered by running model test configurations.
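As a very rough sketch of that per-model-parser design (all class, method and field names here are hypothetical, just to make the shape concrete):

```python
from abc import ABC, abstractmethod

class ModelOutputParser(ABC):
    """One subclass per model: knows where/how to parse that model's output."""

    @abstractmethod
    def extract_checksums(self, output_dir):
        """Return {field_name: checksum} parsed from the model's output files."""

class AccessOm2Parser(ModelOutputParser):
    """Hypothetical example subclass for one model."""

    def extract_checksums(self, output_dir):
        # Would parse checksums from this model's specific log/output files.
        ...

def check_reproducibility(parser, output_dir, known_good):
    """Compare freshly parsed checksums against stored KGO values.

    Returns {field_name: (current, kgo)} for every mismatch; an empty
    dict means the run reproduced the benchmarks.
    """
    current = parser.extract_checksums(output_dir)
    return {name: (current.get(name), kgo)
            for name, kgo in known_good.items()
            if current.get(name) != kgo}
```

A pytest-style test suite (matching the existing MOM5/ACCESS-OM2 tests) could then just assert that `check_reproducibility(...)` returns an empty dict.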

Thoughts

aidanheerdegen commented 1 year ago

The tool for extracting out repro hashes (or timing info) will always run.

So one option to detect changes is to use git: if the tool always writes to the same output files in your test data repo, git can tell you whether they've changed. Equally, all that is required to update them is git commit && git push. I'm not saying this is the best option, but it is one. A downside is that it isn't explicitly checking the semantic contents of the files, just whether they have changed at all. So perhaps that isn't suitable.
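The git-based check could be as simple as writing the extracted hashes into the data repo and asking git whether anything differs from the last commit (paths and layout here are illustrative):

```python
import subprocess

def repro_data_changed(repo_dir):
    """Return True if files under repo_dir differ from the last commit.

    `git status --porcelain` prints one line per modified or untracked
    file, so empty output means nothing has changed.
    """
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())
```

As noted, this only detects that something changed, not what changed semantically; `git diff` output would at least show which values moved.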

Using GitHub as a repro and performance data database has a lot going for it, not the least that others could use the same tools and store their own repro and performance data there. GitHub topics could be used to make model performance data discoverable, even from different institutions.

Picking on a few points

Do we need a github organisation for ACCESS testing?

As somewhere to store the repos containing the reproducibility and performance data?

Need to allow for schemas of KGOs/performance metrics to change through time.

Yes! I definitely endorse the idea of adding a schema version as metadata so that this can be queried and gracefully handled, e.g. if the format is updated and new fields are added, older versions can be read in and written back out with the new fields filled in where appropriate.
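A versioned record with forward migration might look something like this (the field names and the v1→v2 change are purely hypothetical):

```python
import json

SCHEMA_VERSION = 2

def migrate(record):
    """Upgrade an older KGO/performance record to the current schema."""
    version = record.get("schema_version", 1)
    if version < 2:
        # Hypothetical v2 change: a per-field checksum mapping replaced
        # the single top-level checksum of v1.
        record["checksums"] = {"global": record.pop("checksum", None)}
    record["schema_version"] = SCHEMA_VERSION
    return record

def load_record(text):
    """Read a JSON record, migrating old schema versions on the fly."""
    return migrate(json.loads(text))
```

Storing the records as JSON in the same git repo as the hashes would keep the schema version, the data and its history all in one queryable place.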

dougiesquire commented 1 year ago

The tool for extracting out repro hashes (or timing info) will always run.

I'm not sure I understand exactly what you mean by this, but I like the idea of using git to check/update changes, at least as a first pass.

As somewhere to store the repos containing the reproducibility and performance data?

Those and the test configurations (e.g. Payu configs, though I'm not sure how this would work for the rose/cylc stuff)

aidanheerdegen commented 1 year ago

When I say the tool for extracting the info will always run: it is a very cheap process, so I'd imagine it always being done, whether or not the reproducibility status is actually being checked. In a way this is passive reproducibility/provenance. The data is always generated and only sometimes actively checked, but it would always be in the git repo, say, so could be queried at a later date if required, e.g. forensic analysis to check how and when answers changed.
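That always-on extraction step could be as little as hashing the output files into a manifest that gets committed after each run (file layout and names here are illustrative):

```python
import hashlib
from pathlib import Path

def write_manifest(output_dir, manifest_path):
    """Record a SHA-256 digest per output file, in sorted path order.

    Committing this manifest to git after every run gives the passive
    provenance trail described above: history is always captured, and
    checking reproducibility is optional and can happen later.
    """
    lines = []
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(output_dir)}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")
```

The sorted traversal keeps the manifest itself deterministic, so `git diff` on it directly answers "which files changed, and when".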

dougiesquire commented 1 year ago

FYI, I'm starting to flesh out something here:

https://github.com/dougiesquire/morte

aidanheerdegen commented 1 year ago

Love the name. Conjures up post-mortem and of course ..


micaeljtoliveira commented 1 year ago

After having a chat with @aidanheerdegen regarding some scaling tests I've been running with MOM6, I realized you might be interested in some experience I had with a project I worked on in my previous life.

The project aimed to extract all the possible information from calculations performed with a wide range of codes (>50), store all that information in a database, and develop tools that make use of that data (e.g., machine learning, data mining, etc.). (If you're curious, you can have a look at it here and the corresponding code here.)

Happy to share some thoughts on how to best structure the code base, how to define metadata specifications, how to write model parsers, etc.

dougiesquire commented 1 year ago

Thanks @micaeljtoliveira - this looks like a really interesting (and big) project! Your thoughts and experience would be really valuable here. Maybe easiest to start with a chat in person and go from there? I'll be in Canberra next week.

I'm also interested to hear about what you've been doing with MOM6, as performance testing is bundled in with what we're trying to achieve here.

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/porting-csiro-umui-access-esm1-5-ksh-run-script-to-payu/1611/3

penguian commented 9 months ago

Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.

I tried running the current mule-cumf to check the restart dump produced by my ESM1.5 runs. The archive.orig run uses the original coe executables, and the archive.build-gadi.1 run uses executables built using https://github.com/penguian/access-esm-build-gadi .

[pcl851@gadi-login-06 access-esm]$ diff -U0 archive.orig/access-esm/output000/config.yaml archive.build-gadi.1/access-esm/output000/config.yaml
--- archive.orig/access-esm/output000/config.yaml   2024-02-02 11:16:20.000000000 +1100
+++ archive.build-gadi.1/access-esm/output000/config.yaml   2024-02-05 11:05:24.000000000 +1100
@@ -13 +13 @@
-      exe: /g/data/access/payu/access-esm/bin/coe/um7.3x
+      exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/um_hg3.exe
@@ -20 +20 @@
-      exe: /g/data/access/payu/access-esm/bin/coe/mom5xx
+      exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/mom5xx
@@ -28 +28 @@
-      exe: /g/data/access/payu/access-esm/bin/coe/cicexx
+      exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/cice4.1_access-mct-12p-20240205

mule-cumf reports that the files from both runs fail validation checks. Is there an earlier version of cumf that can be used to check the output of UM 7.3?

$ mule-cumf archive.orig/access-esm/restart000/atmosphere/restart_dump.astart archive.build-gadi.1/access-esm/restart000/atmosphere/restart_dump.astart 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
* (CUMF-II) Module Information *
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

mule       : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py (version 2022.07.1)
um_utils   : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/__init__.py (version 2022.07.1)
um_packing : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_packing/__init__.py (version 2022.07.1) (packing lib from SHUMlib: 2023061)

/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: archive.orig/access-esm/restart000/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid latitudes inconsistent (STASH grid: 23)
  File            : 145 points from -90.0, spacing 1.25
  Field (Expected): 180 points from -89.5, spacing 1.25
  Field (Lookup)  : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4099,4101,5484,5523)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning: 
File: archive.build-gadi.1/access-esm/restart000/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
Field grid latitudes inconsistent (STASH grid: 23)
  File            : 145 points from -90.0, spacing 1.25
  Field (Expected): 180 points from -89.5, spacing 1.25
  Field (Lookup)  : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4099,4101,5484,5523)
Skipping Field validation due to irregular lbcode: 
  Field lbcode: 31320
  warnings.warn(msg)
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/mule-cumf", line 10, in <module>
    sys.exit(_main())
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 1385, in _main
    comparison = UMFileComparison(um_files[0], um_files[1])
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 728, in __init__
    diff_field = difference_op([field_1, field_2])
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py", line 952, in __call__
    new_field = self.new_field(source, *args, **kwargs)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 293, in new_field
    data1 = fields[0].get_data()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py", line 730, in get_data
    data = self._data_provider._data_array()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/ff.py", line 193, in _data_array
    data = np.fromstring(data_bytes, dtype, count=count)
ValueError: string is smaller than requested size

access-hive-bot commented 9 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/1

access-hive-bot commented 9 months ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/3

penguian commented 1 week ago

@aidanheerdegen Has ACCESS-NRI decided on how to use rose stem tests to test executable reproducibility? Would this be part of an approach to test model reproducibility?

aidanheerdegen commented 1 week ago

@aidanheerdegen Has ACCESS-NRI decided on how to use rose stem tests to test executable reproducibility? Would this be part of an approach to test model reproducibility?

No decision has been made. I need to read some documentation about rose stem testing, but if you had an example suite for testing on NCI that I could look at, that would be helpful.

penguian commented 1 week ago

@MartinDix uses rose stem to test each UM release. I believe that this rose stem test is just a subset of the tests in https://code.metoffice.gov.uk/trac/um/browser/main/trunk/rose-stem

MartinDix commented 1 week ago

The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted.

penguian commented 1 week ago

The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted.

That said, if we intend to contribute code changes upstream, we would probably also need corresponding rose stem tests.