aidanheerdegen opened this issue 2 years ago
Existing tools
- Specific tools developed by Paul Leopardi to extract performance stats for scaling testing: https://github.com/penguian/performance-analysis
- Andrew Kiss' tool for extracting information from ACCESS-OM2 runs to create a run summary: https://github.com/aekiss/run_summary
- Marshall Ward's tools for performance monitoring of MOM6: https://github.com/NOAA-GFDL/MOM6/tree/dev/gfdl/.testing/tools
UM rose stem testing checks for bitwise reproducibility by comparing output files to Known Good Output (KGO) files on disk (e.g. in `/g/data/access/KGO/standard_jobs`) using `mule-cumf` (which compares UM fieldsfiles). It also compares Helmholtz solver statistics from model log files to those in the KGO file:
```
***************************************************************
*             Linear solve for Helmholtz problem              *
*   Outer  Inner  Iterations  InitialError    FinalError      *
*       2      1           8  0.122268E+00  0.991602E-04      *
*       2      2           2  0.123264E-02  0.985385E-04      *
***************************************************************
```
These are effectively a global checksum, and even with the limited precision any differences in a run show up within a few hours. The values could be stored in a database somewhere rather than extracted from the KGO.
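For illustration, extracting those values could be as simple as a regex pass over the log. This is only a sketch; the function name is mine and the regex assumes the fixed-width block format shown above:

```python
import re

# Matches rows like: "*  2  1  8  0.122268E+00  0.991602E-04  *"
SOLVER_LINE = re.compile(
    r"\*\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.E+-]+)\s+([\d.E+-]+)\s+\*"
)

def parse_helmholtz_stats(log_path):
    """Return a list of (outer, inner, iterations, initial_error, final_error)."""
    stats = []
    with open(log_path) as f:
        for line in f:
            m = SOLVER_LINE.match(line.strip())
            if m:
                outer, inner, iters = (int(g) for g in m.groups()[:3])
                init_err, final_err = (float(g) for g in m.groups()[3:])
                stats.append((outer, inner, iters, init_err, final_err))
    return stats
```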
There are also comparisons of execution time, by extracting the run time from model logfiles:

```
PE 0 Elapsed Wallclock Time: 51.48
```

but in practice elapsed time for short test runs on gadi is too variable for this to be much use.
JULES rose stem uses bitwise comparison of netCDF KGO files with nccmp.
LFRic rose stem uses a simple checksum (a global sum of X²) of selected fields, calculated during the run.
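For illustration, a minimal sketch of that style of checksum (the names are hypothetical, and the global reduction assumes an mpi4py communicator):

```python
import numpy as np

def field_checksum(local_field: np.ndarray, comm=None) -> float:
    """Global sum of X**2 over all ranks: a cheap, order-insensitive checksum."""
    local_sum = float(np.sum(local_field.astype(np.float64) ** 2))
    if comm is not None:
        # e.g. comm = mpi4py.MPI.COMM_WORLD for a distributed field
        from mpi4py import MPI
        return comm.allreduce(local_sum, op=MPI.SUM)
    return local_sum
```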
Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.
As for extracting performance information, it is likely this would be limited to runs of sufficient length to give useful data.
Do the UM, JULES/CABLE, and LFRic have internal timing infrastructure that can be configured? For example MOM (via FMS) has user-configurable clocks for recording timings with varying levels of granularity. In this way MOM5 (and other FMS models) can give useful timing information from short runs by isolating timings from different sections of the program.
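For illustration only (this is not the FMS API, just a Python sketch of the idea): user-configurable section timers that accumulate time per named code block, so even a short run yields per-section timings:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall time per named section, in seconds.
timings = defaultdict(float)

@contextmanager
def clock(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

# Usage (run_radiation_step is a hypothetical model routine):
# with clock("radiation"):
#     run_radiation_step()
```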
The UM has internal timers which give inclusive timing at a high level, e.g. radiation. See the end of `/g/data/access/KGO/standard_jobs/ifort20/gadi_intel_um_safe_n48_ga_amip_exp_2day/vn13.0/pe_output/atmos.fort6.pe0`. Essentially the same as the MOM timers.
It also has an interface to the DrHook library from ECMWF, which gives subroutine-level timing via start and end calls in each routine. This can have a significant overhead and isn't used routinely. See `/g/data/access/KGO/standard_jobs/ifort20/gadi_intel_um_drhook_safe_n48/vn13.0/drhook.prof.1`.
With both of these, I've sometimes added timers around small blocks of code when optimising.
JULES has the DrHook interface but no separate timers. In UM timer output JULES will be included in the boundary layer.
LFRic has something similar to the UM timers.
Of course there are also the tools Nic Hannah developed for testing ACCESS-OM2 (https://github.com/COSIMA/access-om2/tree/master/test) and MOM5 (https://github.com/mom-ocean/MOM5/tree/master/test).
The MOM5 tests are run regularly on Jenkins. First MOM5_run
https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_run/activity
which runs

```sh
module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
module load pbs && \
cd ${WORKSPACE}/test && \
nosetests --with-xunit -s test_run_setup.py && \
qsub qsub_tests.sh && \
nosetests --with-xunit -s test_run_outputs.py
```
then the repro test MOM5_bit_reproducibility

```sh
module use /g/data3/hh5/public/modules && module load conda/analysis3-unstable && \
cd ${WORKSPACE}/test && \
nosetests -s --with-xunit test_bit_reproducibility.py
```
Similarly for ACCESS-OM2, the reproducibility test is run every week:

```sh
module use /g/data/hh5/public/modules && module load conda/analysis3-unstable && python -m pytest -s test/test_bit_reproducibility.py
```
Some other jobs have been set up to allow testing PRs, by hand editing the configuration and setting off the job
https://accessdev.nci.org.au/jenkins/blue/organizations/jenkins/mom-ocean.org%2FMOM5_PR/activity
I've started trying to think about what this framework might look like. I've included some thoughts below, including a first draft of a potential design. I expect there are issues, but hopefully this will at least be helpful to form some discussion around.
- Run reasonably rarely (e.g. only on test model configurations, not all model runs), with KGOs updated even more rarely
- Some use cases run frequently (e.g. after every model run), keeping track of and comparing performance stats through time
The simplest design is one that performs tests on output from models that have already been built and run. All the framework includes is a set of classes (one per model) that specify where/how to parse model-specific output files, and a test suite for comparing outputs to a database of benchmarks (KGOs, baseline performance metrics). This test suite could be run as a “postprocess” step using the preferred run tool for that model. This means tests are triggered by running model test configurations.
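To make that concrete, here is a rough sketch of what the class-per-model idea could look like. All names here are hypothetical, not an existing API:

```python
from abc import ABC, abstractmethod
from pathlib import Path

class ModelOutputParser(ABC):
    """One subclass per model: knows where/how to parse model-specific output."""

    @abstractmethod
    def extract_checksums(self, run_dir: Path) -> dict[str, str]: ...

    @abstractmethod
    def extract_timings(self, run_dir: Path) -> dict[str, float]: ...

def check_reproducibility(parser: ModelOutputParser, run_dir: Path,
                          kgo: dict[str, str]) -> list[str]:
    """Return the names of any fields whose checksums differ from the KGO."""
    current = parser.extract_checksums(run_dir)
    return [name for name, value in current.items() if kgo.get(name) != value]
```

The test suite itself would then be generic: it only sees dictionaries of checksums and timings, and the model-specific knowledge lives entirely in the parser subclasses.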
The tool for extracting out repro hashes (or timing info) will always run.
So one option to detect changes is to use `git`: if the extracted values are directed to the same output files in your test data repo, then `git` can tell you if they've changed. Equally, all that is required to update them is `git commit && git push`. I'm not saying this is the best option, but it is one. A downside is that it isn't explicitly checking the semantic contents of the files, just that they have changed at all. So perhaps that isn't suitable.
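For example, a minimal sketch of that git-based check (assuming the values file is already tracked in the test data repo; the function name is illustrative):

```python
import subprocess
from pathlib import Path

def values_changed(repo: Path, rel_path: str, content: str) -> bool:
    """Write the extracted values, then let git report whether they changed."""
    (repo / rel_path).write_text(content)
    # `git diff --quiet` exits non-zero if the tracked file differs.
    result = subprocess.run(["git", "diff", "--quiet", "--", rel_path], cwd=repo)
    return result.returncode != 0
```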
Using GitHub as a repro and performance database has a lot going for it, not least that others could use the same tools and store their own repro and performance data there. GitHub topics could be used to make model performance data discoverable, even from different institutions.
Picking up on a few points:
> Do we need a github organisation for ACCESS testing?

As somewhere to store the repos containing the reproducibility and performance data?
> Need to allow for schemas of KGOs/performance metrics to change through time.
Yes! I definitely endorse the idea of adding a schema version as metadata so that this can be queried and gracefully handled, e.g. if the format is updated and new fields added, older versions can be read in and written back out with the new fields added where appropriate.
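For example, a minimal sketch of how a versioned record could be read and migrated forward (the field names and version numbers here are assumptions, not a spec):

```python
import json

CURRENT_SCHEMA = 2  # hypothetical current schema version

def load_record(path):
    """Read a JSON record, upgrading older schema versions on the fly."""
    with open(path) as f:
        record = json.load(f)
    version = record.get("schema_version", 1)
    if version < 2:
        # Example migration: a field added in schema 2 gets a sensible default.
        record.setdefault("platform", "unknown")
    record["schema_version"] = CURRENT_SCHEMA
    return record
```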
> The tool for extracting out repro hashes (or timing info) will always run.
I'm not sure I understand exactly what you mean by this, but I like the idea of using git to check/update changes, at least as a first pass.
> As somewhere to store the repos containing the reproducibility and performance data?
Those and the test configurations (e.g. Payu configs, though I'm not sure how this would work for the rose/cylc stuff)
When I say the tool for extracting the info will always run: it is a very cheap process, so I'd imagine it always being done, whether or not the reproducibility status is actually being checked. In a way this is passive reproducibility/provenance. The data is always generated and only sometimes actively checked, but it would always be in the git repo, say, so it could be queried at a later date if required, e.g. forensic analysis to check how and when answers changed.
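As a sketch of what that passive step might look like, here is a hypothetical manifest generator that hashes every output file after a run; the manifest is what would be committed to the git repo:

```python
import hashlib
from pathlib import Path

def write_manifest(run_dir: Path, manifest: Path) -> None:
    """Record a SHA-256 digest for every file under run_dir."""
    lines = []
    for f in sorted(run_dir.rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(run_dir)}")
    manifest.write_text("\n".join(lines) + "\n")
```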
FYI, I'm starting to flesh out something here:
Love the name. Conjures up post-mortem and of course ..
After having a chat with @aidanheerdegen regarding some scaling tests I've been running with MOM6, I realized you might be interested in some experience I had with a project I worked on in my previous life.
The project aimed at extracting all the possible information from calculations performed with a wide range of codes (>50), storing all that information in a database, and developing some tools that make use of that data (e.g., machine learning, data mining, etc.). (If you're curious, you can have a look at it here and the corresponding code here.)
Happy to share some thoughts on how to best structure the code base, how to define metadata specifications, writing model parsers, etc.
Thanks @micaeljtoliveira - this looks like a really interesting (and big) project! Your thoughts and experience would be really valuable here. Maybe easiest to start with a chat in person and go from there? I'll be in Canberra next week.
I'm also interested to hear about what you've been doing with MOM6, as performance testing is bundled in with what we're trying to achieve here.
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/porting-csiro-umui-access-esm1-5-ksh-run-script-to-payu/1611/3
> Thanks @MartinDix. There is some documentation on using mule-cumf on the CLEX CMS Wiki. Are the code/scripts that call this and do these other comparisons available somewhere? Doesn't have to be exhaustive, just "best of breed" examples that give some indication of what would be required to copy/emulate.
I tried running the current `mule-cumf` to check the restart dump produced by my ESM1.5 runs. The `archive.orig` run uses the original `coe` executables and the `archive.build-gadi.1` run uses executables built using https://github.com/penguian/access-esm-build-gadi.
```
[pcl851@gadi-login-06 access-esm]$ diff -U0 archive.orig/access-esm/output000/config.yaml archive.build-gadi.1/access-esm/output000/config.yaml
--- archive.orig/access-esm/output000/config.yaml	2024-02-02 11:16:20.000000000 +1100
+++ archive.build-gadi.1/access-esm/output000/config.yaml	2024-02-05 11:05:24.000000000 +1100
@@ -13 +13 @@
- exe: /g/data/access/payu/access-esm/bin/coe/um7.3x
+ exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/um_hg3.exe
@@ -20 +20 @@
- exe: /g/data/access/payu/access-esm/bin/coe/mom5xx
+ exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/mom5xx
@@ -28 +28 @@
- exe: /g/data/access/payu/access-esm/bin/coe/cicexx
+ exe: /g/data/tm70/pcl851/src/penguian/access-esm-build-gadi/bin/cice4.1_access-mct-12p-20240205
```
The `mule-cumf` tool states that both runs fail validation checks. Is there an earlier version of `cumf` that can be used to check the output of UM 7.3?
```
$ mule-cumf archive.orig/access-esm/restart000/atmosphere/restart_dump.astart archive.build-gadi.1/access-esm/restart000/atmosphere/restart_dump.astart
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
* (CUMF-II) Module Information *
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
mule       : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py (version 2022.07.1)
um_utils   : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/__init__.py (version 2022.07.1)
um_packing : /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_packing/__init__.py (version 2022.07.1) (packing lib from SHUMlib: 2023061)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning:
File: archive.orig/access-esm/restart000/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
    Field grid latitudes inconsistent (STASH grid: 23)
      File             : 145 points from -90.0, spacing 1.25
      Field (Expected) : 180 points from -89.5, spacing 1.25
      Field (Lookup)   : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4099,4101,5484,5523)
    Skipping Field validation due to irregular lbcode:
      Field lbcode: 31320
  warnings.warn(msg)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/validators.py:198: UserWarning:
File: archive.build-gadi.1/access-esm/restart000/atmosphere/restart_dump.astart
Field validation failures:
  Fields (1114,1115,1116)
    Field grid latitudes inconsistent (STASH grid: 23)
      File             : 145 points from -90.0, spacing 1.25
      Field (Expected) : 180 points from -89.5, spacing 1.25
      Field (Lookup)   : 180 points from 89.5, spacing -1.0
Field validation failures:
  Fields (4099,4101,5484,5523)
    Skipping Field validation due to irregular lbcode:
      Field lbcode: 31320
  warnings.warn(msg)
Traceback (most recent call last):
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/bin/mule-cumf", line 10, in <module>
    sys.exit(_main())
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 1385, in _main
    comparison = UMFileComparison(um_files[0], um_files[1])
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 728, in __init__
    diff_field = difference_op([field_1, field_2])
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py", line 952, in __call__
    new_field = self.new_field(source, *args, **kwargs)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/um_utils/cumf.py", line 293, in new_field
    data1 = fields[0].get_data()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/__init__.py", line 730, in get_data
    data = self._data_provider._data_array()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-23.07/lib/python3.10/site-packages/mule/ff.py", line 193, in _data_array
    data = np.fromstring(data_bytes, dtype, count=count)
ValueError: string is smaller than requested size
```
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/1
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/how-to-use-umf-or-mule-cumf-with-access-esm1-5-um-output/1794/3
@aidanheerdegen Has ACCESS-NRI decided on how to use `rose stem` tests to test executable reproducibility? Would this be part of an approach to test model reproducibility?
> @aidanheerdegen Has ACCESS-NRI decided on how to use `rose stem` tests to test executable reproducibility? Would this be part of an approach to test model reproducibility?
No decision has been made. I need to read some documentation about `rose stem` testing, but if you had an example suite for testing on NCI that I could look at, that would be helpful.
@MartinDix uses `rose stem` to test each UM release. I believe that this `rose stem` test is just a subset of the tests in https://code.metoffice.gov.uk/trac/um/browser/main/trunk/rose-stem
The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted.
> The standard UM rose stem tests aren't a good match for our configurations, e.g. nothing with CABLE and nothing anywhere near as old as ESM1.5. They're convenient for testing the effect of code updates across a range of configurations but if we're interested in changes to released configurations we could use something more targeted.
That said, if we intend to contribute code changes upstream, we would probably also need corresponding `rose stem` tests.
- As a model component developer, I want a tool to verify model reproducibility. As a model developer, I want the same tool to work for all the components of the model.
- As a release team member, I want to be able to use the same tool when developing CI tests for a number of different models.
- As a user of models, I want to be confident that model updates will not change the answers of my experiments, unless this has been specifically documented.