ACCESS-NRI / model-config-tests

Tests for checking model configurations
Apache License 2.0

Failing the checksum reproducibility test masks testing actual reproducibility #39

Open anton-seaice opened 1 month ago

anton-seaice commented 1 month ago

Thanks to @aekiss for highlighting this.

The current "historical" reproducibility tests check that the current model configuration results are the same as the saved checksum in the configuration repository. There are two issues with this:

  1. It's not very strict, as it only looks at the ocean.stats file (i.e. compensating errors would get missed).
  2. When it fails (which is often, i.e. every time the configuration is changed), it doesn't test that the new configuration is reproducible with itself. To confirm reproducibility, we need to run the configuration under test twice under identical conditions, as in the sketch below.
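
A minimal sketch of what item 2 proposes; `run_and_checksum` and its signature are hypothetical, standing in for however the test harness runs the model and collects checksums:

```python
def run_and_checksum(config):
    """Hypothetical helper: run the configuration once under fixed
    conditions and return its output checksums as a dict."""
    raise NotImplementedError

def test_self_reproducibility(config):
    # Run the configuration under test twice under identical conditions
    # and require identical checksums.
    first = run_and_checksum(config)
    second = run_and_checksum(config)
    assert first == second, "configuration does not reproduce itself"
```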
aekiss commented 1 month ago

To clarify further, there are (at least) three distinct things we need to test for:

  1. determinism - whether the same model produces identical results when run under identical conditions (including the same length of run)
  2. integrity of restarts - whether an experiment produces identical results regardless of whether it is conducted as a single model run or several, i.e. whether restarts precisely capture the complete model state
  3. model regressions - whether a change in the model code or configuration has unexpectedly changed the results

These form a hierarchy - we can't correctly interpret and investigate a failure of test 2 or 3 unless the model has first passed test 1 (and failures of test 1 can happen: we saw reproducible failures of test 1 in early versions of ACCESS-OM3 and occasional failures of test 1 in production runs of ACCESS-OM2).
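
In the same hypothetical style, tests 2 and 3 might look roughly like this (the run lengths, restart chaining, and `saved_checksums` fixture are all illustrative, not the repository's actual API):

```python
def run_model(config, days, restart=None):
    """Hypothetical helper: run the model for `days` days, optionally
    starting from a restart directory; return its restart directory."""
    raise NotImplementedError

def checksums(restart_dir):
    """Hypothetical helper: return {file: checksum} for a restart dir."""
    raise NotImplementedError

def test_restart_integrity(config):
    # Test 2: one continuous 2-day run must match two chained 1-day runs.
    continuous = run_model(config, days=2)
    leg1 = run_model(config, days=1)
    leg2 = run_model(config, days=1, restart=leg1)
    assert checksums(continuous) == checksums(leg2), \
        "restarts do not capture the complete model state"

def test_no_regression(config, saved_checksums):
    # Test 3: the current code/config must match the checksums stored
    # in the configuration repository.
    assert checksums(run_model(config, days=2)) == saved_checksums
```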

aekiss commented 1 month ago

The next question is how to test for these three model properties. I wonder why ocean.stats is being used, rather than the restart md5 hashes from the payu manifest?

Checksums and ocean.stats cannot give a watertight guarantee, because there may be roundoff in ocean.stats or compensation such as alternating errors which leave a checksum unchanged. Restart manifest md5 hashes offer a much stronger guarantee, essentially as good as a binary diff on the restarts, but much faster (in principle they may suffer the same cancellation problem, or be incorrect due to the payu logic that decides whether to calculate the md5 hash based on differences in the binhash, but this is vanishingly unlikely).

Of course, checks based on restarts rely on the restarts actually capturing the complete model state, i.e. test 2 passing.
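
As a hedged sketch, the comparison could be as simple as loading the two runs' restart manifests and diffing the md5 entries. This assumes payu writes `manifests/restart.yaml` as a YAML mapping of file paths to a `hashes` block containing an `md5` entry; the exact layout may differ between payu versions.

```python
import yaml

def restart_md5s(manifest_path):
    """Return {path: md5} from a payu restart manifest."""
    with open(manifest_path) as f:
        # Take the last YAML document, in case the manifest carries a
        # header document before the file entries.
        entries = list(yaml.safe_load_all(f))[-1]
    return {path: info["hashes"]["md5"] for path, info in entries.items()}

def assert_restarts_match(manifest_a, manifest_b):
    a, b = restart_md5s(manifest_a), restart_md5s(manifest_b)
    assert a.keys() == b.keys(), "runs produced different restart files"
    mismatched = sorted(path for path in a if a[path] != b[path])
    assert not mismatched, f"md5 mismatch in: {mismatched}"
```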

anton-seaice commented 1 month ago

Thanks to @dougiesquire for pointing out there are two tests marked "checksum_slow", which cover items 1. and 2. in @aekiss' list.

The test marked "checksum" covers item 3.

Only tests marked "access_om2" (i.e. the QA tests) and those marked "checksum" are run in CI, but maybe we should run the "checksum_slow" tests more often? (Is the compute cost really that high?)
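
For reference, pytest's marker selection works along these lines; the test names below are made up, only the marker names come from this thread:

```python
import pytest

@pytest.mark.checksum
def test_historical_checksums():
    ...  # fast: compare one run against the saved checksums

@pytest.mark.checksum_slow
def test_self_reproducibility():
    ...  # slow: run the configuration twice and compare

# CI could then select subsets with, e.g.:
#   pytest -m checksum        # what currently runs in CI
#   pytest -m checksum_slow   # the slower repro tests
```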

aidanheerdegen commented 1 month ago

> maybe we should run the "checksum_slow" tests more often? (is the compute cost really that high?)

It's not just compute cost. The tests use the PBS queuing system, which can be slow/unpredictable, so they aren't a great fit for routine CI, where rapid answers are the norm. The counter to this is that it is possible to just cancel a test if you don't need the results, but I'm generally in favour of the default automatic behaviour of these systems being the one that is most commonly used, not one requiring human intervention, because human.

We also talked about putting in some "on demand" repro testing via comment commands. I think we'll do this, just a question of prioritisation.

anton-seaice commented 1 month ago

I am wondering if it makes sense to run them at low resolution (fairly fast) and skip them at high resolution (on the assumption that the binary is the same at both resolutions). Also, running only when the historical checksum test fails avoids waiting for the test unless it's needed.
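
Both ideas could be combined in one test; everything here except `pytest.skip` is hypothetical:

```python
import pytest

RESOLUTION = "1deg"  # hypothetical: resolution of the config under test

def run_and_checksum():
    """Hypothetical helper: run the configuration and return checksums."""
    raise NotImplementedError

def test_reproducibility(saved_checksums):
    if RESOLUTION != "1deg":
        pytest.skip("slow repro test only runs at low resolution")
    current = run_and_checksum()
    if current == saved_checksums:
        return  # historical checksums match; skip the slow second run
    # Historical checksums differ (e.g. the config changed), so at least
    # confirm the new configuration reproduces itself.
    assert current == run_and_checksum(), "config is not deterministic"
```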

dougiesquire commented 1 month ago

> I wonder why ocean.stats is being used, rather than the restart md5 hashes from the payu manifest?

This was simply to get something in place, because something is better than nothing. It is what is done in MOM6's own regression testing. But I agree something more robust would be better. We're also not currently testing any CICE output, but we should.

aekiss commented 1 month ago

Fair enough. The manifest hashes would cover all model components, so that's another good reason to use them.

aekiss commented 1 month ago

However, it's worth noting that the barotropic restarts in ACCESS-OM2 did not have reproducible md5 hashes; but since all the other components did reproduce, this was not investigated further. It might just be something like a run timestamp in the barotropic restarts.
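
If that turns out to be run metadata such as a timestamp, one option is to exclude the affected files before comparing; the pattern below is purely illustrative:

```python
EXCLUDE_PATTERNS = ("barotropic",)  # hypothetical: known non-repro restarts

def filter_md5s(md5_by_path):
    """Drop manifest entries whose path matches an excluded pattern,
    e.g. restarts known to embed a run timestamp."""
    return {path: md5 for path, md5 in md5_by_path.items()
            if not any(pat in path for pat in EXCLUDE_PATTERNS)}
```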