anton-seaice opened this issue 1 month ago
To clarify further, there are (at least) three distinct things we need to test for:
These form a hierarchy: we can't correctly interpret and investigate a failure of test 2 or 3 unless the model has first passed test 1 (and failures of test 1 can happen: we saw reproducible failures of test 1 in early versions of ACCESS-OM3 and occasional failures of test 1 in production runs of ACCESS-OM2).
The next question is how to test for these three model properties. I wonder why ocean.stats is being used, rather than the restart md5 hashes from the payu manifest?
Checksums and ocean.stats cannot give a watertight guarantee, because there may be roundoff in ocean.stats, or compensation such as alternating errors which leave a checksum unchanged. Restart manifest md5 hashes offer a much stronger guarantee, essentially as good as a binary diff on the restarts but much faster. (In principle they may suffer the same cancellation problem, or be incorrect due to the payu logic that decides whether to calculate the md5 hash based on differences in the binhash, but this is vanishingly unlikely.)
Of course, checks based on restarts rely on the restarts actually capturing the complete model state, i.e. test 2 passing.
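As a sketch of what a manifest-based check could look like, here is a comparison of md5 entries between two runs. The manifest layout (a mapping from restart file paths to a dict of hashes) mirrors the shape of payu's restart.yaml, but the function name and the file names are illustrative assumptions, not payu's API:

```python
import hashlib

def compare_manifests(baseline, candidate):
    """Return the restart file paths whose md5 hashes differ.

    Each manifest is assumed to be a dict mapping restart file paths to a
    dict of hashes, e.g. {"md5": "...", "binhash": "..."} -- an assumed
    layout modelled on payu's restart.yaml, parsed in advance.
    """
    mismatches = []
    for path, hashes in baseline.items():
        if candidate.get(path, {}).get("md5") != hashes.get("md5"):
            mismatches.append(path)
    return mismatches

# Toy example: one restart file reproduces, one does not.
run_a = {"ocean.res.nc": {"md5": hashlib.md5(b"state A").hexdigest()},
         "ice.res.nc":   {"md5": hashlib.md5(b"ice").hexdigest()}}
run_b = {"ocean.res.nc": {"md5": hashlib.md5(b"state B").hexdigest()},
         "ice.res.nc":   {"md5": hashlib.md5(b"ice").hexdigest()}}

print(compare_manifests(run_a, run_b))  # ['ocean.res.nc']
```

A check like this covers every file in the manifest, so it would pick up non-reproducing components (e.g. CICE restarts) that a single ocean.stats comparison would miss.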
Thanks to @dougiesquire for pointing out there are two tests marked "checksum_slow", which cover items 1. and 2. in @aekiss' list.
The test marked "checksum" covers item 3.
Only tests marked "access_om2" (i.e. the qa tests) and those marked "checksum" are run in CI, but maybe we should run the "checksum_slow" tests more often? (is the compute cost really that high?)
> maybe we should run the "checksum_slow" tests more often? (is the compute cost really that high?)
It's not just compute cost. The tests use the PBS queueing system, which can be slow and unpredictable, so they aren't a great fit for routine CI, where rapid answers are the norm. The counter-argument is that you can just cancel a test if you don't need the results, but I'm generally in favour of the default automatic behaviour of these systems being the one that is most commonly used, without requiring human intervention, because human.
We also talked about putting in some "on demand" repro testing via comment commands. I think we'll do this, just a question of prioritisation.
I am wondering if it makes sense to run them at low resolution (fairly fast) and skip them at high resolution (on the assumption that the binary is the same at both resolutions). Also, running them only when the historical checksum test fails avoids waiting for them unless it's needed.
> I wonder why ocean.stats is being used, rather than the restart md5 hashes from the payu manifest?
This was simply to get something in place, because something is better than nothing; it's also what is done in MOM6's own regression testing. But I agree something more robust would be better. We're also not currently testing any CICE output, but we should.
Fair enough. The manifest hashes would cover all model components, so that's another good reason to use them.
However, it's worth noting that the barotropic restarts in ACCESS-OM2 did not have reproducible md5 hashes; since all the other components did reproduce, this was not investigated further. It might just be something like a run timestamp in the barotropic restarts.
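A run timestamp would indeed break hash reproducibility even when the physical state is identical. A toy illustration (this is not the actual barotropic restart format, just an assumed file-with-header shape):

```python
import hashlib

def restart_md5(field_bytes, timestamp):
    """md5 of a toy restart file that embeds a creation timestamp in its header."""
    return hashlib.md5(timestamp.encode() + field_bytes).hexdigest()

state = b"\x00" * 64  # identical model state in both runs
h1 = restart_md5(state, "2024-01-01T00:00:00")
h2 = restart_md5(state, "2024-01-01T00:05:00")
print(h1 == h2)  # False: hashes differ despite identical model state
```

If this is the cause, hash-based checks would need to either strip such metadata before hashing or exempt that file from the comparison.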
Thanks to @aekiss for highlighting this.
The current "historical" reproducibility tests check that the model configuration's results match the saved checksums in the configuration repository. There are two issues with this:
- the comparison is based on the ocean.stats file (i.e. compensating errors would get missed).
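The compensating-error concern can be shown with a toy example: a global sum (the kind of aggregate diagnostic a stats file reports) is unchanged by equal-and-opposite perturbations, while an md5 over the raw bytes is not. This is an illustrative sketch, not the actual ocean.stats contents:

```python
import hashlib
import struct

def global_sum(values):
    """Aggregate diagnostic, like a global tracer sum in a stats file."""
    return sum(values)

def md5_of(values):
    """md5 over the raw bytes of the field, like a restart manifest hash."""
    return hashlib.md5(struct.pack(f"{len(values)}d", *values)).hexdigest()

baseline  = [1.0, 2.0, 3.0, 4.0]
perturbed = [1.0 + 0.5, 2.0 - 0.5, 3.0, 4.0]  # equal-and-opposite errors

print(global_sum(baseline) == global_sum(perturbed))  # True: sum-based check passes
print(md5_of(baseline) == md5_of(perturbed))          # False: the hash catches it
```

This is the argument for preferring restart manifest md5 hashes over ocean.stats comparisons made earlier in the thread.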