NOAA-GFDL / MOM6-examples

Example configurations for MOM6 and SIS2

Proposal for regression testing on multiple platforms #95

Open adcroft opened 8 years ago

adcroft commented 8 years ago

(If I get feedback on this ticket I will revise this original post but will indicate updates at the bottom).

We have a major and imminent issue, which was anticipated, concerning regression testing on multiple platforms. The regression answers are generated and updated on GFDL's HPC, so a developer on a different platform, or with a different compiler version, cannot reproduce them. The current design is clearly not sustainable as MOM6 expands to more and different platforms.

The original concept of version controlling code, inputs, and answers all together has proven very successful by many measures (e.g. we know when we broke something, whom to blame, and what to fix), even though it has added a burden to the development workflow. There are still gaps, such as the lack of versioning of very large input datasets. At GFDL, FRE is filling in some gaps, such as recording environments, but not all of them, and it does not solve the imminent multi-platform regression issue. We are very certain that the general concept of version controlling everything is correct. However, we do recognize that the implementation needs some tweaking.

So we'd like input on a proposal, namely ideas 5 and 6 below, and discussion on relevant issues.

Considerations:

RA = Regression answers, meaning the content of ocean.stats and seaice.stats files.

  1. RA are primarily used for checking that code modifications and new features do not unintentionally alter solutions. These tests currently demand reproduction down to the last bit.
  2. In addition to code and model inputs, RA depend on platform, compiler, compiler version, and optimization and other compiler options. There are many, many combinations of these variables.
  3. RA do change due to bug fixes or decisions about best practice (configuration choices).
  4. Version control of RA, inputs, and code, all together is essential.
  5. New feature code should be tested and provided with RA.
  6. Checking and generating RA needs to have a short turnaround to be useful. We currently live with about 15 minutes.
  7. It is unlikely that GFDL developers or external developers will have direct access to each other's platforms, let alone to all platforms, and thus they will be unable to update RA for any platform but their own. Even if GFDL had access, the range of platforms would make the exercise prohibitive.

Some ideas:

1. Adopt a more fuzzy regression approach

There is a notion that we could measure the expected error between platforms/compilers, etc., and then create tools to check that a model answer falls within bounds. This seems doable, but when we recently attempted such a measurement while changing platforms at GFDL it was a lot of work, because the tools don't exist. Further, many of the bugs we catch are due to the least significant bit changing, meaning the dynamical error growth is small compared to the computational error for the duration of our tests (which are deliberately short). Such bugs would fly under the radar.
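
For concreteness, such a fuzzy check might look something like the sketch below. This is not an existing tool; it assumes the stats files reduce to whitespace-separated numeric columns after a two-line header, and the file names and tolerance are placeholders.

```
# Sketch of a tolerance-based comparison of a stats file against a
# reference (hypothetical file names; rtol is a placeholder value).
awk -v rtol=1e-12 '
  NR == FNR { for (i = 1; i <= NF; i++) ref[FNR, i] = $i; next }
  FNR > 2 {
    for (i = 1; i <= NF; i++) {
      r = ref[FNR, i] + 0; d = $i - r; if (d < 0) d = -d
      if (d > rtol * (r < 0 ? -r : r))
        printf "line %d, col %d: %s vs %s\n", FNR, i, $i, ref[FNR, i]
    }
  }
' ocean.stats.ref ocean.stats
```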

2. Different branches of MOM6-examples for different platforms

Using a different branch of MOM6-examples to track answers on different platforms seems sensible. The challenge is that if we change inputs or code in a way that changes the RA, then those changes should all be in the same commit. On a different platform's branch, we would need to merge in the input/code changes but not the changes to the RA. This idea fails because it requires breaking the git paradigm of all related changes belonging to one commit.

3. Separate MOM6-examples repositories for each platform

Here we envision a MOM6-examples for GFDL's HPC and another repository for an Ubuntu Linux platform, etc. This suffers from the same problem as branches above - merging a commit from one repo to another brings the changes to the wrong RA with it.

4. Move the RA into a submodule (their own repository)

Currently the RA files (ocean.stats) live in the example directory with the inputs. In this solution we would create a new repository for the RA alone with a directory tree mirroring MOM6-examples. We then add the RA repository as a submodule to MOM6-examples. When inputs/code change the RA, the new submodule commit id is included in the commit with the changed inputs. Oddly, this still suffers from the merge dilemma - merging from one platform branch to another would bring the commit id of the RA submodule and thus be wrong.
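
In git terms, the arrangement would be something like the following (the RA repository name and URL are hypothetical):

```
# Hypothetical: the ocean.stats/seaice.stats files move to their own
# repository, which MOM6-examples then carries as a submodule.
git submodule add https://github.com/NOAA-GFDL/MOM6-regressions regressions
git commit -m "Track regression answers in a submodule"
```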

5. Create a RA repository and add MOM6-examples as a submodule

In this solution, the RA repository would have a directory tree mirroring MOM6-examples, containing just the RA files, and at the top it would have a submodule link to MOM6-examples. And here's the trick: there is no need for inter-branch merges. Imagine we change inputs in MOM6-examples. We update the RA files and record the new submodule commit id on branch "platform_A". On branch "platform_B" nothing has changed - it still points to an older commit id and has the correct RA. Now a developer on platform B can update MOM6-examples, generate the new RA, and make a commit. No merging is needed. The commit id of the single submodule (MOM6-examples) acts as a key (a tag, I suppose) for which the RA either exist on a platform or they don't.
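
A sketch of the resulting workflow on one platform (the repository and branch names are hypothetical, and the regeneration step stands in for whatever test runner a site uses):

```
# Clone the RA repository, which carries MOM6-examples as a submodule.
git clone --recursive https://github.com/NOAA-GFDL/MOM6-regressions
cd MOM6-regressions
git checkout platform_A            # one branch (or fork) per platform

# Bring in new MOM6-examples inputs/code, then regenerate the answers.
(cd MOM6-examples && git pull)
# ... rebuild the model and rerun the suite, refreshing the
# ocean.stats/seaice.stats files in the mirrored directory tree ...

# The new RA and the new submodule commit id land in a single commit.
git add -A
git commit -m "Update RA for new MOM6-examples commit on platform_A"
```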

6. As for 5 but use an RA for each platform

The disadvantage of a single RA repository (in the NOAA-GFDL organization) is that its branches are writable only by GFDL. Forks would solve this, but why bother, since merges and pull requests between RA repositories are meaningless? On the other hand, forks are at least linked, so you could find all the RA repositories.


Edits will be noted here. This OP will hopefully evolve into a plan.

  1. I inadvertently posted a very incomplete draft as #94. This post replaces that ticket.
Hallberg-NOAA commented 8 years ago

Reading through this thoughtful discussion, I see the value of approaches #5 or #6, as well as the need to think about how to address these emerging issues as we consider quality control across multiple platforms at multiple institutions. However, I would like to see some more clarification of the pros and cons between #5 (one set of RAs) and #6 (multiple sets of RAs). It strikes me that #5 could be the pathway to #6, with the second set of RAs emerging only once they are needed.

Some input from folks from outside of GFDL would be valuable on this point.


nichannah commented 8 years ago

Thanks for the summary. I agree that 1-4 are less desirable - particularly because of the extra effort needed to maintain changes across branches/forks. I like the elegance of #5. Just to clarify, do you imagine the RA repository would have separate files for each platform+compiler combination?

There is possibly a #7 option, which concerns finding a platform-independent way to generate answers. I've been experimenting with this using a software implementation of floating point arithmetic which avoids the FPU altogether (see Berkeley SoftFloat, http://www.jhauser.us/arithmetic/SoftFloat.html). So far it's looking promising, but I've only tried toy problems on a few platforms. This approach would also come with some downsides: 1) it's likely to be quite slow, I'm guessing 5-10x; 2) it would require a non-standard build, including compiling the soft fp library itself.

Also, I was interested to hear that answer changes are often in the least significant bit. So it's a question of running tests just long enough to detect a computational error? I'm wondering whether we could try to reduce the runtime by using some (or all) long doubles (128 bit).

Hallberg-NOAA commented 8 years ago

I think that the idea of getting machine-independent answers by avoiding the FPU altogether is interesting, but probably not the right solution for all our testing, in that we are commonly interested in ensuring that the optimized answers on any given machine are reproducible. I would think of this as being the equivalent of another (virtual) machine, but with the great virtue of being reproducible across hardware platforms. In other words, I see this idea as a potential complement to our existing approach to testing, not a replacement.

The answer changes between platforms do usually start out in the least significant bits in MOM6, as the result of a lot of very careful coding practices. To explain how we have achieved this, and to explain why we think that it is generally impossible to do even better, it is worthwhile to discuss 3 distinct sources of differences across machines and compilers.

  1. First, we have order of arithmetic differences. We can and do control these by the use of parentheses (assuming that compilers respect parentheses). This is particularly important when we are taking sums of 3 or more quantities that can be of either sign, because in this case the differences are not guaranteed to be in the least significant bits. For example, 10^20 - 10^20 + 1 might return either 0 or 1 with 64 bit floating point arithmetic, depending on the order of the sums (see the short demonstration after this list). In some cases, like in the denominators of the MOM6 tridiagonal solvers, there is a right order of arithmetic that gives the right answer, and we carefully enforce this with parentheses and by introducing temporary variables that ensure that only the most devious compiler optimization would get the wrong answer. In other cases, like the spatial averaging of 4 points, MOM6 uses the sets of parentheses that give an answer that is invariant to the rotation of the problem. In still other cases, like the sum of 3 or more tendencies in the momentum equations, there is no "right" order, but we still use parentheses so that we get the same order of sums regardless of compiler settings. When taking products of multiple terms (but not exact powers of 2), the answers can differ in the least significant bit; while this can be controlled with parentheses, in MOM6 we have not been systematic about forcing the products to be taken in the same order. We also use a special extended-fixed-point representation to get order-invariant sums across processors for things like global energy metrics and tests of conservation properties (see Hallberg & Adcroft, 2014, Parallel Computing).
  2. Different compilers use different math libraries for transcendental functions, like sin, cos, or tanh, and may choose to use either libraries or hardware-encoded algorithms (on some machines) for common functions like sqrt, exp, and even division. Although the differences in these functions should only show up in the least significant bits, we do not know of any practical way to control these differences. Hence, different compilers will always give different answers.
  3. Different machines use different algorithms for floating point operations. For example, they may use more bits inside the chip, even if the result that is stored to memory is in a standard 64-bit format. Because of this, we expect that different machines will always give different answers, even when the same compiler and compiler settings are used. However, it is our expectation that these differences will also arise in the least significant bits.
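
The order-of-arithmetic example in item 1 can be demonstrated with any IEEE 64-bit floating point arithmetic; awk is used here only because it evaluates expressions in doubles:

```
# The two association orders of 10^20 - 10^20 + 1 disagree at leading
# order, because -1e20 + 1 rounds back to -1e20 in 64-bit floats.
awk 'BEGIN { print (1e20 - 1e20) + 1 }'   # prints 1
awk 'BEGIN { print (-1e20 + 1) + 1e20 }'  # prints 0
```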

However, in an ocean model we are dealing with nonlinear (often chaotic) systems of equations with discrete logical branches, and we are often interested in tests that run long enough to ensure that even subtle differences in macroscopic metrics will be detected. We therefore do not expect that differences arising in the least significant bits will stay there, nor that solutions that after a while differ even at leading order are necessarily wrong. This is why our testing has emphasized bitwise identical reproduction of answers on whatever machines and compilers will actually be used, and why I think it must continue to do so.

StephenGriffies commented 8 years ago

Bob,

I added an edited version of this very informative email to the MOM6 wiki.

Thanks, Steve


adcroft commented 8 years ago

@nicjhan asked:

Just to clarify, do you imagine the RA repository would have separate files for each platform+compiler combination?

The git philosophy (and actuality) requires a commit to be complete. A commit should contain only RA files that can be generated together, at once. Updating RA files from different platforms in the same repository in the same commit is very hard to imagine, so I think we would normally not have multiple platform RAs living together (on the same branch, at least). If you have multiple platforms with access to the same file system, and you can deal with the workflow of running everything across those platforms for each test, then they could live together; this would even be encouraged, since it gives greater coverage for the testing.

Multiple compilers (and versions) can live together because, as we demonstrate routinely, it is easy to maintain multiple executables and RAs side by side. To incorporate multiple versions of compilers, we need only modify the labels we currently use for the RA files. BTW, is it likely that multiple versions of the same compiler will catch bugs?
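
For concreteness, a hypothetical listing of one experiment's RA files, with per-compiler labels in the spirit of those used now and invented per-version suffixes added:

```
ls ocean_only/double_gyre/
#   ocean.stats.gnu      ocean.stats.intel      ocean.stats.pgi
#   ocean.stats.gnu-7.1  ocean.stats.intel-16.0   <- hypothetical version labels
```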

Following on from that, I'm advocating that we adopt a repository per platform, or per group of platforms with access to the same file system, or per institution, on a case-by-case basis. The main benefit is that it allows other users to adopt the regression testing practices we use at GFDL without needing access to each other's platforms.

adcroft commented 8 years ago

@nicjhan wrote:

There is possibly a #7 option, which concerns finding a platform-independent way to generate answers. I've been experimenting with this using a software implementation of floating point arithmetic which avoids the FPU altogether (see Berkeley SoftFloat, http://www.jhauser.us/arithmetic/SoftFloat.html). So far it's looking promising, but I've only tried toy problems on a few platforms. This approach would also come with some downsides: 1) it's likely to be quite slow, I'm guessing 5-10x; 2) it would require a non-standard build, including compiling the soft fp library itself.

I like this idea. If this works and is deployable for others then I would certainly like to adopt this as a way to validate code installs and modifications. I agree with @Hallberg-NOAA's first comment; the inefficiency and disconnect from compilers/chipsets used for production means we will still need to regression test as we are currently doing. So I think we should pursue 6 and 7.

adcroft commented 8 years ago

To illustrate ideas 5 and 6, I've re-created such a regression repository for the history of commits in MOM6-examples since we switched platforms to c3. I wish we'd thought of this earlier, because it would have allowed the c1 and c3 RAs to co-exist for the brief time both machines existed with access to the same file system.

FWIW, @nicjhan's idea #7 could use a public/universal repository like this, since everyone should be able to reproduce the RAs in it.

Things to note:

nichannah commented 8 years ago

Just a brief note. I'm making progress with the idea of running the model using (reproducible) software FP.