Broken rotational symmetry and inefficiency with USE_LEITHY=True

NCAR / MOM6

NCAR/CESM fork of the Modular Ocean Model v.6 (MOM6)

Other

2 stars 19 forks source link

Broken rotational symmetry and inefficiency with USE_LEITHY=True #260

Closed Hallberg-NOAA closed 5 months ago

Hallberg-NOAA commented 11 months ago

Although I believe it to be mathematically correct, there are two problems with the recently added smooth_x9() routine in MOM_hor_visc.F90 that is used when USE_LEITHY = True.

Because smooth_x9() uses the array-syntax sum() function to add up contributions in the horizontal, we have no control over the order of arithmetic, and it is virtually certain that this will break the rotational symmetry of these calculations. In MOM6, we use rotational symmetry to detect horizontal indexing bugs and inconsistencies between the implementation of the u- and v-components of certain expressions. Fixing this would be relatively straightforward with a mathematically equivalent refactoring that does preserve rotational symmetry, but it would change answers and hence the strategy for fixing this problem will depend on whether this code (which is enabled by setting the runtime parameter USE_LEITHY to true) is being used yet in any answers that need to be preserved. Please let us know whether this option is in active use yet, in which case a new ANSWER_DATE flag will be needed to preserve the existing answers.
When USE_LEITHY is true, the routine smooth_x9() is called 4 times for each model layer, and there are 2 blocking halo update calls on 2-d arrays within each call to smooth_x9(). For a 75-layer model, there will be a total of 600 more blocking halo updates added by setting USE_LEITHY=True, which is more than the rest of the model combined. As a result, the smoothing with USE_LEITHY=True is likely to be the single most expensive operation in MOM6 for some configurations. This excessive number of halo updates could be avoided without changing any answers by either working inward from wider halos that are updated before horizontal_viscosity() is called or by refactoring horizontal_viscosity() so that all of the layers' halos can be updated simultaneously with 3-d halo updates.

iangrooms commented 11 months ago

Sorry for the delayed response, and thanks to @gustavo-marques for pointing me to this issue.

We don't have any answers that need preserving at this point, thankfully. Is the reproducing_sum function the answer to this? I.e. if I replace sum with reproducing_sum will it solve the problem? If so, then instead of having you fix it, I would like to suggest that I can push the fix along with some other changes that I've made. I found and fixed one bug in the implementation of Leith+E, and I also found one bug in the implementation of biharmonic Leith. I've fixed both here. I can update smooth_x9 to use reproducing_sum and push that along with the other fixes.
Leith+E requires smoothing the velocity. Fortunately it's easy to move that outside the loop over k (at the expense of extra memory), which exchanges a halo update on a 3D array for many halo updates on 2D arrays. The other two calls (which smooth the biharmonic coefficient and the "m" coefficient that defines the backscatter coefficient via Kh = m * Ah) are not strictly necessary. I will experiment with removing the smoothing of Ah and m_leithy. I 100% agree that it is important to address this, but would like to mention that we're getting good performance in the candidate 1/4 degree grid with 75 layers, so the halo updates are not currently slowing us down much.
My 'fix' for biharmonic Leith, mentioned above, introduces another halo update inside the do k=1,... loop in MOM_hor_visc.F90. In light of point 2 (above & below), that needs to be fixed. Fortunately it can be moved outside the do k=1,... loop without too much effort.

Hallberg-NOAA commented 11 months ago

Thanks for your taking this issue on, @iangrooms and @gustavo-marques. I think that the two sub-issues from the original comment can be addressed in separate comments. With this comment I will just address the paired points 1. (I am going to belabor this point a bit for clarity, although I suspect that you two will get the key point long before reaching the end of this comment!

Although using the MOM6 reproducing sums facility would give rotationally symmetric (and accurate) answers by being order invariant, this would be a more complicated and slightly more expensive solution, and it is not necessary for rotational symmetry. Instead, I would suggest grouping the 9 terms into pairs with parentheses so that the order is rotationally invariant, exploiting the structure of the grid.) I will illustrate these rotationally symmetric sums first with a 4 point average for clarity before giving the solution for the 9-point solution.

The basic problem is that with floating point math, although a + b = b + a, we do not have that (a + b) + c = a + (b + c). This is illustrated most vividly by setting a=-1, b=1 and c=1e-20. The left hand side of the three-point sum becomes (-1 + 1) + 1e-20 = 0 + 1e-20 = 1e-20, which of course is the right answer, but after taking 64-bit real roundoff into account the right hand side is equivalent to -1 + (1 + 1e-20) = -1 + 1 = 0. If you then take the reciprocal of this sum, this difference can have dramatic consequences, as 1e20 and NaN behave very differently! So in MOM6 we always group pairs of addition of real numbers so that we control the answers we get, rather than leaving it up to the compiler. (Note that for integers none of this applies.) In some cases this choice is arbitrary, but in the case of the rotationally symmetric horizontal averaging there is a correct choice.

If we were doing a 4-point average (e.g. to interpolate thicknesses to a vorticity point), instead of writing h_avg(I,J) = sum(h(i:i+1,j:j+1)) * 0.25, which a compiler might interpret as h_avg(I,J) = ( ((h(i,j) + h(i+1,j)) + h(i,j+1)) + h(i+1,j+1) ) * 0.25, to obtain a rotationally invariant solution we would write this out explicitly as h_avg(I,J) = ( (h(i,j) + h(i+1,j+1)) + (h(i+1,j) + h(i+1,j+1)) ) * 0.25, which gives the same order of paired sums when rotated by 90 or 180 degrees. Graphically if we draw a line through the points in the order of the sums, the order of arithmetic in the first version looks like a backwards 'Z', which when rotated looks like an 'N'. By contrast the reproducing sum looks like an 'X', which is still an 'X' when rotated. If we had grouped this instead as h_avg(I,J) = ( (h(i,j) + h(i+1,j)) + (h(i,j+1) + h(i+1,j+1)) ) * 0.25, this would be invariant to a 180 degree rotation but not 90, and the pattern of the sums would look like an '=', which when rotated becomes '||'.

So for the 9-point filter here we can obtain a rotationally symmetric sum by grouping the sums of the diagonal points, the sums of the axis points and the central point. That is, instead of writing a_sum(i,j) = sum(a(i-1:i+1,j-1:j+1), which allows the complier to determine the order of the sums and is not likely to be rotationally symmetric, we might write this as the mathematically equivalent expression

a_sum(i,j) = a(i,j) + ( ( (a(i-1,j-1) + a(i+1,j+1)) + (a(i-1,j+1) + a(i+1,j-1)) ) + ( (a(i-1,j) + a(i+1,j)) + (a(i,j-1) + a(i,j+1)) ) )

or equivalently as

a_sum(i,j) = a(i,j) + ( ( (a(i-1,j) + a(i+1,j)) + (a(i,j-1) + a(i,j+1)) ) + ( (a(i-1,j-1) + a(i+1,j+1)) + (a(i-1,j+1) + a(i+1,j-1)) ) )

which are rotationally symmetric. Graphically the order of sums here looks like an overlain '.', 'x', and '+', each part of which is rotationally symmetric for sums.

Hopefully all this is clear, but I would also be more than happy to work with you further on implementing the rotationally symmetric version of the real sums for use in smooth_x9() if that would be helpful.

Hallberg-NOAA commented 11 months ago

Regarding your points 2. and 3., @iangrooms, the question of whether the extra repeated blocking halo updates are going to be expensive can be very machine and problem dependent, depending on the relative latency of communication and the speed of the calculations. Moving the halo updates outside of the smooth_x9() routine and then outside of the do k loop whenever possible should not change answers, and are almost certainly no-regrets changes

As for smoothing the output Ah and m_leathy arrays, the lowest hanging fix is to do their halo-updated together as essentially a single halo update outside of smooth_x9() using the complete=.false. and complete=.true. optional arguments to pass-var(), with a run-time argument to specify whether to do the smoothing and related halo updates at all. If the smoothing does turn out to be generally useful, we could consider a broader refactoring of horizonal_viscosity() to enable this halo update to occur outside of the k-loops.

As for the fix to biharmonic Leith, the other thing that we should explore is whether we could avoid the extra halo update by working on a wider stencil for some of the preliminary calculations.

I also notice when looking at the draft bug fix code that you pointed to that there are some instances with expressions that mix the case of the indices when assigning arrays, as in Kh(I,J) = Kh_h(i,j,k) or Kh(I,J) = Kh_h(i+1,j+1,k). Although Fortran is case-insensitive, in MOM6 we use the case of indices to discriminate between the various points on the staggered grid, as discussed at https://github.com/NOAA-GFDL/MOM6/wiki/Code-style-guide#soft-case-convention. This means that a thickness should be h(i,j,k) while the PV of a layer, which is staggered at the corners of the tracer grids would be q(I,J,k), and the velocity components are u(I,j,k) and v(i,J,k). If we stick to this convention, it makes it much easier to spot horizonal indexing bugs, like the one you appear to be correcting at lines 1470 and 1578 of your corrected code. It also means that expressions like do J=Jsq-2,Jeq+2, j=Jsq-1,Jeq+2 or do J=js-2,je+1 could be fine, but do J=Jsq-2,Jeq+1 would be very suspicious due to its apparent symmetry breaking of the halo points, and if I spotted that in the code (this occurs repeatedly in the draft fixes to the biharmonic Leith code, starting at line 452), I would be inclined to examine it closely to determine whether it was the code that is wrong or the case-sensitive indexing convention that was not being followed. Additionally, on a symmetric-memory grid with our indexing conventions, Jsq=js-1 and Jeq=je, so do J=Jsq-2,Jeq+1 would be equivalent to do J=js-3,je+1, which is decidedly lopsided. If we needed this do-loop to make the code work, I would be looking further up in the code for something else that is using the wrong indexing and loop ranges; I always find it better to work backwards through the code to determine the proper ranges of do-loops.

iangrooms commented 11 months ago

Thanks @Hallberg-NOAA for the quick and clear responses. Since the update to smooth_x9 is not the optimal solution to problem 1 I would be happy to work with you to make that update independent of other changes to the code (i.e. the halo updates &, bug fixes referenced in the rest of the comment thread).

Before diving into that, I wonder if MOM6 might want to have something like smooth_x9 as part of its general utilities: would it be worthwhile to re-write this and put it somewhere other than the horizontal viscosity module? I have thought of a few ways to change the subroutine to make it more general. For example one could have the option to pass in the desired 3x3 weight matrix rather than use the built-in one that it currently uses. Also, one could have an option to preserve area-integrated quantities by multiplying the weights by the cell areas. I suppose it could also have the option to either perform halo updates inside the subroutine or to assume that they have already been performed outside.

iangrooms commented 11 months ago

Regarding the halo updates, thanks for the suggestion to halo-update Ah and m_leithy outside the call to smooth_x9, at the same time, with the complete flags. It should be easy to do that, as well as to add the run-time option to smooth or not smooth. I will run some tests to see whether the code still works well without smoothing m_leithy and Ah before proposing any permanent changes.

I do think correct indexing conventions would have helped find the bug that I fixed on lines 1470 and 1578. I suppose the correct convention here would be Kh(I,J) = Kh_h(i,j,k) because I am moving values from h points in Kh_h to values at q points in Kh?

I find the do loop index ranges throughout MOM-hor_visc.F90 to be opaque. Since Leith+E almost entirely re-uses or re-computes smoothed versions of quantities used in biharmonic Leith, in my original implementation I just copied the loops from biharmonic Leith. Gustavo asked me to look at biharmonic Leith since it was not conserving and not reproducing across PE layouts, so I worked backwards through the code to find problems, ultimately tracking it down to the loop that computes dvdx and dudy.

(Leith+E) On lines 852 and 856 I changed the loop indices so that only the values that are actually used later in the code are computed. This leads to some asymmetric indexing.
(Biharmonic Leith) On lines 870 and 875 I extended the loops, but as I look back I don't think they need to be extended. I don't want to say for sure until I run the tests again with the original loops, but I think the new values that I am computing with the extended ranges are never accessed.

iangrooms commented 10 months ago

I have updated my fix_biharmonic_leith branch here. The updates only address the halos/blocking communications; I have not addressed the broken rotational symmetry. All code has been tested in the double-gyre case to make sure that it runs and reproduces across PE layouts. Here's a summary of the changes:

Leith+E requires smoothing u and v. The smoothing itself does not require a halo update, but smoothed values outside the computational domain need to be accessed, and this requires a halo update. The original code smoothed layer-by-layer. I have now moved this code outside the main loop over layers. Where the original code required 2 * nz blocking 2D halo updates, the new code requires one blocking 3D update.
Leith+E smoothes m_leithy and Ah layer-by-layer, which requires many blocking halo updates. In the discussion above it was suggested that maybe they could be updated together to reduce the number of blocking communications. It turns out this is not possible. The smoothed value of m_leithy is used to compute the value of Ah, which is then smoothed. I added a runtime argument SMOOTH_AH. When True, both m_leithy and Ah are smoothed, which can be costly because of the communication. When False neither m_leithy nor Ah are smoothed, so there is no communication required. I have not tested the un-smoothed versions beyond some short double-gyre runs. I also eliminated a halo update on Kh_h by re-indexing a couple of assignments (this halo updated is eliminated regardless of the value of SMOOTH_AH.)
Both biharmonic Leith and Leith+E require halo updates of dvdx and dudy because their computational stencils extend outside the valid range assumed by horizontal_viscosity. I used one of the suggestions above by making the first halo update use complete=.false. and the second one use complete=.true. While this doesn't eliminate communication, it does eliminate one blocking communication. Leith+E requires an additional halo update of dvdx_smooth and dudy_smooth and I used the same trick there.
I added some code to make Leith+E work with open boundary conditions. I have not tested this code.

With these changes, biharmonic Leith requires 2*nz 2D halo updates in every call to horizontal_viscosity, of which nz are blocking.

Leith+E requires one 3D blocking halo update on a vector, plus 4 * nz 2D halo updates, of which 2 * nz are blocking. If SMOOTH_AH=False then Leith+E requires no further communication. If SMOOTH_AH=True then Leith+E requires another 4 * nz blocking 2D halo updates on each call to horizontal_viscosity.

Hallberg-NOAA commented 10 months ago

@iangrooms, I have added a pull request (github.com/iangrooms/MOM6/pull/1) that should address the rotational symmetry issues with the LEITHY code. It also reduces the number of halo updates and it fixes some horizonal indexing bugs and issues with reproducibility across layouts (especially in non-symmetric-memory mode). For now I have retained the previous filtering code with a new run-time parameter to switch between the two options, however if no one was actively using that code yet in solutions that need to be preserved, we might consider eliminating the previous smoothing code (your call). Note, though, that I did fix some obvious bugs directly assigning thickness-point viscosity values to vorticity points without adding an option to retain the old bugs. We might need to revisit this if there are LEITHY solutions that we want to retain.

Hallberg-NOAA commented 10 months ago

Given that the previous code had unambiguous bugs when USE_LEITHY = True (assigning thickness point values directly to corner point arrays), I have revised my PR to avoid the complexity of retaining the previous answers, but this could still be revisited if there is a strong desire to keep the older (buggy) algorithms.

iangrooms commented 10 months ago

Thanks for your work and input on this! I'm hoping to get time to look at the PR in the next week. I agree that there is no point in setting up code to retain previous answers.

I am not convinced that assigning thickness point values directly to corner point arrays is an unambiguous bug (bearing in mind that I haven't looked at your code yet). If I recall correctly, I drew my inspiration from the original Leith code, which uses bilinear interpolation to map viscosity values from h points to q points. I could not use the exact same idea because it would have required a halo update, so instead I used 'nearest-neighbor' interpolation to map from h points to q points. Rather than mapping northeast (which would have had indices Ah(i,j) = Ah_h(i,j)), I think I mapped southwest, which produces indices Ah(i,j) = Ah_h(i+1,j+1). Although this does assign h point values at q points, it was done deliberately to avoid communication and is simply a low-order form of interpolation.

Hallberg-NOAA commented 10 months ago

In the revised code that I added for your PR, I was able to use the 4-point average (bilinear interpolation) to map the h-point viscosities onto the q-points without adding any more halo updates, but it required that we do the h-point viscosity calculation over a larger range of indices when we have USE_LEITHY = True. As you will see when you get a chance to look through the code changes, this led to a fairly extensive list of repetitious changes to index loops ranges, but in the end it should have no impact on performance when Leith+E is not used, and minimal performance impact when it is used.

I agree that a direct translation of the viscosity from one of the neighboring thickness points onto a vorticity point is a consistent 1st order accurate interpolation, but it does have the down-side in that by selecting a preferred direction (southwest) that would break the underlying rotational symmetry of the solution. Apart from the reduced accuracy, we are particularly vigilant about avoiding this because we can detect and correct all such symmetry breaking, and because it might run the risk of projecting land values out into the ocean on some coastlines but not others.

iangrooms commented 10 months ago

That makes sense; thanks for clarifying. And thanks again for your time on this.

iangrooms commented 5 months ago

I think these problems were addressed in #268. This issue could be closed now.

Hallberg-NOAA commented 5 months ago

Yes, I agree that this issue should now be closed.