Bug in LaplacePCR - Githubissues

ZedThree commented 3 years ago

In #2336 I've added an alternative to #2333: https://github.com/boutproject/BOUT-dev/blob/93454ecaded94667725d48c3cb9b991c23f239bc/src/invert/laplace/invert_laplace.cxx#L106-L126

Basically check that we can use PCR as the default Laplacian, otherwise fall back to cyclic. This essentially makes PCR the new default in most cases.

Unfortunately (or possibly fortunately?) this throws up a few issues with PCR in the following tests:

test-gyro, which is a regression test, comparing the inversion to a benchmark result:

Running Gyro-average inversion test
1 processors (nxpe = 1)....
  Checking variable pade1 ... Fail, maximum difference = 0.021458830684423447
  Checking variable pade2 ... Fail, maximum difference = 0.0043205489637330174
2 processors (nxpe = 1)....
  Checking variable pade1 ... Fail, maximum difference = 0.021458830684423447
  Checking variable pade2 ... Fail, maximum difference = 0.0043205489637330174
4 processors (nxpe = 2)....
  Checking variable pade1 ... Fail, maximum difference = 0.021458830684423447
  Checking variable pade2 ... Fail, maximum difference = 0.0043205489637330174

test-petsc-laplace-MAST-grid, which applies the forward operator and inverts the result, checking round-trip accuracy, tests 4 and 8 use the default solver (i.e. PCR), all others use PETSc (errors are the same across processor count and grid):

  Checking max_error1 ... Pass
  Checking max_error2 ... Pass
  Checking max_error3 ... Pass
  Checking max_error4 ... Fail, maximum error is = 0.9939465685503767
  Checking max_error5 ... Pass
  Checking max_error6 ... Pass
  Checking max_error7 ... Pass
  Checking max_error8 ... Fail, maximum error is = 1.0057517623023355

test-petsc-laplace, which is similar to the previous test, but the PETSc solvers use the default Laplacian as a preconditioner:
```
KSPConvergedReason is -5
BoutException occured in invert->solve(b1): petsc_laplace: inversion failed to converge.
```

Test 1: PETSc 2nd order Magnitude of maximum absolute error is -1 KSPConvergedReason is -5 BoutException occured in invert->solve(b1): petsc_laplace: inversion failed to converge.

Test 2: PETSc 4th order Magnitude of maximum absolute error is -1 KSPConvergedReason is -5 BoutException occured in invert->solve(b3): petsc_laplace: inversion failed to converge.

Test 3: with coefficients constant in z, PETSc 2nd order Magnitude of maximum absolute error is -1

Test 4: with coefficients constant in z, default solver Magnitude of maximum absolute error is 0.999817 KSPConvergedReason is -5 BoutException occured in invert->solve(b5): petsc_laplace: inversion failed to converge.

Test 5: different profiles, PETSc 2nd order Magnitude of maximum absolute error is -1 KSPConvergedReason is -5 BoutException occured in invert->solve(b6): Laplacian inversion failed to converge (probably)

Test 6: different profiles, PETSc 4th order Magnitude of maximum absolute error is -1 KSPConvergedReason is -5 BoutException occured in invert->solve(b7): petsc_laplace: inversion failed to converge.

Test 7: different profiles, with coefficients constant in z, PETSc 2nd order Magnitude of maximum absolute error is -1

Test 8: different profiles, with coefficients constant in z, default solver Magnitude of maximum absolute error is 1.00577



Given that `test-laplace` is checked with both `cyclic` and `pcr` there must be something else going on. The common factor seems to be the outer boundary flags:

- test-laplace checks only `INVERT_SET` in the outer boundary
- test-gyro uses `INVERT_BNDRY_ONE + INVERT_RHS`
- test-petsc_laplace uses `INVERT_AC_GRAD` and `INVERT_AC_GRAD | INVERT_DC_GRAD`
- test-petsc_laplace_MAST-grid uses `INVERT_AC_GRAD`

My suspicions are that `LaplacePCR ::eliminate_boundary_rows` only accounts for one boundary, when there could be up to `MXG` boundary rows that need eliminating, depending on the inversion flags.

JosephThomasParker commented 3 years ago

Thanks Peter! The purpose of eliminate boundary rows is just to break the linear dependence between the boundary rows and the interior rows, so it should only require the one boundary row. Though it does assume that the structure is

                            boundary. | interior
boundary ->               0  x  x  0  |. 0. 0. 0  
boundary ->               0. 0. x. x. |  0  0. 0
boundary ->.              0. 0. 0  x. |  x. 0. 0
interior ->               0. 0. 0. x. |. x. x. 0
interior ->.              0. 0. 0. 0. |  x. x. x

becomes

                            boundary. | interior
boundary ->               0  x  x  0  |. 0. 0. 0  
boundary ->               0. 0. x. x. |  0  0. 0
boundary ->.              0. 0. 0  x. |  x. 0. 0
interior ->               0. 0. 0. 0. |. Y. x. 0
interior ->.              0. 0. 0. 0. |  x. x. x

Are there cases where this won't be the structure?

There might also be places where this assumes there are two guard cells, so I'll check that too.

ZedThree commented 3 years ago

Ah I see! I was thinking that might be suspicious just because I was tracing through where the a/b/c from Laplacian::tridagMatrix are used. That deals with INVERT_AC_GRAD for both PCR and cyclic, so presumably the issue is how they are used.

So if it's not eliminate_boundary_rows, it must be somewhere else in cr_pcr_solver?

Is the issue that aa, bb, cc only contain one boundary/guard cell? For cyclic, a, b, c don't use any guard cells, but do include all the boundaries.

dschwoerer commented 3 years ago

Might it be related to the failure in #2207 ?

That is also failing if rebased to next ...

JosephThomasParker commented 3 years ago

Just to quickly note a few things: There is a problem in eliminate_boundary_rows: in the test-gyro case the wrong rows are eliminated (the second and penultimate internal rows, instead of the first and final internal rows). This is confusing as the indexing is determined by localmesh->xstart. Does the meaning of xstart change depending on the boundary conditions?

But in any case, fixing the indexing still gets the wrong results, with the error in \|Ax - b\| only really being significant in the few rows near the boundaries.

Still working on this...

JosephThomasParker commented 3 years ago

Ok, found the problem: the code requires the number of internal rows to be a power of 2, and checks this by doing:

  // Number of x points must be a power of 2
  if (!is_pow2(localmesh->GlobalNxNoBoundaries)) {
    throw BoutException("LaplacePCR error: GlobalNx must be a power of 2");
  }

where GlobalNxNoBoundaries = nx - 2*MXG; is a power of two in both test-laplace and test-gyro. However, in test-gyro, the flag INVERT_BNDRY_ONE is set, which means only one boundary cell is used, even though MXG=2. So in test-gyro we're actually doing the inversion on 68 - 2*1 = 66 points, not 68 - 2*2 = 64 points, so that's why the solver fails.

I'll think about whether it's possible to tweak the row elimination to get this to work. The "real" problem though is the inconsistency over how many guard cells we are using.

ZedThree commented 3 years ago

Is this a "fundamental" problem that PCF is incompatible with INVERT_BNDRY_ONE? Or it would work, if the number of interior points were different such that we'd still use a power of two?

If the latter, then we could adjust the precondition to:

const int global_nx = localmesh->GlobalNxNoBoundaries;
const int global_nx_maybe_with_boundary = using_invert_bndry_one ? global_nx + 1 : global_nx;

if (not is_pow2(global_nx_maybe_with_boundary)) {
  ...

(terrible names not withstanding). Then in #2336 we'd neatly fall back to cyclic if that weren't met.

If we can handle this in the row elimination and don't need to adjust the precondition, that's probably the better solution though?

johnomotani commented 3 years ago

I'd be tempted to suggest just deprecating/removing INVERT_BNDRY_ONE. I can't see when it would ever be useful, and I'm not sure why it's there at all - maybe @bendudson knows?

bendudson commented 3 years ago

Don't know why INVERT_BNDRY_ONE exists really. I suspect it was an attempt to deal with cases where the boundary to be applied isn't clear, such as with gyro-averaging: What boundary condition should be put on gyro-averaged density, to be consistent with density? By moving the boundary away from the domain, it may give that bit of additional freedom at the actual boundary. I'm not sure if it actually makes any difference though.

ZedThree commented 3 years ago

The plan is now to make solve throw if INVERT_AC_GRAD is set. This means we won't use PCR as the default solver.

boutproject / BOUT-dev

Bug in LaplacePCR #2350