Closed kostrzewa closed 6 years ago
The same setup works like a charm with SOALEN=4
A cross-check of this would be highly appreciated...
@martin-ueding you were wondering if there were something that you could do right now. A cross-check of this issue would be excellent! I would really prefer to run with SOALEN=8 on Marconi, but right now, I can't do that ...
Hi All, I have not had time to look at this. I tried running on NERSC, where I ran into some solver nonconvergence, however I think I got that also with regular BiCGStab so I expect it is actually physics in my case. I will investigate this in more depth later when I have the chance.
Best, B
On Aug 7, 2017, at 9:20 AM, Bartosz Kostrzewa notifications@github.com wrote:
@martin-ueding you were wondering if there were something that you could do right now. A cross-check of this issue would be excellent! I would really prefer to run with SOALEN=8 on Marconi, but right now, I can't do that ...
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
@kostrzewa: What is the problem? SOALEN = 8
does not compile? Marconi A2 is down? I guess you are running with tmLQCD, where can I find the input file and the configuration that you are testing?
@martin-ueding
@kostrzewa: What is the problem? SOALEN = 8 does not compile?
See the description, the mixed solver diverges for soalen = 8 (see r_norm_sq after the first outer iteration), even though the individual solvers converge just fine and give correct results. I suspect there is an issue with the iterative refinement (passing low precision update of 'x' and 'r' to high precision).
It would be important to test this outside of tmLQCD.
So you want a unit test for the Richardson solver in the CG mode on the QPhiX side that just does an inversion and checks it by multiplying back (or comparing to QDP or a different QPhiX solver)?
You wrote that it only fails in the strong scaling limit with that 16×8³ lattice size (bounded by DDalphaAMG I presume). In the output, there is
# QphiX: Local Lattice Size (xyzt) = 16 8 8 8
After checkerboarding, this will be 8⁴. And then VECLEN_inner=16, SOALEN_inner=8
, so this is indeed the smallest X-extent that should be possible, right? At first I thought that this might cause problems, that that is exactly what we have the vector folding for.
I'll look into writing a unit test for your branch then.
So you want a unit test for the Richardson solver in the CG mode on the QPhiX side that just does an inversion and checks it by multiplying back (or comparing to QDP or a different QPhiX solver)?
The unit test is probably a good idea, but it would be sufficient to run an inversion using Chroma with the Richardson solver (the BiCGstab version) on Marconi A2.
After checkerboarding, this will be 8⁴. And then VECLEN_inner=16, SOALEN_inner=8, so this is indeed the smallest X-extent that should be possible, right? At first I thought that this might cause problems, that that is exactly what we have the vector folding for.
Yes, it should work. I will also try again now that I've pushed so many commits. Perhaps there was some inconsistency somewhere when I tested this...
To summarize: I have tried Bartek's branch with some fixes for compilation on Marconi A2 (KNL). With Chroma set up for soalen=8
I ran a simple inversion (Chroma propagator) using the ITER_REFINE
solver with BICGSTAB
on a thermalized Wilson clover configuration with sea and valence quark masses equal. This did not converge:
RICHARDSON: Initial || r ||^2 = 2.67214178e-01 Target || r ||^2 = 2.67214178e-21
RICHARDSON: Solver NOT converged at iter 10000 || r || = 1.33743272e+00
RICHARDSON: Inner MV Apps=409933 Outer MV Apps=10001 Inner Site Flops=192688248 Outer Site Flops=960093
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 10000 iters, rsd_sq_final=1.78872626910592
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 1.33743271573037
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=569.290246963501 (sec) Performance=200.965667717499 GFLOPS
However, with soalen=4
, it works:
RICHARDSON: Initial || r ||^2 = 2.67214178e-01 Target || r ||^2 = 2.67214178e-21
RICHARDSON: Solver converged at iter 5 || r || = 9.09133473e-11
RICHARDSON: Inner MV Apps=377 Outer MV Apps=7 Inner Site Flops=178920 Outer Site Flops=669
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 5 iters, rsd_sq_final=8.26523672478744e-21
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 9.09133449338274e-11
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=7.01472187042236 (sec) Performance=14.9375031745472 GFLOPS
So the problem depends on the soalen
which means that it is some implementation detail.
Thanks for the cross-check! Since it is apparently independent of the inner solver (I was using CG with my MdagM addition to the Richardson solver), it seems likely that there's a bug somewhere in the up-conversion. As a final test, just to cross-check what I did, you could also try simple BiCGstab in single and double precision with soalen=8, just to confirm that the inner solver indeed works correctly.
The following BiCGStab configurations work:
Precision | SoA Length |
---|---|
double | 4 |
double | 8 |
single | 8 |
In the Richardson case, my results are the following on Marconi A2 (KNL) are:
Outer veclen | Outer soalen | Inner veclen | Inner soalen | Result |
---|---|---|---|---|
8 | 8 | 16 | 4 | Diverges |
8 | 4 | 16 | 4 | 9.09133473e-11 |
8 | 8 | 16 | 8 | Breakdown in iteration 1. rho = 0 |
I have looked into the source code of the Richardson solver and the conversion code. The index manipulations looks legit, the types seem to be passed correctly.
From the number it seems that the issue could be a mismatch between the SoA length in the inner and outer solver.
Hi All, It is great that you are doing these tests. Can you remind me of the local lattice volume?
(Assuming you are using outer=double, inner = single)
Outer Outer Inner Inner Veclen Soalen Veclen Soalen 8 4 16 8
?
Best, B
On Aug 10, 2017, at 7:34 AM, Martin Ueding notifications@github.com wrote:
The following BiCGStab configurations work:
Precision SoA Length double 4 double 8 single 8 In the Richardson case, my results are the following on Marconi A2 (KNL) are:
Outer veclen Outer soalen Inner veclen Inner soalen Result 8 8 16 4 Diverges 8 4 16 4 9.09133473e-11 8 8 16 8 In Investigation I have looked into the source code of the Richardson solver and the conversion code. The index manipulations looks legit, the types seem to be passed correctly.
From the number it seems that the issue could be a mismatch between the SoA length in the inner and outer solver.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
The volume here is a L=16 and T=32 lattice on a single MPI rank. So even after checkerboarding, a SoA length of 8 should be just fine.
I have updated the table above, with SoA length 8 for both, it does not work. So it seems that the SoA length has to be the same and it has to be 4. We could run more tests, like SoA length 8/16 for outer/inner solvers. Would that make sense?
What tests could be done within QPhiX to sort this out?
Thanks for the check, Martin. It seems pretty clear that we observe the same issue. (what do you mean, rho=0? The preconditioning mass?)
I think now one just needs to confirm if the up-/down-conversion in the Richardson iteration works correctly for the problematic case(s). This should be relatively easy to test, since one simply tests for equality up to the low precision (times some fudge factor).
If that's not the culprit, then there's perhaps something more nefarious going on...
The rho = 0
is some output from the solver. It has to be some variable in the multi-prec solver. I just copied that for reference.
I can try to write such a unit test in the next days.
Okay, I checked and rho
comes from BiCGstab. One of the two residuals calculated in the BiCGstab iteration is zero, which would suggest that the there are zero-norm inputs passed to the solver. Does this happen in the first outer iteration or the second?
This is what I get:
Starting solve
RICHARDSON: Initial || r ||^2 = 2.67214178e-01 Target || r ||^2 = 2.67214178e-21
BICGSTAB: Breakdown in iteration 1. rho = 0
BICGSTAB: Breakdown in iteration 1. rho = 0
The last line is repeated until it runs into the time limit.
The source of this apparently is the following line:
include/qphix/invbicgstab.h:142: masterPrintf("BICGSTAB: Breakdown in iteration %d. rho = 0\n", k);
The rho_c
is the inner product of r_0
and r
computed with the innerProduct
function. So one of the vectors is zero.
There is a verbosity flag that one can turn on, I have just done this and restarted the job that produced this line. The additional output should be helpful:
RICHARDSON: iter=120 Delta = 1.00000000e-02 Delta Finish= 1.48489145e-48 Delta_use= 1.00000000e-02
BICGSTAB: ||b||^2 = 1.812891e-01 Target Rsd= 1.812891e-05
BICGSTAB: iter 1 r_norm = 1.034537e-02 target = 1.812891e-05
BICGSTAB: iter 2 r_norm = 8.359862e-02 target = 1.812891e-05
BICGSTAB: iter 3 r_norm = 3.579327e-03 target = 1.812891e-05
BICGSTAB: iter 4 r_norm = 3.292702e-02 target = 1.812891e-05
BICGSTAB: iter 5 r_norm = 1.155886e-01 target = 1.812891e-05
BICGSTAB: iter 6 r_norm = 1.879570e-01 target = 1.812891e-05
BICGSTAB: iter 7 r_norm = 3.998165e-02 target = 1.812891e-05
BICGSTAB: iter 8 r_norm = 3.424387e-02 target = 1.812891e-05
BICGSTAB: iter 9 r_norm = 1.201591e-02 target = 1.812891e-05
BICGSTAB: iter 10 r_norm = 6.793151e-03 target = 1.812891e-05
BICGSTAB: iter 11 r_norm = 7.557945e-03 target = 1.812891e-05
BICGSTAB: iter 12 r_norm = 6.183899e-03 target = 1.812891e-05
BICGSTAB: iter 13 r_norm = 5.557747e-03 target = 1.812891e-05
BICGSTAB: iter 14 r_norm = 4.977096e-02 target = 1.812891e-05
BICGSTAB: iter 15 r_norm = 6.504697e-02 target = 1.812891e-05
BICGSTAB: iter 16 r_norm = 1.235174e-02 target = 1.812891e-05
BICGSTAB: iter 17 r_norm = 3.707913e-03 target = 1.812891e-05
BICGSTAB: iter 18 r_norm = 2.026293e-03 target = 1.812891e-05
BICGSTAB: iter 19 r_norm = 1.193662e-03 target = 1.812891e-05
BICGSTAB: iter 20 r_norm = 1.218229e-03 target = 1.812891e-05
BICGSTAB: iter 21 r_norm = 3.733477e-04 target = 1.812891e-05
BICGSTAB: iter 22 r_norm = 2.647406e-04 target = 1.812891e-05
BICGSTAB: iter 23 r_norm = 2.384216e-04 target = 1.812891e-05
BICGSTAB: iter 24 r_norm = 3.210701e-04 target = 1.812891e-05
BICGSTAB: iter 25 r_norm = 1.881972e-04 target = 1.812891e-05
BICGSTAB: iter 26 r_norm = 1.179646e-04 target = 1.812891e-05
BICGSTAB: iter 27 r_norm = 7.464495e-05 target = 1.812891e-05
BICGSTAB: iter 28 r_norm = 6.361845e-05 target = 1.812891e-05
BICGSTAB: iter 29 r_norm = 5.130439e-05 target = 1.812891e-05
BICGSTAB: iter 30 r_norm = 4.679175e-05 target = 1.812891e-05
BICGSTAB: iter 31 r_norm = 3.777697e-05 target = 1.812891e-05
BICGSTAB: iter 32 r_norm = 3.102388e-05 target = 1.812891e-05
BICGSTAB: iter 33 r_norm = 2.507007e-05 target = 1.812891e-05
BICGSTAB: iter 34 r_norm = 2.522580e-05 target = 1.812891e-05
BICGSTAB: iter 35 r_norm = 2.673134e-05 target = 1.812891e-05
BICGSTAB: iter 36 r_norm = 2.607818e-05 target = 1.812891e-05
BICGSTAB: iter 37 r_norm = 1.293327e-04 target = 1.812891e-05
BICGSTAB: iter 38 r_norm = 2.408174e-04 target = 1.812891e-05
BICGSTAB: iter 39 r_norm = 1.883567e-04 target = 1.812891e-05
BICGSTAB: iter 40 r_norm = 3.284182e-05 target = 1.812891e-05
BICGSTAB: iter 41 r_norm = 6.708810e-06 target = 1.812891e-05
RICHARDSON: Iter 120 r_norm_sq=5.724560e+75 Target=2.672142e-21
RICHARDSON: iter=121 Delta = 1.00000000e-02 Delta Finish= 6.83217046e-49 Delta_use= 1.00000000e-02
BICGSTAB: ||b||^2 = 1.814572e-01 Target Rsd= 1.814572e-05
BICGSTAB: iter 1 r_norm = 1.035304e-02 target = 1.814572e-05
BICGSTAB: iter 2 r_norm = 8.370212e-02 target = 1.814572e-05
BICGSTAB: iter 3 r_norm = 3.580684e-03 target = 1.814572e-05
BICGSTAB: iter 4 r_norm = 3.290278e-02 target = 1.814572e-05
BICGSTAB: iter 5 r_norm = 1.154639e-01 target = 1.814572e-05
BICGSTAB: iter 6 r_norm = 1.865597e-01 target = 1.814572e-05
BICGSTAB: iter 7 r_norm = 3.978468e-02 target = 1.814572e-05
BICGSTAB: iter 8 r_norm = 3.364065e-02 target = 1.814572e-05
BICGSTAB: iter 9 r_norm = 1.200913e-02 target = 1.814572e-05
BICGSTAB: iter 10 r_norm = 6.669762e-03 target = 1.814572e-05
BICGSTAB: iter 11 r_norm = 7.425735e-03 target = 1.814572e-05
BICGSTAB: iter 12 r_norm = 6.067449e-03 target = 1.814572e-05
BICGSTAB: iter 13 r_norm = 5.632251e-03 target = 1.814572e-05
BICGSTAB: iter 14 r_norm = 5.506331e-02 target = 1.814572e-05
BICGSTAB: iter 15 r_norm = 8.002124e-02 target = 1.814572e-05
BICGSTAB: iter 16 r_norm = 1.425816e-02 target = 1.814572e-05
BICGSTAB: iter 17 r_norm = 4.207211e-03 target = 1.814572e-05
BICGSTAB: iter 18 r_norm = 2.547295e-03 target = 1.814572e-05
BICGSTAB: iter 19 r_norm = 1.525623e-03 target = 1.814572e-05
BICGSTAB: iter 20 r_norm = 1.640748e-03 target = 1.814572e-05
BICGSTAB: iter 21 r_norm = 3.953916e-04 target = 1.814572e-05
BICGSTAB: iter 22 r_norm = 2.853951e-04 target = 1.814572e-05
BICGSTAB: iter 23 r_norm = 2.437643e-04 target = 1.814572e-05
BICGSTAB: iter 24 r_norm = 3.388092e-04 target = 1.814572e-05
BICGSTAB: iter 25 r_norm = 1.831258e-04 target = 1.814572e-05
BICGSTAB: iter 26 r_norm = 1.150974e-04 target = 1.814572e-05
BICGSTAB: iter 27 r_norm = 8.481940e-05 target = 1.814572e-05
BICGSTAB: iter 28 r_norm = 6.993216e-05 target = 1.814572e-05
BICGSTAB: iter 29 r_norm = 5.573235e-05 target = 1.814572e-05
BICGSTAB: iter 30 r_norm = 4.996171e-05 target = 1.814572e-05
BICGSTAB: iter 31 r_norm = 3.904734e-05 target = 1.814572e-05
BICGSTAB: iter 32 r_norm = 3.209256e-05 target = 1.814572e-05
BICGSTAB: iter 33 r_norm = 2.457926e-05 target = 1.814572e-05
BICGSTAB: iter 34 r_norm = 2.454816e-05 target = 1.814572e-05
BICGSTAB: iter 35 r_norm = 2.027693e-05 target = 1.814572e-05
BICGSTAB: iter 36 r_norm = 3.923647e-04 target = 1.814572e-05
BICGSTAB: iter 37 r_norm = 4.092540e-05 target = 1.814572e-05
BICGSTAB: iter 38 r_norm = 6.020191e-05 target = 1.814572e-05
BICGSTAB: iter 39 r_norm = 2.355953e-04 target = 1.814572e-05
BICGSTAB: iter 40 r_norm = 1.019154e-05 target = 1.814572e-05
RICHARDSON: Iter 121 r_norm_sq=2.707322e+76 Target=2.672142e-21
RICHARDSON: iter=122 Delta = 1.00000000e-02 Delta Finish= 3.14166461e-49 Delta_use= 1.00000000e-02
BICGSTAB: ||b||^2 = 0.000000e+00 Target Rsd= 0.000000e+00
BICGSTAB: Breakdown in iteration 1. rho = 0
RICHARDSON: Iter 122 r_norm_sq=2.707322e+76 Target=2.672142e-21
RICHARDSON: iter=123 Delta = 1.00000000e-02 Delta Finish= 3.14166461e-49 Delta_use= 1.00000000e-02
BICGSTAB: ||b||^2 = 0.000000e+00 Target Rsd= 0.000000e+00
BICGSTAB: Breakdown in iteration 1. rho = 0
RICHARDSON: Iter 123 r_norm_sq=2.707322e+76 Target=2.672142e-21
RICHARDSON: iter=124 Delta = 1.00000000e-02 Delta Finish= 3.14166461e-49 Delta_use= 1.00000000e-02
BICGSTAB: ||b||^2 = 0.000000e+00 Target Rsd= 0.000000e+00
BICGSTAB: Breakdown in iteration 1. rho = 0
The complete output is large, here is the full thing: richardson.o331444.tar.gz
It seems to me that the inner solver needs 40-ish iterations and that the outer solver does 121 iterations without a problem. Then on the 122nd outer iteration the inner solver directly has a problem. So it does not happen right away but it seems that the residual is diverging.
If you look at the residual squared in terms of the outer iteration number, one sees a linear trend in a half-log plot, meaning an exponential growth:
So this looks like a numerical instability where errors accumulate in an exponential fashion. But only for this particular soalen configuration.
I interpret this as something being just slightly wrong, and I lack the intuition to know the most promising starting point would be.
Okay, so it diverges, just like in my case. Good to run with maximum verbosity when testing things ;D
So this looks like a numerical instability where errors accumulate in an exponential fashion. But only for this particular soalen configuration.
No, it's not a numerical instability, otherwise the solver would always diverge. It is, with a high degree of probability, a technical problem related to the up- and down-conversion between inner and outer precisions. We can guess this because BiCGstab (and CG, for that matter) itself converge fine in either precision, as we've both tested, whatever SOALEN is used.
You beat me to it, so this is somewhat redundant.
Just to summarize: We have established that the individual CG and BiCGStab solvers work with all soalen settings. We know that the Richardson solvers works with both CG and BiCGStab with soalen=4. So the problem is something inherent to the Richardson solver that depends on the soalen.
I have rewritten the convert
function to make it more readable:
template <typename FTOut,
int VOut,
int SOut,
bool CompressOut,
typename FTIn,
int VIn,
int SIn,
bool CompressIn>
void convert(
typename Geometry<FTOut, VOut, SOut, CompressOut>::FourSpinorBlock *spinor_out,
double scale_factor,
const typename Geometry<FTIn, VIn, SIn, CompressIn>::FourSpinorBlock *spinor_in,
const Geometry<FTOut, VOut, SOut, CompressOut> &geom_out,
const Geometry<FTIn, VIn, SIn, CompressIn> &geom_in,
int n_blas_threads)
{
typedef typename ArithType<FTOut>::Type AT_out;
// Get the subgrid latt size.
int Nt = geom_out.Nt();
int Nz = geom_out.Nz();
int Ny = geom_out.Ny();
int Nxh = geom_out.Nxh();
int nvecs_out = geom_out.nVecs();
int Pxy_out = geom_out.getPxy();
int Pxyz_out = geom_out.getPxyz();
int nvecs_in = geom_in.nVecs();
int Pxy_in = geom_in.getPxy();
int Pxyz_in = geom_in.getPxyz();
#pragma omp parallel for collapse(4)
for (int t = 0; t < Nt; t++) {
for (int z = 0; z < Nz; z++) {
for (int y = 0; y < Ny; y++) {
for (int s = 0; s < nvecs_out; s++) {
for (int col = 0; col < 3; col++) {
for (int spin = 0; spin < 4; spin++) {
for (int x = 0; x < SOut; x++) {
// Index of the soavector from the output.
int const ind_out = t * Pxyz_out + z * Pxy_out + y * nvecs_out + s;
// Actual X coordinate is given by the number of elements in each
// soavector `SOut` and the position within the soavector `x`.
int const x_coord = s * SOut + x;
// The soavector on the current line of X elements is computed from the
// actual X coordinate `x_coord` and the input soalen `SIn`.
int const s_in = x_coord / SIn;
// The position within the input soavector is given by subtracting the X
// coordinate of the head of the input soavector from the actual X
// coordinate.
int const x_in = x_coord - SIn * s_in;
// The index of the input soavector is computed analogous to the output
// soavector.
int const ind_in = t * Pxyz_in + z * Pxy_in + y * nvecs_in + s_in;
// Iterate over real and imaginary part.
for (int const reim : {0, 1}) {
// Extract source and target from the data structures.
auto const source = spinor_in[ind_in][col][spin][reim][x_in];
auto &target = spinor_out[ind_out][col][spin][reim][x];
// Convert the input numbers into the arithmetic type.
auto const scale_factor_rep = rep<AT_out, double>(scale_factor);
auto const source_rep = rep<AT_out, FTIn>(source);
// Perform the multiplication within the arithmetic types.
auto const result_rep = scale_factor_rep * source_rep;
// Convert the result back into the storage type.
auto const result = rep<FTOut, AT_out>(result_rep);
// Assign the result into the target array element.
target = result;
}
}
}
}
}
}
}
}
}
I do not think that there is some error in the index manipulations, it seems reasonable. The next step probably is a simple unit test of the convert
function such that we can really exclude this as the source of errors?
The other possible issue could be that the scale factor is computed and/or passed incorrectly for some reason. (this is a re-scaling by the current residual which is meant to prevent low-precision underruns)
@bjoo Did you have a chance to also look into this?
The error message with rho = 0
comes from the BiCGStab itself, not from the Richardson solver. This is where it comes from:
// Check norm of r0
norm2Spinor<FT, V, S, compress12, num_flav>(r_norm, r0, geom, norm2Threads);
if (r_norm < rsd_sq) {
masterPrintf("BICGSTAB converged at iter 0: ||r||=%16.8e target=%16.8e\n",
sqrt(r_norm),
sqrt(rsd_sq));
rsd_sq_final = r_norm;
n_iters = 0;
return;
}
// r = r0
copySpinor<FT, V, S, compress12, num_flav>(r, r0, geom, copyThreads);
// Now intitialize p and v to zero.
zeroSpinor<FT, V, S, compress12, num_flav>(p, geom, zeroThreads);
zeroSpinor<FT, V, S, compress12, num_flav>(v, geom, zeroThreads);
// rho_prev = 1
double rho_prev_c[2] = {1.0, 0.0};
// alpha = 1
double alpha_c[2] = {1.0, 0.0};
// omega = 1
double omega_c[2] = {1.0, 0.0};
double rho_c[2] = {0.0, 0.0};
bool notConvP = true;
int k = 1;
// Now iterate
for (k = 1; (k <= MaxIters) && notConvP; k++) {
// rho_{k+1} = < r_0 | r >
innerProduct<FT, V, S, compress12, num_flav>(
rho_c, r0, r, geom, innerProductThreads);
site_flops += 8 * 12 * num_flav;
if ((rho_c[0] == 0.0) && (rho_c[1] == 0.0)) {
masterPrintf("BICGSTAB: Breakdown in iteration %d. rho = 0\n", k);
rsd_sq_final = r_norm;
n_iters = k;
return;
}
We have breakdown in iteration 1
, so that is the very first iteration of that for
loop. This means that the setup code above has just run. And there is a copy from r0
to r
. When we take the scalar product within the for
loop, we get the norm squared of that vector. This can only be exactly zero if those spinors were exactly zero. A == 0.0
comparison is very brittle, so it usually only works when we have exact 0.0
in all the elements.
But the norm of r0
is computed right before it is copied into r
. And if r0
indeed had a norm squared of exactly 0.0
, the BiCGStab would have stopped right there, because a vanishing residual indicates convergence.
This does not make any sense, right? We should never see the breakdown in iteration 1, rho = 0
error message, we should either see convergence a few lines of code above or it should do a proper iteration. It should be impossible to have a breakdown in iteration 1.
So this means that perhaps the innerProduct
or the norm2Spinor
function does something wrong? But I did check that, so it cannot be just that.
We have a problem that manifests itself in the BiCGStab, but only when used in the Richardson solver. But also when one uses the CG, similar things happen. So it must be inherent to the Richardson solver. But there also is seemingly impossible behavior in the BiCGStab.
What are we missing here?
In principle this can arise if both the RHS and the initial Guess are zero? Does the breakdown occur with random initial guess?
Best, B
On Nov 6, 2017, at 2:08 PM, Martin Ueding notifications@github.com wrote:
The error message with rho = 0 comes from the BiCGStab itself, not from the Richardson solver. This is where it comes from:
// Check norm of r0 norm2Spinor<FT, V, S, compress12, num_flav>(r_norm, r0, geom, norm2Threads); if (r_norm < rsd_sq) { masterPrintf("BICGSTAB converged at iter 0: ||r||=%16.8e target=%16.8e\n", sqrt(r_norm), sqrt(rsd_sq)); rsd_sq_final = r_norm; n_iters = 0; return; } // r = r0 copySpinor<FT, V, S, compress12, num_flav>(r, r0, geom, copyThreads); // Now intitialize p and v to zero. zeroSpinor<FT, V, S, compress12, num_flav>(p, geom, zeroThreads); zeroSpinor<FT, V, S, compress12, num_flav>(v, geom, zeroThreads); // rho_prev = 1 double rho_prev_c[2] = {1.0, 0.0}; // alpha = 1 double alpha_c[2] = {1.0, 0.0}; // omega = 1 double omega_c[2] = {1.0, 0.0}; double rho_c[2] = {0.0, 0.0}; bool notConvP = true; int k = 1; // Now iterate for (k = 1; (k <= MaxIters) && notConvP; k++) { // rho_{k+1} = < r_0 | r > innerProduct<FT, V, S, compress12, num_flav>( rho_c, r0, r, geom, innerProductThreads); site_flops += 8 * 12 * num_flav; if ((rho_c[0] == 0.0) && (rho_c[1] == 0.0)) { masterPrintf("BICGSTAB: Breakdown in iteration %d. rho = 0\n", k); rsd_sq_final = r_norm; n_iters = k; return; }
We have breakdown in iteration 1, so that is the very first iteration of that for loop. This means that the setup code above has just run. And there is a copy from r0 to r. When we take the scalar product within the for loop, we get the norm squared of that vector. This can only be exactly zero if those spinors were exactly zero. A == 0.0 comparison is very brittle, so it usually only works when we have exact 0.0 in all the elements.
But the norm of r0 is computed right before it is copied into r. And if r0 indeed had a norm squared of exactly 0.0, the BiCGStab would have stopped right there, because a vanishing residual indicates convergence.
This does not make any sense, right? We should never see the breakdown in iteration 1, rho = 0 error message, we should either see convergence a few lines of code above or it should do a proper iteration. It should be impossible to have a breakdown in iteration 1.
So this means that perhaps the innerProduct or the norm2Spinor function does something wrong? But I did check that, so it cannot be just that.
We have a problem that manifests itself in the BiCGStab, but only when used in the Richardson solver. But also when one uses the CG, similar things happen. So it must be inherent to the Richardson solver. But there also is seemingly impossible behavior in the BiCGStab.
What are we missing here?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
If the right-hand-side and the initial guess are zero, then r0
will hold a zero value. But then in the next step, the square norm of r0
is computed, and if that is below the threshold, the solver has converged. This is not happening, therefore r0
does not have zero norm, and either initial guess or right-hand-side are non-zero.
HI Martin, I’ve just noticed you say that this is in the ‘strong scaling limit’ What are the lattice sizes (local)? Whatis the SOALEN? Best, B
On Nov 6, 2017, at 2:14 PM, Martin Ueding notifications@github.com wrote:
If the right-hand-side and the initial guess are zero, then r0 will hold a zero value. But then in the next step, the square norm of r0 is computed, and if that is below the threshold, the solver has converged. This is not happening, therefore r0 does not have zero norm, and either initial guess or right-hand-side are non-zero.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Our strong scaling limit is imposed by DDalphaAMG and that limits us to 16×8³ of a local lattice.
And see this older comment about the SOALEN that we are using. This problem seem to occur in various manifestations. I am currently looking at the soalen = 8
case with double/single because there the error message is somewhat clear to locate.
From the older comment, it seems that the following lead to a diverging problem:
I have just run this again with the brand new version of QPhiX, and it did converge! Now I will compile again with soalen-8-8 and see whether the problem has vanished there. If so, that would be marvelous. Bartek would have to cross-check, but it might be that this bug has vanished by itself for some reason.
This is the output of Chroma, running with one MPI process and 256 threads on a single KNL:
Veclen is 8
Soalen is 8
Inner Veclen is 16
Inner Soalen is 8
...
Starting solve
RICHARDSON: RelativeResiduum requested
RICHARDSON: Initial || r ||^2 = 2.67214178e-01 Target || r ||^2 = 2.67214178e-21
RICHARDSON: iter=1 Delta = 1.00000000e-02 Delta Finish= 1.00000000e-10 Delta_use= 1.00000000e-02
BICGSTAB: Relative Residuum requested
BICGSTAB: Target Rsd= 1.000000e-04
BICGSTAB: iter 1 r_norm = 6.512983e-02 target = 1.000000e-04
BICGSTAB: iter 2 r_norm = 1.488459e+00 target = 1.000000e-04
BICGSTAB: iter 3 r_norm = 1.869288e-02 target = 1.000000e-04
BICGSTAB: iter 4 r_norm = 9.046827e-02 target = 1.000000e-04
BICGSTAB: iter 5 r_norm = 6.530680e-03 target = 1.000000e-04
BICGSTAB: iter 6 r_norm = 8.442263e-03 target = 1.000000e-04
BICGSTAB: iter 7 r_norm = 2.102320e-03 target = 1.000000e-04
BICGSTAB: iter 8 r_norm = 1.997315e-03 target = 1.000000e-04
BICGSTAB: iter 9 r_norm = 7.731876e-04 target = 1.000000e-04
BICGSTAB: iter 10 r_norm = 6.661457e-04 target = 1.000000e-04
BICGSTAB: iter 11 r_norm = 4.186846e-04 target = 1.000000e-04
BICGSTAB: iter 12 r_norm = 8.482786e-04 target = 1.000000e-04
BICGSTAB: iter 13 r_norm = 7.752908e-04 target = 1.000000e-04
BICGSTAB: iter 14 r_norm = 5.143060e-04 target = 1.000000e-04
BICGSTAB: iter 15 r_norm = 1.235128e-04 target = 1.000000e-04
BICGSTAB: iter 16 r_norm = 1.199492e-04 target = 1.000000e-04
BICGSTAB: iter 17 r_norm = 1.030780e-04 target = 1.000000e-04
BICGSTAB: iter 18 r_norm = 1.195428e-04 target = 1.000000e-04
BICGSTAB: iter 19 r_norm = 9.955906e-05 target = 1.000000e-04
RICHARDSON: Iter 1 r_norm_sq=2.660359e-05 Target=2.672142e-21
RICHARDSON: iter=2 Delta = 1.00000000e-02 Delta Finish= 1.00221202e-08 Delta_use= 1.00000000e-02
BICGSTAB: Relative Residuum requested
BICGSTAB: Target Rsd= 9.999999e-05
BICGSTAB: iter 1 r_norm = 3.618901e+00 target = 9.999999e-05
BICGSTAB: iter 2 r_norm = 3.003181e+00 target = 9.999999e-05
BICGSTAB: iter 3 r_norm = 7.625246e-01 target = 9.999999e-05
BICGSTAB: iter 4 r_norm = 5.732518e-01 target = 9.999999e-05
BICGSTAB: iter 5 r_norm = 4.181607e-01 target = 9.999999e-05
BICGSTAB: iter 6 r_norm = 2.707410e-01 target = 9.999999e-05
...
BICGSTAB: iter 36 r_norm = 1.151582e-03 target = 1.960039e-04
BICGSTAB: iter 37 r_norm = 2.372841e-04 target = 1.960039e-04
BICGSTAB: iter 38 r_norm = 3.495893e-04 target = 1.960039e-04
BICGSTAB: iter 39 r_norm = 2.635901e-04 target = 1.960039e-04
BICGSTAB: iter 40 r_norm = 2.092025e-04 target = 1.960039e-04
BICGSTAB: iter 41 r_norm = 1.865860e-04 target = 1.960039e-04
RICHARDSON: Iter 5 r_norm_sq=2.335266e-21 Target=2.453137e-21
RICHARDSON: Solver converged at iter 5 || r || = 9.75679787e-11
RICHARDSON: Inner MV Apps=399 Outer MV Apps=7 Inner Site Flops=189480 Outer Site Flops=669
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 5 iters, rsd_sq_final=9.51951046610738e-21
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 9.75679775308125e-11
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=0.551431894302368 (sec) Performance=200.937513308294 GFLOPS
So it's fixed?
@kostrzewa, what more checks do you want in order to believe that this issue is resolved?
HI All, I didn’t do anything to it… I don’t think it ought to have been badly broken.
Best, B
On Nov 6, 2017, at 2:34 PM, Martin Ueding notifications@github.com wrote:
This is the output of Chroma, running with one MPI process and 256 threads on a single KNL:
Veclen is 8 Soalen is 8 Inner Veclen is 16 Inner Soalen is 8 ... Starting solve RICHARDSON: RelativeResiduum requested RICHARDSON: Initial || r ||^2 = 2.67214178e-01 Target || r ||^2 = 2.67214178e-21 RICHARDSON: iter=1 Delta = 1.00000000e-02 Delta Finish= 1.00000000e-10 Delta_use= 1.00000000e-02 BICGSTAB: Relative Residuum requested BICGSTAB: Target Rsd= 1.000000e-04 BICGSTAB: iter 1 r_norm = 6.512983e-02 target = 1.000000e-04 BICGSTAB: iter 2 r_norm = 1.488459e+00 target = 1.000000e-04 BICGSTAB: iter 3 r_norm = 1.869288e-02 target = 1.000000e-04 BICGSTAB: iter 4 r_norm = 9.046827e-02 target = 1.000000e-04 BICGSTAB: iter 5 r_norm = 6.530680e-03 target = 1.000000e-04 BICGSTAB: iter 6 r_norm = 8.442263e-03 target = 1.000000e-04 BICGSTAB: iter 7 r_norm = 2.102320e-03 target = 1.000000e-04 BICGSTAB: iter 8 r_norm = 1.997315e-03 target = 1.000000e-04 BICGSTAB: iter 9 r_norm = 7.731876e-04 target = 1.000000e-04 BICGSTAB: iter 10 r_norm = 6.661457e-04 target = 1.000000e-04 BICGSTAB: iter 11 r_norm = 4.186846e-04 target = 1.000000e-04 BICGSTAB: iter 12 r_norm = 8.482786e-04 target = 1.000000e-04 BICGSTAB: iter 13 r_norm = 7.752908e-04 target = 1.000000e-04 BICGSTAB: iter 14 r_norm = 5.143060e-04 target = 1.000000e-04 BICGSTAB: iter 15 r_norm = 1.235128e-04 target = 1.000000e-04 BICGSTAB: iter 16 r_norm = 1.199492e-04 target = 1.000000e-04 BICGSTAB: iter 17 r_norm = 1.030780e-04 target = 1.000000e-04 BICGSTAB: iter 18 r_norm = 1.195428e-04 target = 1.000000e-04 BICGSTAB: iter 19 r_norm = 9.955906e-05 target = 1.000000e-04 RICHARDSON: Iter 1 r_norm_sq=2.660359e-05 Target=2.672142e-21 RICHARDSON: iter=2 Delta = 1.00000000e-02 Delta Finish= 1.00221202e-08 Delta_use= 1.00000000e-02 BICGSTAB: Relative Residuum requested BICGSTAB: Target Rsd= 9.999999e-05 BICGSTAB: iter 1 r_norm = 3.618901e+00 target = 9.999999e-05 BICGSTAB: iter 2 r_norm = 3.003181e+00 target = 9.999999e-05 BICGSTAB: iter 3 r_norm = 7.625246e-01 target = 9.999999e-05 BICGSTAB: iter 4 r_norm = 5.732518e-01 target = 9.999999e-05 BICGSTAB: iter 5 r_norm = 4.181607e-01 target = 9.999999e-05 BICGSTAB: iter 6 r_norm = 2.707410e-01 target = 9.999999e-05 ... BICGSTAB: iter 36 r_norm = 1.151582e-03 target = 1.960039e-04 BICGSTAB: iter 37 r_norm = 2.372841e-04 target = 1.960039e-04 BICGSTAB: iter 38 r_norm = 3.495893e-04 target = 1.960039e-04 BICGSTAB: iter 39 r_norm = 2.635901e-04 target = 1.960039e-04 BICGSTAB: iter 40 r_norm = 2.092025e-04 target = 1.960039e-04 BICGSTAB: iter 41 r_norm = 1.865860e-04 target = 1.960039e-04 RICHARDSON: Iter 5 r_norm_sq=2.335266e-21 Target=2.453137e-21 RICHARDSON: Solver converged at iter 5 || r || = 9.75679787e-11 RICHARDSON: Inner MV Apps=399 Outer MV Apps=7 Inner Site Flops=189480 Outer Site Flops=669 QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 5 iters, rsd_sq_final=9.51951046610738e-21 QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 9.75679775308125e-11 QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=0.551431894302368 (sec) Performance=200.937513308294 GFLOPS
So it's fixed?
@kostrzewa, what more checks do you want in order to believe that this issue is resolved?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
We changed a bunch of things, so this could have happen from a lot of different commits or branches that we had. If it is fixed now, I would be very happy about that. But I would think that there are still some doubts that need to be resolved before I can mark this as fixed.
@martin-ueding in your final test, was your local lattice volume 16x8^3 ? I'll give a try with the current devel to see if it works for me too.
My tests were with the 24³×96 lattice from our production. However, iirc I was able to reproduce the error with that lattice as well. So the bug might not depend so strongly on the local lattice size. But please run your test, if it is gone there as well, we should be good to go.
At least this is a loud error, so this will not go by unnoticed.
I can confirm that this has magically fixed itself for the test case I presented above. Closing! Thanks @martin-ueding for all the tests!
I've just observed a rather curious failure of the Richardson solver (using CG as the underlying algorithm with my MMdag addition).
For the double-single solver In the strong scaling limit with a local lattice volume of 16x8^3 (xyzt), SOALEN=8, VECLEN_DP=8, VECLEN_SP=16 (on KNL), the solver diverges:
It gets worse from there.
The weird thing is: the single precision CG with the same settings work fine, as does the double precision CG. Is there a problem with up-conversion?!