Weird mixed solver failure in strong scaling limit on KNL (double-single)

kostrzewa commented 7 years ago

I've just observed a rather curious failure of the Richardson solver (using CG as the underlying algorithm with my MMdag addition).

For the double-single solver In the strong scaling limit with a local lattice volume of 16x8^3 (xyzt), SOALEN=8, VECLEN_DP=8, VECLEN_SP=16 (on KNL), the solver diverges:

# INITIALIZING QPHIX MIXED SOLVER
# USING DOUBLE-SINGLE PRECISION
# QphiX: VECLEN=8 SOALEN=8 VECLEN_inner=16, SOALEN_inner=8
# QphiX: Declared QMP Topology (xyzt): 1 2 2 4
# QphiX: Mapping of dimensions QMP -> tmLQCD (xyzt): 0->2 1->1 2->0 3->3
# QphiX: Global Lattice Size (xyzt) =  16 16 16 32
# QphiX: Local Lattice Size (xyzt) =  16 8 8 8
# QphiX: Block Sizes: By= 4 Bz=4
# QphiX: Cores = 4
# QphiX: SMT Grid: Sy=1 Sz=4
# QphiX: Pad Factors: PadXY=1 PadXYZ=0
# QphiX: Threads_per_core = 4
# QphiX: MinCt = 1
# QphiX: Using two-row gauge compression (compress12)
# QphiX: Inner solver using two-row gauge compression (compress12)
# Setting number of BLAS threads...
# ...done.
# QPHIX-interface: time spent in reorder_gauge_to_QPhiX: 1.514116 secs
# QPHIX-interface: time spent in reorder_gauge_to_QPhiX: 0.029403 secs
# Creating QPhiX Wilson Fermion Matrix...
Setting Up Blockinfo array: num_phases=1
Phase info set up
Precomputing offsets
WILL Use 12 compression
Setting Up Blockinfo array: num_phases=1
Phase info set up
Precomputing offsets
WILL Use 12 compression
# ...done.
# QPHIX: Creating mixed-precision CG solver...
Initializing CG Solver: Nvec=1 Ny=8 Nz=8 Nt=8
# ...done.
# QPHIX-interface: time spent in reorder_eo_spinor_to_QPhiX: 0.000430 secs
# Calling the solver...
RICHARDSON: Initial || r ||^2 =   3.88378240e-01   Target || r ||^2 =   3.88378240e-15
RICHARDSON: iter=1 Delta =   9.99999978e-03  Delta Finish=  1.00000000e-07 Delta_use=  9.99999978e-03
Entering the CG inverter with num_flav=1 and cb=1
Chi_sq=0.740768 RsdTarget=0.01
CG: r0 = 0.740768   target = 7.40768e-05 
CG: iter 0:  r2 = 0.106045   target = 7.40768e-05 
CG: iter 1:  r2 = 0.0353281   target = 7.40768e-05 
CG: iter 2:  r2 = 0.0164114   target = 7.40768e-05 
CG: iter 3:  r2 = 0.00880791   target = 7.40768e-05 
CG: iter 4:  r2 = 0.00550385   target = 7.40768e-05 
CG: iter 5:  r2 = 0.00364526   target = 7.40768e-05 
CG: iter 6:  r2 = 0.00255914   target = 7.40768e-05 
CG: iter 7:  r2 = 0.00188779   target = 7.40768e-05 
CG: iter 8:  r2 = 0.00145908   target = 7.40768e-05 
CG: iter 9:  r2 = 0.00112278   target = 7.40768e-05 
CG: iter 10:  r2 = 0.000908794   target = 7.40768e-05 
CG: iter 11:  r2 = 0.000769757   target = 7.40768e-05 
CG: iter 12:  r2 = 0.000666034   target = 7.40768e-05 
CG: iter 13:  r2 = 0.000581498   target = 7.40768e-05 
CG: iter 14:  r2 = 0.000509952   target = 7.40768e-05 
CG: iter 15:  r2 = 0.000453765   target = 7.40768e-05 
CG: iter 16:  r2 = 0.000402794   target = 7.40768e-05 
CG: iter 17:  r2 = 0.000362916   target = 7.40768e-05 
CG: iter 18:  r2 = 0.000324405   target = 7.40768e-05 
CG: iter 19:  r2 = 0.000287272   target = 7.40768e-05 
CG: iter 20:  r2 = 0.000254348   target = 7.40768e-05 
CG: iter 21:  r2 = 0.000227314   target = 7.40768e-05 
CG: iter 22:  r2 = 0.000202051   target = 7.40768e-05 
CG: iter 23:  r2 = 0.000179997   target = 7.40768e-05 
CG: iter 24:  r2 = 0.000160796   target = 7.40768e-05 
CG: iter 25:  r2 = 0.000145391   target = 7.40768e-05 
CG: iter 26:  r2 = 0.000133474   target = 7.40768e-05 
CG: iter 27:  r2 = 0.000124172   target = 7.40768e-05 
CG: iter 28:  r2 = 0.000116532   target = 7.40768e-05 
CG: iter 29:  r2 = 0.000109199   target = 7.40768e-05 
CG: iter 30:  r2 = 0.000101813   target = 7.40768e-05 
CG: iter 31:  r2 = 9.32433e-05   target = 7.40768e-05 
CG: iter 32:  r2 = 8.48534e-05   target = 7.40768e-05 
CG: iter 33:  r2 = 7.63717e-05   target = 7.40768e-05 
CG: iter 34:  r2 = 6.83745e-05   target = 7.40768e-05 
RICHARDSON: Iter 1   r_norm_sq=2.453367e+00   Target=3.883782e-15
RICHARDSON: iter=2 Delta =   9.99999978e-03  Delta Finish=  3.97874596e-08 Delta_use=  9.99999978e-03
Entering the CG inverter with num_flav=1 and cb=1

It gets worse from there.

The weird thing is: the single precision CG with the same settings work fine, as does the double precision CG. Is there a problem with up-conversion?!

kostrzewa commented 7 years ago

The same setup works like a charm with SOALEN=4

kostrzewa commented 7 years ago

A cross-check of this would be highly appreciated...

kostrzewa commented 7 years ago

@martin-ueding you were wondering if there were something that you could do right now. A cross-check of this issue would be excellent! I would really prefer to run with SOALEN=8 on Marconi, but right now, I can't do that ...

bjoo commented 7 years ago

Hi All, I have not had time to look at this. I tried running on NERSC, where I ran into some solver nonconvergence, however I think I got that also with regular BiCGStab so I expect it is actually physics in my case. I will investigate this in more depth later when I have the chance.

Best, B

On Aug 7, 2017, at 9:20 AM, Bartosz Kostrzewa notifications@github.com wrote:

@martin-ueding you were wondering if there were something that you could do right now. A cross-check of this issue would be excellent! I would really prefer to run with SOALEN=8 on Marconi, but right now, I can't do that ...

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

@kostrzewa: What is the problem? SOALEN = 8 does not compile? Marconi A2 is down? I guess you are running with tmLQCD, where can I find the input file and the configuration that you are testing?

kostrzewa commented 7 years ago

@martin-ueding

@kostrzewa: What is the problem? SOALEN = 8 does not compile?

See the description, the mixed solver diverges for soalen = 8 (see r_norm_sq after the first outer iteration), even though the individual solvers converge just fine and give correct results. I suspect there is an issue with the iterative refinement (passing low precision update of 'x' and 'r' to high precision).

It would be important to test this outside of tmLQCD.

martin-ueding commented 7 years ago

So you want a unit test for the Richardson solver in the CG mode on the QPhiX side that just does an inversion and checks it by multiplying back (or comparing to QDP or a different QPhiX solver)?

You wrote that it only fails in the strong scaling limit with that 16×8³ lattice size (bounded by DDalphaAMG I presume). In the output, there is

# QphiX: Local Lattice Size (xyzt) =  16 8 8 8

After checkerboarding, this will be 8⁴. And then VECLEN_inner=16, SOALEN_inner=8, so this is indeed the smallest X-extent that should be possible, right? At first I thought that this might cause problems, that that is exactly what we have the vector folding for.

I'll look into writing a unit test for your branch then.

kostrzewa commented 7 years ago

So you want a unit test for the Richardson solver in the CG mode on the QPhiX side that just does an inversion and checks it by multiplying back (or comparing to QDP or a different QPhiX solver)?

The unit test is probably a good idea, but it would be sufficient to run an inversion using Chroma with the Richardson solver (the BiCGstab version) on Marconi A2.

kostrzewa commented 7 years ago

After checkerboarding, this will be 8⁴. And then VECLEN_inner=16, SOALEN_inner=8, so this is indeed the smallest X-extent that should be possible, right? At first I thought that this might cause problems, that that is exactly what we have the vector folding for.

Yes, it should work. I will also try again now that I've pushed so many commits. Perhaps there was some inconsistency somewhere when I tested this...

martin-ueding commented 7 years ago

To summarize: I have tried Bartek's branch with some fixes for compilation on Marconi A2 (KNL). With Chroma set up for soalen=8 I ran a simple inversion (Chroma propagator) using the ITER_REFINE solver with BICGSTAB on a thermalized Wilson clover configuration with sea and valence quark masses equal. This did not converge:

RICHARDSON: Initial || r ||^2 =   2.67214178e-01   Target || r ||^2 =   2.67214178e-21
RICHARDSON: Solver NOT converged at iter 10000 || r || =   1.33743272e+00
RICHARDSON: Inner MV Apps=409933 Outer MV Apps=10001 Inner Site Flops=192688248 Outer Site Flops=960093
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 10000 iters,  rsd_sq_final=1.78872626910592
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 1.33743271573037
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=569.290246963501 (sec)  Performance=200.965667717499 GFLOPS

However, with soalen=4, it works:

RICHARDSON: Initial || r ||^2 =   2.67214178e-01   Target || r ||^2 =   2.67214178e-21
RICHARDSON: Solver converged at iter 5 || r || =   9.09133473e-11
RICHARDSON: Inner MV Apps=377 Outer MV Apps=7 Inner Site Flops=178920 Outer Site Flops=669
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 5 iters,  rsd_sq_final=8.26523672478744e-21
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 9.09133449338274e-11
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=7.01472187042236 (sec)  Performance=14.9375031745472 GFLOPS

So the problem depends on the soalen which means that it is some implementation detail.

kostrzewa commented 7 years ago

Thanks for the cross-check! Since it is apparently independent of the inner solver (I was using CG with my MdagM addition to the Richardson solver), it seems likely that there's a bug somewhere in the up-conversion. As a final test, just to cross-check what I did, you could also try simple BiCGstab in single and double precision with soalen=8, just to confirm that the inner solver indeed works correctly.

martin-ueding commented 7 years ago

The following BiCGStab configurations work:

Precision	SoA Length
double	4
double	8
single	8

In the Richardson case, my results are the following on Marconi A2 (KNL) are:

Outer veclen	Outer soalen	Inner veclen	Inner soalen	Result
8	8	16	4	Diverges
8	4	16	4	9.09133473e-11
8	8	16	8	Breakdown in iteration 1. rho = 0

I have looked into the source code of the Richardson solver and the conversion code. The index manipulations looks legit, the types seem to be passed correctly.

From the number it seems that the issue could be a mismatch between the SoA length in the inner and outer solver.

bjoo commented 7 years ago

Hi All, It is great that you are doing these tests. Can you remind me of the local lattice volume?

(Assuming you are using outer=double, inner = single)

Outer Outer Inner Inner Veclen Soalen Veclen Soalen 8 4 16 8

?

Best, B

On Aug 10, 2017, at 7:34 AM, Martin Ueding notifications@github.com wrote:

The following BiCGStab configurations work:

Precision SoA Length double 4 double 8 single 8 In the Richardson case, my results are the following on Marconi A2 (KNL) are:

Outer veclen Outer soalen Inner veclen Inner soalen Result 8 8 16 4 Diverges 8 4 16 4 9.09133473e-11 8 8 16 8 In Investigation I have looked into the source code of the Richardson solver and the conversion code. The index manipulations looks legit, the types seem to be passed correctly.

From the number it seems that the issue could be a mismatch between the SoA length in the inner and outer solver.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

The volume here is a L=16 and T=32 lattice on a single MPI rank. So even after checkerboarding, a SoA length of 8 should be just fine.

martin-ueding commented 7 years ago

I have updated the table above, with SoA length 8 for both, it does not work. So it seems that the SoA length has to be the same and it has to be 4. We could run more tests, like SoA length 8/16 for outer/inner solvers. Would that make sense?

What tests could be done within QPhiX to sort this out?

kostrzewa commented 7 years ago

Thanks for the check, Martin. It seems pretty clear that we observe the same issue. (what do you mean, rho=0? The preconditioning mass?)

I think now one just needs to confirm if the up-/down-conversion in the Richardson iteration works correctly for the problematic case(s). This should be relatively easy to test, since one simply tests for equality up to the low precision (times some fudge factor).

If that's not the culprit, then there's perhaps something more nefarious going on...

martin-ueding commented 7 years ago

The rho = 0 is some output from the solver. It has to be some variable in the multi-prec solver. I just copied that for reference.

I can try to write such a unit test in the next days.

kostrzewa commented 7 years ago

Okay, I checked and rho comes from BiCGstab. One of the two residuals calculated in the BiCGstab iteration is zero, which would suggest that the there are zero-norm inputs passed to the solver. Does this happen in the first outer iteration or the second?

martin-ueding commented 7 years ago

This is what I get:

Starting solve                                                                  
RICHARDSON: Initial || r ||^2 =   2.67214178e-01   Target || r ||^2 =   2.67214178e-21
BICGSTAB: Breakdown in iteration 1. rho = 0                                     
BICGSTAB: Breakdown in iteration 1. rho = 0

The last line is repeated until it runs into the time limit.

The source of this apparently is the following line:

include/qphix/invbicgstab.h:142:        masterPrintf("BICGSTAB: Breakdown in iteration %d. rho = 0\n", k);

The rho_c is the inner product of r_0 and r computed with the innerProduct function. So one of the vectors is zero.

There is a verbosity flag that one can turn on, I have just done this and restarted the job that produced this line. The additional output should be helpful:

RICHARDSON: iter=120 Delta =   1.00000000e-02  Delta Finish=  1.48489145e-48 Delta_use=  1.00000000e-02
BICGSTAB: ||b||^2 = 1.812891e-01 Target Rsd= 1.812891e-05
BICGSTAB: iter 1 r_norm = 1.034537e-02  target = 1.812891e-05 
BICGSTAB: iter 2 r_norm = 8.359862e-02  target = 1.812891e-05 
BICGSTAB: iter 3 r_norm = 3.579327e-03  target = 1.812891e-05 
BICGSTAB: iter 4 r_norm = 3.292702e-02  target = 1.812891e-05 
BICGSTAB: iter 5 r_norm = 1.155886e-01  target = 1.812891e-05 
BICGSTAB: iter 6 r_norm = 1.879570e-01  target = 1.812891e-05 
BICGSTAB: iter 7 r_norm = 3.998165e-02  target = 1.812891e-05 
BICGSTAB: iter 8 r_norm = 3.424387e-02  target = 1.812891e-05 
BICGSTAB: iter 9 r_norm = 1.201591e-02  target = 1.812891e-05 
BICGSTAB: iter 10 r_norm = 6.793151e-03  target = 1.812891e-05 
BICGSTAB: iter 11 r_norm = 7.557945e-03  target = 1.812891e-05 
BICGSTAB: iter 12 r_norm = 6.183899e-03  target = 1.812891e-05 
BICGSTAB: iter 13 r_norm = 5.557747e-03  target = 1.812891e-05 
BICGSTAB: iter 14 r_norm = 4.977096e-02  target = 1.812891e-05 
BICGSTAB: iter 15 r_norm = 6.504697e-02  target = 1.812891e-05 
BICGSTAB: iter 16 r_norm = 1.235174e-02  target = 1.812891e-05 
BICGSTAB: iter 17 r_norm = 3.707913e-03  target = 1.812891e-05 
BICGSTAB: iter 18 r_norm = 2.026293e-03  target = 1.812891e-05 
BICGSTAB: iter 19 r_norm = 1.193662e-03  target = 1.812891e-05 
BICGSTAB: iter 20 r_norm = 1.218229e-03  target = 1.812891e-05 
BICGSTAB: iter 21 r_norm = 3.733477e-04  target = 1.812891e-05 
BICGSTAB: iter 22 r_norm = 2.647406e-04  target = 1.812891e-05 
BICGSTAB: iter 23 r_norm = 2.384216e-04  target = 1.812891e-05 
BICGSTAB: iter 24 r_norm = 3.210701e-04  target = 1.812891e-05 
BICGSTAB: iter 25 r_norm = 1.881972e-04  target = 1.812891e-05 
BICGSTAB: iter 26 r_norm = 1.179646e-04  target = 1.812891e-05 
BICGSTAB: iter 27 r_norm = 7.464495e-05  target = 1.812891e-05 
BICGSTAB: iter 28 r_norm = 6.361845e-05  target = 1.812891e-05 
BICGSTAB: iter 29 r_norm = 5.130439e-05  target = 1.812891e-05 
BICGSTAB: iter 30 r_norm = 4.679175e-05  target = 1.812891e-05 
BICGSTAB: iter 31 r_norm = 3.777697e-05  target = 1.812891e-05 
BICGSTAB: iter 32 r_norm = 3.102388e-05  target = 1.812891e-05 
BICGSTAB: iter 33 r_norm = 2.507007e-05  target = 1.812891e-05 
BICGSTAB: iter 34 r_norm = 2.522580e-05  target = 1.812891e-05 
BICGSTAB: iter 35 r_norm = 2.673134e-05  target = 1.812891e-05 
BICGSTAB: iter 36 r_norm = 2.607818e-05  target = 1.812891e-05 
BICGSTAB: iter 37 r_norm = 1.293327e-04  target = 1.812891e-05 
BICGSTAB: iter 38 r_norm = 2.408174e-04  target = 1.812891e-05 
BICGSTAB: iter 39 r_norm = 1.883567e-04  target = 1.812891e-05 
BICGSTAB: iter 40 r_norm = 3.284182e-05  target = 1.812891e-05 
BICGSTAB: iter 41 r_norm = 6.708810e-06  target = 1.812891e-05 
RICHARDSON: Iter 120   r_norm_sq=5.724560e+75   Target=2.672142e-21
RICHARDSON: iter=121 Delta =   1.00000000e-02  Delta Finish=  6.83217046e-49 Delta_use=  1.00000000e-02
BICGSTAB: ||b||^2 = 1.814572e-01 Target Rsd= 1.814572e-05
BICGSTAB: iter 1 r_norm = 1.035304e-02  target = 1.814572e-05 
BICGSTAB: iter 2 r_norm = 8.370212e-02  target = 1.814572e-05 
BICGSTAB: iter 3 r_norm = 3.580684e-03  target = 1.814572e-05 
BICGSTAB: iter 4 r_norm = 3.290278e-02  target = 1.814572e-05 
BICGSTAB: iter 5 r_norm = 1.154639e-01  target = 1.814572e-05 
BICGSTAB: iter 6 r_norm = 1.865597e-01  target = 1.814572e-05 
BICGSTAB: iter 7 r_norm = 3.978468e-02  target = 1.814572e-05 
BICGSTAB: iter 8 r_norm = 3.364065e-02  target = 1.814572e-05 
BICGSTAB: iter 9 r_norm = 1.200913e-02  target = 1.814572e-05 
BICGSTAB: iter 10 r_norm = 6.669762e-03  target = 1.814572e-05 
BICGSTAB: iter 11 r_norm = 7.425735e-03  target = 1.814572e-05 
BICGSTAB: iter 12 r_norm = 6.067449e-03  target = 1.814572e-05 
BICGSTAB: iter 13 r_norm = 5.632251e-03  target = 1.814572e-05 
BICGSTAB: iter 14 r_norm = 5.506331e-02  target = 1.814572e-05 
BICGSTAB: iter 15 r_norm = 8.002124e-02  target = 1.814572e-05 
BICGSTAB: iter 16 r_norm = 1.425816e-02  target = 1.814572e-05 
BICGSTAB: iter 17 r_norm = 4.207211e-03  target = 1.814572e-05 
BICGSTAB: iter 18 r_norm = 2.547295e-03  target = 1.814572e-05 
BICGSTAB: iter 19 r_norm = 1.525623e-03  target = 1.814572e-05 
BICGSTAB: iter 20 r_norm = 1.640748e-03  target = 1.814572e-05 
BICGSTAB: iter 21 r_norm = 3.953916e-04  target = 1.814572e-05 
BICGSTAB: iter 22 r_norm = 2.853951e-04  target = 1.814572e-05 
BICGSTAB: iter 23 r_norm = 2.437643e-04  target = 1.814572e-05 
BICGSTAB: iter 24 r_norm = 3.388092e-04  target = 1.814572e-05 
BICGSTAB: iter 25 r_norm = 1.831258e-04  target = 1.814572e-05 
BICGSTAB: iter 26 r_norm = 1.150974e-04  target = 1.814572e-05 
BICGSTAB: iter 27 r_norm = 8.481940e-05  target = 1.814572e-05 
BICGSTAB: iter 28 r_norm = 6.993216e-05  target = 1.814572e-05 
BICGSTAB: iter 29 r_norm = 5.573235e-05  target = 1.814572e-05 
BICGSTAB: iter 30 r_norm = 4.996171e-05  target = 1.814572e-05 
BICGSTAB: iter 31 r_norm = 3.904734e-05  target = 1.814572e-05 
BICGSTAB: iter 32 r_norm = 3.209256e-05  target = 1.814572e-05 
BICGSTAB: iter 33 r_norm = 2.457926e-05  target = 1.814572e-05 
BICGSTAB: iter 34 r_norm = 2.454816e-05  target = 1.814572e-05 
BICGSTAB: iter 35 r_norm = 2.027693e-05  target = 1.814572e-05 
BICGSTAB: iter 36 r_norm = 3.923647e-04  target = 1.814572e-05 
BICGSTAB: iter 37 r_norm = 4.092540e-05  target = 1.814572e-05 
BICGSTAB: iter 38 r_norm = 6.020191e-05  target = 1.814572e-05 
BICGSTAB: iter 39 r_norm = 2.355953e-04  target = 1.814572e-05 
BICGSTAB: iter 40 r_norm = 1.019154e-05  target = 1.814572e-05 
RICHARDSON: Iter 121   r_norm_sq=2.707322e+76   Target=2.672142e-21
RICHARDSON: iter=122 Delta =   1.00000000e-02  Delta Finish=  3.14166461e-49 Delta_use=  1.00000000e-02
BICGSTAB: ||b||^2 = 0.000000e+00 Target Rsd= 0.000000e+00
BICGSTAB: Breakdown in iteration 1. rho = 0
RICHARDSON: Iter 122   r_norm_sq=2.707322e+76   Target=2.672142e-21
RICHARDSON: iter=123 Delta =   1.00000000e-02  Delta Finish=  3.14166461e-49 Delta_use=  1.00000000e-02
BICGSTAB: ||b||^2 = 0.000000e+00 Target Rsd= 0.000000e+00
BICGSTAB: Breakdown in iteration 1. rho = 0
RICHARDSON: Iter 123   r_norm_sq=2.707322e+76   Target=2.672142e-21
RICHARDSON: iter=124 Delta =   1.00000000e-02  Delta Finish=  3.14166461e-49 Delta_use=  1.00000000e-02
BICGSTAB: ||b||^2 = 0.000000e+00 Target Rsd= 0.000000e+00
BICGSTAB: Breakdown in iteration 1. rho = 0

The complete output is large, here is the full thing: richardson.o331444.tar.gz

It seems to me that the inner solver needs 40-ish iterations and that the outer solver does 121 iterations without a problem. Then on the 122nd outer iteration the inner solver directly has a problem. So it does not happen right away but it seems that the residual is diverging.

If you look at the residual squared in terms of the outer iteration number, one sees a linear trend in a half-log plot, meaning an exponential growth:

plot

So this looks like a numerical instability where errors accumulate in an exponential fashion. But only for this particular soalen configuration.

I interpret this as something being just slightly wrong, and I lack the intuition to know the most promising starting point would be.

kostrzewa commented 7 years ago

Okay, so it diverges, just like in my case. Good to run with maximum verbosity when testing things ;D

kostrzewa commented 7 years ago

So this looks like a numerical instability where errors accumulate in an exponential fashion. But only for this particular soalen configuration.

No, it's not a numerical instability, otherwise the solver would always diverge. It is, with a high degree of probability, a technical problem related to the up- and down-conversion between inner and outer precisions. We can guess this because BiCGstab (and CG, for that matter) itself converge fine in either precision, as we've both tested, whatever SOALEN is used.

martin-ueding commented 7 years ago

You beat me to it, so this is somewhat redundant.

Just to summarize: We have established that the individual CG and BiCGStab solvers work with all soalen settings. We know that the Richardson solvers works with both CG and BiCGStab with soalen=4. So the problem is something inherent to the Richardson solver that depends on the soalen.

I have rewritten the convert function to make it more readable:

template <typename FTOut,
          int VOut,
          int SOut,
          bool CompressOut,
          typename FTIn,
          int VIn,
          int SIn,
          bool CompressIn>
void convert(
    typename Geometry<FTOut, VOut, SOut, CompressOut>::FourSpinorBlock *spinor_out,
    double scale_factor,
    const typename Geometry<FTIn, VIn, SIn, CompressIn>::FourSpinorBlock *spinor_in,
    const Geometry<FTOut, VOut, SOut, CompressOut> &geom_out,
    const Geometry<FTIn, VIn, SIn, CompressIn> &geom_in,
    int n_blas_threads)
{
  typedef typename ArithType<FTOut>::Type AT_out;

  // Get the subgrid latt size.
  int Nt = geom_out.Nt();
  int Nz = geom_out.Nz();
  int Ny = geom_out.Ny();
  int Nxh = geom_out.Nxh();
  int nvecs_out = geom_out.nVecs();
  int Pxy_out = geom_out.getPxy();
  int Pxyz_out = geom_out.getPxyz();

  int nvecs_in = geom_in.nVecs();
  int Pxy_in = geom_in.getPxy();
  int Pxyz_in = geom_in.getPxyz();

#pragma omp parallel for collapse(4)
  for (int t = 0; t < Nt; t++) {
    for (int z = 0; z < Nz; z++) {
      for (int y = 0; y < Ny; y++) {
        for (int s = 0; s < nvecs_out; s++) {
          for (int col = 0; col < 3; col++) {
            for (int spin = 0; spin < 4; spin++) {
              for (int x = 0; x < SOut; x++) {
                // Index of the soavector from the output.
                int const ind_out = t * Pxyz_out + z * Pxy_out + y * nvecs_out + s;

                // Actual X coordinate is given by the number of elements in each
                // soavector `SOut` and the position within the soavector `x`.
                int const x_coord = s * SOut + x;

                // The soavector on the current line of X elements is computed from the
                // actual X coordinate `x_coord` and the input soalen `SIn`.
                int const s_in = x_coord / SIn;

                // The position within the input soavector is given by subtracting the X
                // coordinate of the head of the input soavector from the actual X
                // coordinate.
                int const x_in = x_coord - SIn * s_in;

                // The index of the input soavector is computed analogous to the output
                // soavector.
                int const ind_in = t * Pxyz_in + z * Pxy_in + y * nvecs_in + s_in;

                // Iterate over real and imaginary part.
                for (int const reim : {0, 1}) {
                    // Extract source and target from the data structures.
                    auto const source = spinor_in[ind_in][col][spin][reim][x_in];
                    auto &target = spinor_out[ind_out][col][spin][reim][x];

                    // Convert the input numbers into the arithmetic type.
                    auto const scale_factor_rep = rep<AT_out, double>(scale_factor);
                    auto const source_rep = rep<AT_out, FTIn>(source);

                    // Perform the multiplication within the arithmetic types.
                    auto const result_rep = scale_factor_rep * source_rep;

                    // Convert the result back into the storage type.
                    auto const result = rep<FTOut, AT_out>(result_rep);

                    // Assign the result into the target array element.
                    target = result;
                }
              }
            }
          }
        }
      }
    }
  }
}

I do not think that there is some error in the index manipulations, it seems reasonable. The next step probably is a simple unit test of the convert function such that we can really exclude this as the source of errors?

kostrzewa commented 7 years ago

The other possible issue could be that the scale factor is computed and/or passed incorrectly for some reason. (this is a re-scaling by the current residual which is meant to prevent low-precision underruns)

kostrzewa commented 7 years ago

@bjoo Did you have a chance to also look into this?

martin-ueding commented 7 years ago

The error message with rho = 0 comes from the BiCGStab itself, not from the Richardson solver. This is where it comes from:

    // Check norm of r0
    norm2Spinor<FT, V, S, compress12, num_flav>(r_norm, r0, geom, norm2Threads);
    if (r_norm < rsd_sq) {
      masterPrintf("BICGSTAB converged at iter 0: ||r||=%16.8e target=%16.8e\n",
                   sqrt(r_norm),
                   sqrt(rsd_sq));
      rsd_sq_final = r_norm;
      n_iters = 0;
      return;
    }

    // r = r0
    copySpinor<FT, V, S, compress12, num_flav>(r, r0, geom, copyThreads);

    // Now intitialize p and v to zero.
    zeroSpinor<FT, V, S, compress12, num_flav>(p, geom, zeroThreads);
    zeroSpinor<FT, V, S, compress12, num_flav>(v, geom, zeroThreads);

    // rho_prev = 1
    double rho_prev_c[2] = {1.0, 0.0};

    // alpha = 1
    double alpha_c[2] = {1.0, 0.0};

    // omega = 1
    double omega_c[2] = {1.0, 0.0};

    double rho_c[2] = {0.0, 0.0};

    bool notConvP = true;
    int k = 1;
    // Now iterate
    for (k = 1; (k <= MaxIters) && notConvP; k++) {

      // rho_{k+1} = < r_0 | r >
      innerProduct<FT, V, S, compress12, num_flav>(
          rho_c, r0, r, geom, innerProductThreads);
      site_flops += 8 * 12 * num_flav;

      if ((rho_c[0] == 0.0) && (rho_c[1] == 0.0)) {
        masterPrintf("BICGSTAB: Breakdown in iteration %d. rho = 0\n", k);

        rsd_sq_final = r_norm;
        n_iters = k;
        return;
      }

We have breakdown in iteration 1, so that is the very first iteration of that for loop. This means that the setup code above has just run. And there is a copy from r0 to r. When we take the scalar product within the for loop, we get the norm squared of that vector. This can only be exactly zero if those spinors were exactly zero. A == 0.0 comparison is very brittle, so it usually only works when we have exact 0.0 in all the elements.

But the norm of r0 is computed right before it is copied into r. And if r0 indeed had a norm squared of exactly 0.0, the BiCGStab would have stopped right there, because a vanishing residual indicates convergence.

This does not make any sense, right? We should never see the breakdown in iteration 1, rho = 0 error message, we should either see convergence a few lines of code above or it should do a proper iteration. It should be impossible to have a breakdown in iteration 1.

So this means that perhaps the innerProduct or the norm2Spinor function does something wrong? But I did check that, so it cannot be just that.

We have a problem that manifests itself in the BiCGStab, but only when used in the Richardson solver. But also when one uses the CG, similar things happen. So it must be inherent to the Richardson solver. But there also is seemingly impossible behavior in the BiCGStab.

What are we missing here?

bjoo commented 7 years ago

In principle this can arise if both the RHS and the initial Guess are zero? Does the breakdown occur with random initial guess?

Best, B

On Nov 6, 2017, at 2:08 PM, Martin Ueding notifications@github.com wrote:

The error message with rho = 0 comes from the BiCGStab itself, not from the Richardson solver. This is where it comes from:
// Check norm of r0
norm2Spinor<FT, V, S, compress12, num_flav>(r_norm, r0, geom, norm2Threads);
if (r_norm < rsd_sq) {
  masterPrintf("BICGSTAB converged at iter 0: ||r||=%16.8e target=%16.8e\n",
               sqrt(r_norm),
               sqrt(rsd_sq));
  rsd_sq_final = r_norm;
  n_iters = 0;
  return;
}

// r = r0
copySpinor<FT, V, S, compress12, num_flav>(r, r0, geom, copyThreads);

// Now intitialize p and v to zero.
zeroSpinor<FT, V, S, compress12, num_flav>(p, geom, zeroThreads);
zeroSpinor<FT, V, S, compress12, num_flav>(v, geom, zeroThreads);

// rho_prev = 1
double rho_prev_c[2] = {1.0, 0.0};

// alpha = 1
double alpha_c[2] = {1.0, 0.0};

// omega = 1
double omega_c[2] = {1.0, 0.0};

double rho_c[2] = {0.0, 0.0};

bool notConvP = true;
int k = 1;
// Now iterate
for (k = 1; (k <= MaxIters) && notConvP; k++) {

  // rho_{k+1} = < r_0 | r >
  innerProduct<FT, V, S, compress12, num_flav>(
      rho_c, r0, r, geom, innerProductThreads);
  site_flops += 8 * 12 * num_flav;

  if ((rho_c[0] == 0.0) && (rho_c[1] == 0.0)) {
    masterPrintf("BICGSTAB: Breakdown in iteration %d. rho = 0\n", k);

    rsd_sq_final = r_norm;
    n_iters = k;
    return;
  }
We have breakdown in iteration 1, so that is the very first iteration of that for loop. This means that the setup code above has just run. And there is a copy from r0 to r. When we take the scalar product within the for loop, we get the norm squared of that vector. This can only be exactly zero if those spinors were exactly zero. A == 0.0 comparison is very brittle, so it usually only works when we have exact 0.0 in all the elements.

But the norm of r0 is computed right before it is copied into r. And if r0 indeed had a norm squared of exactly 0.0, the BiCGStab would have stopped right there, because a vanishing residual indicates convergence.

This does not make any sense, right? We should never see the breakdown in iteration 1, rho = 0 error message, we should either see convergence a few lines of code above or it should do a proper iteration. It should be impossible to have a breakdown in iteration 1.

So this means that perhaps the innerProduct or the norm2Spinor function does something wrong? But I did check that, so it cannot be just that.

We have a problem that manifests itself in the BiCGStab, but only when used in the Richardson solver. But also when one uses the CG, similar things happen. So it must be inherent to the Richardson solver. But there also is seemingly impossible behavior in the BiCGStab.

What are we missing here?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

If the right-hand-side and the initial guess are zero, then r0 will hold a zero value. But then in the next step, the square norm of r0 is computed, and if that is below the threshold, the solver has converged. This is not happening, therefore r0 does not have zero norm, and either initial guess or right-hand-side are non-zero.

bjoo commented 7 years ago

HI Martin, I’ve just noticed you say that this is in the ‘strong scaling limit’ What are the lattice sizes (local)? Whatis the SOALEN? Best, B

On Nov 6, 2017, at 2:14 PM, Martin Ueding notifications@github.com wrote:

If the right-hand-side and the initial guess are zero, then r0 will hold a zero value. But then in the next step, the square norm of r0 is computed, and if that is below the threshold, the solver has converged. This is not happening, therefore r0 does not have zero norm, and either initial guess or right-hand-side are non-zero.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

Our strong scaling limit is imposed by DDalphaAMG and that limits us to 16×8³ of a local lattice.

martin-ueding commented 7 years ago

And see this older comment about the SOALEN that we are using. This problem seem to occur in various manifestations. I am currently looking at the soalen = 8 case with double/single because there the error message is somewhat clear to locate.

martin-ueding commented 7 years ago

From the older comment, it seems that the following lead to a diverging problem:

Outer veclen: 8
Outer soalen: 8
Inner veclen: 16
Inner soalen: 4

I have just run this again with the brand new version of QPhiX, and it did converge! Now I will compile again with soalen-8-8 and see whether the problem has vanished there. If so, that would be marvelous. Bartek would have to cross-check, but it might be that this bug has vanished by itself for some reason.

martin-ueding commented 7 years ago

This is the output of Chroma, running with one MPI process and 256 threads on a single KNL:

Veclen is 8
Soalen is 8
Inner Veclen is 16
Inner Soalen is 8
...
Starting solve
RICHARDSON: RelativeResiduum requested
RICHARDSON: Initial || r ||^2 =   2.67214178e-01   Target || r ||^2 =   2.67214178e-21
RICHARDSON: iter=1 Delta =   1.00000000e-02  Delta Finish=  1.00000000e-10 Delta_use=  1.00000000e-02
BICGSTAB: Relative Residuum requested
BICGSTAB: Target Rsd= 1.000000e-04
BICGSTAB: iter 1 r_norm = 6.512983e-02  target = 1.000000e-04 
BICGSTAB: iter 2 r_norm = 1.488459e+00  target = 1.000000e-04 
BICGSTAB: iter 3 r_norm = 1.869288e-02  target = 1.000000e-04 
BICGSTAB: iter 4 r_norm = 9.046827e-02  target = 1.000000e-04 
BICGSTAB: iter 5 r_norm = 6.530680e-03  target = 1.000000e-04 
BICGSTAB: iter 6 r_norm = 8.442263e-03  target = 1.000000e-04 
BICGSTAB: iter 7 r_norm = 2.102320e-03  target = 1.000000e-04 
BICGSTAB: iter 8 r_norm = 1.997315e-03  target = 1.000000e-04 
BICGSTAB: iter 9 r_norm = 7.731876e-04  target = 1.000000e-04 
BICGSTAB: iter 10 r_norm = 6.661457e-04  target = 1.000000e-04 
BICGSTAB: iter 11 r_norm = 4.186846e-04  target = 1.000000e-04 
BICGSTAB: iter 12 r_norm = 8.482786e-04  target = 1.000000e-04 
BICGSTAB: iter 13 r_norm = 7.752908e-04  target = 1.000000e-04 
BICGSTAB: iter 14 r_norm = 5.143060e-04  target = 1.000000e-04 
BICGSTAB: iter 15 r_norm = 1.235128e-04  target = 1.000000e-04 
BICGSTAB: iter 16 r_norm = 1.199492e-04  target = 1.000000e-04 
BICGSTAB: iter 17 r_norm = 1.030780e-04  target = 1.000000e-04 
BICGSTAB: iter 18 r_norm = 1.195428e-04  target = 1.000000e-04 
BICGSTAB: iter 19 r_norm = 9.955906e-05  target = 1.000000e-04 
RICHARDSON: Iter 1   r_norm_sq=2.660359e-05   Target=2.672142e-21
RICHARDSON: iter=2 Delta =   1.00000000e-02  Delta Finish=  1.00221202e-08 Delta_use=  1.00000000e-02
BICGSTAB: Relative Residuum requested
BICGSTAB: Target Rsd= 9.999999e-05
BICGSTAB: iter 1 r_norm = 3.618901e+00  target = 9.999999e-05 
BICGSTAB: iter 2 r_norm = 3.003181e+00  target = 9.999999e-05 
BICGSTAB: iter 3 r_norm = 7.625246e-01  target = 9.999999e-05 
BICGSTAB: iter 4 r_norm = 5.732518e-01  target = 9.999999e-05 
BICGSTAB: iter 5 r_norm = 4.181607e-01  target = 9.999999e-05 
BICGSTAB: iter 6 r_norm = 2.707410e-01  target = 9.999999e-05
...
BICGSTAB: iter 36 r_norm = 1.151582e-03  target = 1.960039e-04 
BICGSTAB: iter 37 r_norm = 2.372841e-04  target = 1.960039e-04 
BICGSTAB: iter 38 r_norm = 3.495893e-04  target = 1.960039e-04 
BICGSTAB: iter 39 r_norm = 2.635901e-04  target = 1.960039e-04 
BICGSTAB: iter 40 r_norm = 2.092025e-04  target = 1.960039e-04 
BICGSTAB: iter 41 r_norm = 1.865860e-04  target = 1.960039e-04 
RICHARDSON: Iter 5   r_norm_sq=2.335266e-21   Target=2.453137e-21
RICHARDSON: Solver converged at iter 5 || r || =   9.75679787e-11
RICHARDSON: Inner MV Apps=399 Outer MV Apps=7 Inner Site Flops=189480 Outer Site Flops=669
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 5 iters,  rsd_sq_final=9.51951046610738e-21
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 9.75679775308125e-11
QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=0.551431894302368 (sec)  Performance=200.937513308294 GFLOPS

So it's fixed?

@kostrzewa, what more checks do you want in order to believe that this issue is resolved?

bjoo commented 7 years ago

HI All, I didn’t do anything to it… I don’t think it ought to have been badly broken.

Best, B

On Nov 6, 2017, at 2:34 PM, Martin Ueding notifications@github.com wrote:

This is the output of Chroma, running with one MPI process and 256 threads on a single KNL:

Veclen is 8 Soalen is 8 Inner Veclen is 16 Inner Soalen is 8 ... Starting solve RICHARDSON: RelativeResiduum requested RICHARDSON: Initial || r ||^2 = 2.67214178e-01 Target || r ||^2 = 2.67214178e-21 RICHARDSON: iter=1 Delta = 1.00000000e-02 Delta Finish= 1.00000000e-10 Delta_use= 1.00000000e-02 BICGSTAB: Relative Residuum requested BICGSTAB: Target Rsd= 1.000000e-04 BICGSTAB: iter 1 r_norm = 6.512983e-02 target = 1.000000e-04 BICGSTAB: iter 2 r_norm = 1.488459e+00 target = 1.000000e-04 BICGSTAB: iter 3 r_norm = 1.869288e-02 target = 1.000000e-04 BICGSTAB: iter 4 r_norm = 9.046827e-02 target = 1.000000e-04 BICGSTAB: iter 5 r_norm = 6.530680e-03 target = 1.000000e-04 BICGSTAB: iter 6 r_norm = 8.442263e-03 target = 1.000000e-04 BICGSTAB: iter 7 r_norm = 2.102320e-03 target = 1.000000e-04 BICGSTAB: iter 8 r_norm = 1.997315e-03 target = 1.000000e-04 BICGSTAB: iter 9 r_norm = 7.731876e-04 target = 1.000000e-04 BICGSTAB: iter 10 r_norm = 6.661457e-04 target = 1.000000e-04 BICGSTAB: iter 11 r_norm = 4.186846e-04 target = 1.000000e-04 BICGSTAB: iter 12 r_norm = 8.482786e-04 target = 1.000000e-04 BICGSTAB: iter 13 r_norm = 7.752908e-04 target = 1.000000e-04 BICGSTAB: iter 14 r_norm = 5.143060e-04 target = 1.000000e-04 BICGSTAB: iter 15 r_norm = 1.235128e-04 target = 1.000000e-04 BICGSTAB: iter 16 r_norm = 1.199492e-04 target = 1.000000e-04 BICGSTAB: iter 17 r_norm = 1.030780e-04 target = 1.000000e-04 BICGSTAB: iter 18 r_norm = 1.195428e-04 target = 1.000000e-04 BICGSTAB: iter 19 r_norm = 9.955906e-05 target = 1.000000e-04 RICHARDSON: Iter 1 r_norm_sq=2.660359e-05 Target=2.672142e-21 RICHARDSON: iter=2 Delta = 1.00000000e-02 Delta Finish= 1.00221202e-08 Delta_use= 1.00000000e-02 BICGSTAB: Relative Residuum requested BICGSTAB: Target Rsd= 9.999999e-05 BICGSTAB: iter 1 r_norm = 3.618901e+00 target = 9.999999e-05 BICGSTAB: iter 2 r_norm = 3.003181e+00 target = 9.999999e-05 BICGSTAB: iter 3 r_norm = 7.625246e-01 target = 9.999999e-05 BICGSTAB: iter 4 r_norm = 5.732518e-01 target = 9.999999e-05 BICGSTAB: iter 5 r_norm = 4.181607e-01 target = 9.999999e-05 BICGSTAB: iter 6 r_norm = 2.707410e-01 target = 9.999999e-05 ... BICGSTAB: iter 36 r_norm = 1.151582e-03 target = 1.960039e-04 BICGSTAB: iter 37 r_norm = 2.372841e-04 target = 1.960039e-04 BICGSTAB: iter 38 r_norm = 3.495893e-04 target = 1.960039e-04 BICGSTAB: iter 39 r_norm = 2.635901e-04 target = 1.960039e-04 BICGSTAB: iter 40 r_norm = 2.092025e-04 target = 1.960039e-04 BICGSTAB: iter 41 r_norm = 1.865860e-04 target = 1.960039e-04 RICHARDSON: Iter 5 r_norm_sq=2.335266e-21 Target=2.453137e-21 RICHARDSON: Solver converged at iter 5 || r || = 9.75679787e-11 RICHARDSON: Inner MV Apps=399 Outer MV Apps=7 Inner Site Flops=189480 Outer Site Flops=669 QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: 5 iters, rsd_sq_final=9.51951046610738e-21 QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: || r || / || b || = 9.75679775308125e-11 QPHIX_CLOVER_BICGSTAB_ITER_REFINE_SOLVER: Solver Time=0.551431894302368 (sec) Performance=200.937513308294 GFLOPS

So it's fixed?

@kostrzewa, what more checks do you want in order to believe that this issue is resolved?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org

martin-ueding commented 7 years ago

We changed a bunch of things, so this could have happen from a lot of different commits or branches that we had. If it is fixed now, I would be very happy about that. But I would think that there are still some doubts that need to be resolved before I can mark this as fixed.

kostrzewa commented 7 years ago

@martin-ueding in your final test, was your local lattice volume 16x8^3 ? I'll give a try with the current devel to see if it works for me too.

martin-ueding commented 7 years ago

My tests were with the 24³×96 lattice from our production. However, iirc I was able to reproduce the error with that lattice as well. So the bug might not depend so strongly on the local lattice size. But please run your test, if it is gone there as well, we should be good to go.

At least this is a loud error, so this will not go by unnoticed.

kostrzewa commented 6 years ago

I can confirm that this has magically fixed itself for the test case I presented above. Closing! Thanks @martin-ueding for all the tests!

JeffersonLab / qphix