etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson clover and Wilson twisted mass fermions, together with an inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

DDalphaAMG_nd branch convergence issues #362

Open kostrzewa opened 7 years ago

kostrzewa commented 7 years ago

@sbacchio @Finkenrath Over the last few days I've had some time to try to understand an issue which has been bugging me a bit because I would like to run with the TM2p1p1 branch of sbacchio/DDalphaAMG and the corresponding head commit of the DDalphaAMG_nd branch of Finkenrath/tmLQCD to help with convergence in the heavy sector. However, I'm finding severe convergence problems and further issues. First a comparison to a working setup:

When I set up the head commit of the master branch of sbacchio/DDalphaAMG together with the head commit of the master branch of Finkenrath/tmLQCD, I get great convergence in the light sector and the expected iteration counts for the given aggregation and scale parameters.

Doing the same with the aforementioned branches for 2+1+1 results in solves which do not converge and output which I have not seen before:

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430216                       |
|                     csw: +1.740000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 48  24  24  24                  |
|           local lattice: 12  6   6   6                   |
|           block lattice: 3   3   3   3                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 5                               |
|            test vectors: 20                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 16  8   8   8                   |
|           local lattice: 4   2   2   2                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 8   4   4   4                   |
|           local lattice: 2   1   1   1                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.012000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007827+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008543+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008483+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.008835+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009761+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.072938 seconds
depth: 1, time spent for setting up next coarser operator: 0.057122 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.063018 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.057935 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.082971 seconds

performed 5 iterative setup steps
elapsed time: 13.714705 seconds (2.121091 seconds on coarse grid)

DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites 
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites 
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites 
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008605+0.000000i
| approx. rel. res. after  1      iterations: 2.686549e-02 |
| approx. rel. res. after  2      iterations: 9.386865e-03 |
| approx. rel. res. after  3      iterations: 3.141994e-03 |
| approx. rel. res. after  4      iterations: 1.246548e-03 |
| approx. rel. res. after  5      iterations: 4.854671e-04 |
| approx. rel. res. after  6      iterations: 1.898306e-04 |
| approx. rel. res. after  7      iterations: 7.727864e-05 |
| approx. rel. res. after  8      iterations: 3.056149e-05 |
| approx. rel. res. after  9      iterations: 1.221386e-05 |
| approx. rel. res. after  10     iterations: 4.911786e-06 |
| approx. rel. res. after  11     iterations: 1.944398e-06 |
| approx. rel. res. after  12     iterations: 7.717114e-07 |
| approx. rel. res. after  13     iterations: 3.055015e-07 |
| approx. rel. res. after  14     iterations: 1.214677e-07 |
| approx. rel. res. after  15     iterations: 4.836682e-08 |
| approx. rel. res. after  16     iterations: 1.907075e-08 |
| approx. rel. res. after  17     iterations: 7.568452e-09 |
| approx. rel. res. after  18     iterations: 3.016249e-09 |
| approx. rel. res. after  19     iterations: 1.199059e-09 |
| approx. rel. res. after  20     iterations: 4.778359e-10 |
| approx. rel. res. after  21     iterations: 1.885605e-10 |
| approx. rel. res. after  22     iterations: 7.484878e-11 |
| approx. rel. res. after  23     iterations: 2.994289e-11 |
+----------------------------------------------------------+

+----------------------------------------------------------+
|       FGMRES iterations: 23     coarse average: 3.96     |
| exact relative residual: ||r||/||b|| = 2.994289e-11      |
| elapsed wall clock time: 14.0737  seconds                |
|        coarse grid time: 6.6641   seconds (47.4%)        |
|  consumed core minutes*: 6.00e+01 (solve only)           |
|    max used mem/MPIproc: 1.93e-01 GB                     |
+----------------------------------------------------------+

To compare, the working setup looks like this:

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430216                       |
|                     csw: +1.740000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 48  24  24  24                  |
|           local lattice: 12  6   6   6                   |
|           block lattice: 3   3   3   3                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 5                               |
|            test vectors: 20                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 16  8   8   8                   |
|           local lattice: 4   2   2   2                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 8   4   4   4                   |
|           local lattice: 2   1   1   1                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.012000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.554985 seconds
depth: 1, time spent for setting up next coarser operator: 0.043112 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.044045 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.558093 seconds
depth: 1, time spent for setting up next coarser operator: 0.045157 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.031808 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.556642 seconds
[...]
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.029956 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.556980 seconds
depth: 1, time spent for setting up next coarser operator: 0.059933 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.028399 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.028356 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.033057 seconds

performed 5 iterative setup steps
elapsed time: 25.091544 seconds (12.091816 seconds on coarse grid)

DDalphaAMG setup ran, time 27.47 sec (44.02 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites 
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites 
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites 
+----------------------------------------------------------+
| approx. rel. res. after  1      iterations: 2.979074e-02 |
| approx. rel. res. after  2      iterations: 8.042268e-03 |
| approx. rel. res. after  3      iterations: 1.584980e-03 |
| approx. rel. res. after  4      iterations: 3.340151e-04 |
| approx. rel. res. after  5      iterations: 7.525576e-05 |
| approx. rel. res. after  6      iterations: 1.551435e-05 |
| approx. rel. res. after  7      iterations: 3.158749e-06 |
| approx. rel. res. after  8      iterations: 7.007767e-07 |
| approx. rel. res. after  9      iterations: 1.494747e-07 |
| approx. rel. res. after  10     iterations: 3.354428e-08 |
| approx. rel. res. after  11     iterations: 7.172643e-09 |
| approx. rel. res. after  12     iterations: 1.493532e-09 |
| approx. rel. res. after  13     iterations: 3.296716e-10 |
| approx. rel. res. after  14     iterations: 7.064493e-11 |
| approx. rel. res. after  15     iterations: 1.588326e-11 |
+----------------------------------------------------------+

+----------------------------------------------------------+
|       FGMRES iterations: 15     coarse average: 15.67    |
| exact relative residual: ||r||/||b|| = 1.588326e-11      |
| elapsed wall clock time: 1.5327   seconds                |
|        coarse grid time: 0.6300   seconds (41.1%)        |
|  consumed core minutes*: 6.54e+00 (solve only)           |
|    max used mem/MPIproc: 1.29e-01 GB                     |
+----------------------------------------------------------+

and is significantly faster, as you can see.

Have you seen this behaviour?

sbacchio commented 7 years ago

I think the problem is here:

DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites 
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites 
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites 

There is a big change in mu on the odd sites; somehow a wrong g_mu3 is used in the setup phase. What is your input file? Which executable are you using?

kostrzewa commented 7 years ago

This is in the HMC, so I would expect the problematic output to actually be correct. This is for the following setup:

BeginDDalphaAMG
  MGBlockX = 3
  MGBlockY = 3
  MGBlockZ = 3
  MGBlockT = 3
  MGSetupIter = 5
  MGCoarseSetupIter = 3
  MGNumberOfVectors = 20
  MGNumberOfLevels = 3
  MGCoarseMuFactor = 3
  MGdtauUpdate = 0.0624
  MGUpdateSetupIter = 1
  MGOMPNumThreads = 1
EndDDalphaAMG

and the following monomial triggers the first solve with ddalphaamg

BeginMonomial CLOVERDETRATIO
  Timescale = 2
  kappa = 0.1400645
  2KappaMu = 0.001120516
  # numerator shift
  rho = 0.02016936
  # denominator shift, should match CLOVERDET shift
  rho2 = 0.10420836
  CSW = 1.74
  MaxSolverIterations = 60000
  AcceptancePrecision =  1.e-21
  ForcePrecision = 1.e-18
  Name = cloverdetratio1light
  solver = ddalphaamg
EndMonomial

When I use the problematic version in invert to find optimal parameters, I get the same problems:

DDalphaAMG cnfg set, plaquette 5.432070e-01
DDalphaAMG running setup
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 0.919600 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.021723 seconds

initial coarse grid correction is defined
elapsed time: 8.644193 seconds

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430216                       |
|                     csw: +1.740000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 48  24  24  24                  |
|           local lattice: 12  6   6   6                   |
|           block lattice: 3   3   3   3                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 4                               |
|            test vectors: 24                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 16  8   8   8                   |
|           local lattice: 4   2   2   2                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 8   4   4   4                   |
|           local lattice: 2   1   1   1                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.028000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007809+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009952+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009985+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009946+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.105289 seconds
depth: 1, time spent for setting up next coarser operator: 0.918843 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.019513 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.019073 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.021341 seconds

performed 4 iterative setup steps
elapsed time: 21.283853 seconds (2.872039 seconds on coarse grid)

DDalphaAMG setup ran, time 29.94 sec (9.59 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after  1      iterations: 6.601027e-02 |
| approx. rel. res. after  2      iterations: 3.342283e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009991+0.000000i
| approx. rel. res. after  3      iterations: 2.425828e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009872+0.000000i
| approx. rel. res. after  4      iterations: 1.956784e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009959+0.000000i
| approx. rel. res. after  5      iterations: 1.715145e-02 |
[...] -> no convergence before 600 iterations

while the master branch works much better:

initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 1.172197 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.116010 seconds

initial coarse grid correction is defined
elapsed time: 4.875110 seconds

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430216                       |
|                     csw: +1.740000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 48  24  24  24                  |
|           local lattice: 6   6   12  12                  |
|           block lattice: 3   3   3   3                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 4                               |
|            test vectors: 24                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 16  8   8   8                   |
|           local lattice: 2   2   4   4                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 8   4   4   4                   |
|           local lattice: 1   1   2   2                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.028000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 1.151630 seconds
depth: 1, time spent for setting up next coarser operator: 0.109204 seconds
depth: 1, bootstrap step number 1...
[...]
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.104408 seconds

performed 4 iterative setup steps
elapsed time: 62.497544 seconds (38.668907 seconds on coarse grid)

DDalphaAMG setup ran, time 67.38 sec (57.39 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after  1      iterations: 5.388873e-02 |
| approx. rel. res. after  2      iterations: 1.388262e-02 |
| approx. rel. res. after  3      iterations: 3.364761e-03 |
| approx. rel. res. after  4      iterations: 8.359057e-04 |
| approx. rel. res. after  5      iterations: 1.990664e-04 |
| approx. rel. res. after  6      iterations: 4.952127e-05 |
| approx. rel. res. after  7      iterations: 1.263903e-05 |
| approx. rel. res. after  8      iterations: 3.351799e-06 |
| approx. rel. res. after  9      iterations: 8.567047e-07 |
| approx. rel. res. after  10     iterations: 2.091744e-07 |
| approx. rel. res. after  11     iterations: 5.094827e-08 |
| approx. rel. res. after  12     iterations: 1.216494e-08 |
| approx. rel. res. after  13     iterations: 2.904565e-09 |
| approx. rel. res. after  14     iterations: 6.856662e-10 |
+----------------------------------------------------------+

+----------------------------------------------------------+
|       FGMRES iterations: 14     coarse average: 292.79   |
| exact relative residual: ||r||/||b|| = 6.856662e-10      |
| elapsed wall clock time: 6.2996   seconds                |
|        coarse grid time: 4.8121   seconds (76.4%)        |
|  consumed core minutes*: 1.34e+01 (solve only)           |
|    max used mem/MPIproc: 2.78e-01 GB                     |
+----------------------------------------------------------+

Note that between the two runs above, there is a factor of two in the number of processes. However, I see the same problems with the same number of processes; I just don't have results for this particular example set of parameters.

sbacchio commented 7 years ago

Hmm, I don't like it. It's something we didn't notice in the runs for the Nf=2+1+1 ensemble, and we use the same package setup.

Something you could try, but I don't know if it will work, is to link the branch master of tmLQCD to the DDalphaAMG_nd branch of DDalphaAMG. So we check if the problem is in the interface or in the solver.

I will check the changes I made and try to come up with some ideas.

kostrzewa commented 7 years ago

Is it because I haven't specified MGNumberOfShifts = 4 ?

kostrzewa commented 7 years ago

Something you could try, but I don't know if it will work, is to link the branch master of tmLQCD to the DDalphaAMG_nd branch of DDalphaAMG. So we check if the problem is in the interface or in the solver.

Will test this out.

kostrzewa commented 7 years ago

It seems that the problem is in DDalphaAMG, rather than the interface. Using the master branch of Finkenrath/tmLQCD together with the TM2p1p1 branch of sbacchio/DDalphaAMG has the same problems as described above:

Problematic:

depth: 1, iter: 1, p->H(1,0) = +0.009670+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009650+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009739+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.073741 seconds
depth: 1, time spent for setting up next coarser operator: 0.042813 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.048123 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036359 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.130586 seconds

performed 5 iterative setup steps
elapsed time: 13.709875 seconds (2.341967 seconds on coarse grid)

DDalphaAMG setup ran, time 15.94 sec (14.69 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites 
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites 
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites 
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008553+0.000000i
| approx. rel. res. after  1      iterations: 2.693876e-02 |
| approx. rel. res. after  2      iterations: 9.422674e-03 |
| approx. rel. res. after  3      iterations: 3.136621e-03 |
| approx. rel. res. after  4      iterations: 1.244779e-03 |
| approx. rel. res. after  5      iterations: 4.886695e-04 |
| approx. rel. res. after  6      iterations: 1.909823e-04 |
| approx. rel. res. after  7      iterations: 7.708101e-05 |
| approx. rel. res. after  8      iterations: 3.028029e-05 |
| approx. rel. res. after  9      iterations: 1.209484e-05 |
| approx. rel. res. after  10     iterations: 4.876731e-06 |
| approx. rel. res. after  11     iterations: 1.936528e-06 |
| approx. rel. res. after  12     iterations: 7.807262e-07 |
| approx. rel. res. after  13     iterations: 3.124696e-07 |
| approx. rel. res. after  14     iterations: 1.238244e-07 |
| approx. rel. res. after  15     iterations: 4.957753e-08 |
| approx. rel. res. after  16     iterations: 1.986782e-08 |
| approx. rel. res. after  17     iterations: 7.987017e-09 |
| approx. rel. res. after  18     iterations: 3.190570e-09 |
| approx. rel. res. after  19     iterations: 1.264548e-09 |
| approx. rel. res. after  20     iterations: 5.055527e-10 |
| approx. rel. res. after  21     iterations: 2.021383e-10 |
| approx. rel. res. after  22     iterations: 8.120851e-11 |
| approx. rel. res. after  23     iterations: 3.276034e-11 |
| approx. rel. res. after  24     iterations: 1.314241e-11 |
+----------------------------------------------------------+

+----------------------------------------------------------+
|       FGMRES iterations: 24     coarse average: 3.96     |
| exact relative residual: ||r||/||b|| = 1.314241e-11      |
| elapsed wall clock time: 10.9579  seconds                |
|        coarse grid time: 6.8740   seconds (62.7%)        |
|  consumed core minutes*: 4.68e+01 (solve only)           |
|    max used mem/MPIproc: 1.93e-01 GB                     |
+----------------------------------------------------------+

Unproblematic (master + master):

depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.039350 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036769 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.034938 seconds

performed 5 iterative setup steps
elapsed time: 26.075408 seconds (12.331877 seconds on coarse grid)

DDalphaAMG setup ran, time 28.19 sec (43.75 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites 
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites 
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites 
+----------------------------------------------------------+
| approx. rel. res. after  1      iterations: 2.981504e-02 |
| approx. rel. res. after  2      iterations: 8.115737e-03 |
| approx. rel. res. after  3      iterations: 1.613163e-03 |
| approx. rel. res. after  4      iterations: 3.403916e-04 |
| approx. rel. res. after  5      iterations: 6.901793e-05 |
| approx. rel. res. after  6      iterations: 1.509629e-05 |
| approx. rel. res. after  7      iterations: 3.174391e-06 |
| approx. rel. res. after  8      iterations: 6.519720e-07 |
| approx. rel. res. after  9      iterations: 1.452323e-07 |
| approx. rel. res. after  10     iterations: 3.097001e-08 |
| approx. rel. res. after  11     iterations: 6.925372e-09 |
| approx. rel. res. after  12     iterations: 1.462020e-09 |
| approx. rel. res. after  13     iterations: 3.030030e-10 |
| approx. rel. res. after  14     iterations: 6.678557e-11 |
| approx. rel. res. after  15     iterations: 1.420444e-11 |
+----------------------------------------------------------+

+----------------------------------------------------------+
|       FGMRES iterations: 15     coarse average: 16.67    |
| exact relative residual: ||r||/||b|| = 1.420444e-11      |
| elapsed wall clock time: 1.6075   seconds                |
|        coarse grid time: 0.5843   seconds (36.3%)        |
|  consumed core minutes*: 6.86e+00 (solve only)           |
|    max used mem/MPIproc: 1.29e-01 GB                     |
+----------------------------------------------------------+

kostrzewa commented 7 years ago

@sunpho84 This could be the reason why your test simulation on Marconi A2 was even slower than expected and why inversions were not converging if done outside of the HMC. If I remember correctly, we set up the TM2p1p1 branch of DDalphaAMG as well as the DDalphaAMG_nd branch of tmLQCD, correct?

sunpho84 commented 7 years ago

Yes I was using your suggestion, that is:

https://github.com/Finkenrath/tmLQCD/tree/DDalphaAMG_nd

linked against

https://github.com/sbacchio/DDalphaAMG/commits/TM2p1p1

sbacchio commented 7 years ago

Ok, I will work on this starting today. My guess is that I broke the e/o preconditioning for the smoother when an odd-sized block is used. The point is that everything works fine in our runs and I've never noticed convergence issues, so the problem should be in some "special" case that I didn't check.

@kostrzewa To confirm that, could you please try to run with an even-sized block, like 4 3 3 3?

Thanks!

kostrzewa commented 7 years ago

Would 6x4x4x4 be okay too?

kostrzewa commented 7 years ago

sorry, I meant 6x3x3x3

sbacchio commented 7 years ago

Yes, that should be fine! And then maybe we should try to turn off the e/o and then the SSE.

You can turn off the e/o by changing line 989 of init.c in DDalphaAMG, and the SSE can be turned off in the Makefile.

kostrzewa commented 7 years ago

So with 6x3x3x3 I get the same p->H(1,0) messages that, as mentioned above, I had not seen before.

kostrzewa commented 7 years ago
warning: The SSE implementation is based on the odd-even preconditioned code.    
         Switch on odd-even preconditioning in the input file.
error: assertion "g.odd_even" failed (build/gsrc/init.c:1092)
       bad choice of input parameters (please read the user manual in /doc).

So I need to disable both SSE and e/o.

kostrzewa commented 7 years ago

And that fails:

build/gsrc/coarse_operator_float.c(47): error: identifier "SIMD_LENGTH_float" is undefined
      int column_offset = 2*SIMD_LENGTH_float*((l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
                            ^

build/gsrc/coarse_operator_float.c(55): error: identifier "SIMD_LENGTH_float" is undefined
      int column_offset = SIMD_LENGTH_float*((2*l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
                          ^
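
For context, the failing expressions just pad the number of coarse vector components up to a multiple of the SIMD vector width; SIMD_LENGTH_float is only defined when the vectorised kernels are compiled in, which is why the non-SSE build breaks here. A sketch of the arithmetic (the width of 4 single-precision floats per 128-bit SSE register is an assumption for illustration):

```python
def padded_column_offset(num_parent_eig_vect, simd_length):
    """Mirror of one failing C expression: round 2*num_parent_eig_vect
    up to a multiple of the SIMD vector length via integer ceil division."""
    return simd_length * ((2 * num_parent_eig_vect + simd_length - 1) // simd_length)

# With 20 test vectors and a SIMD width of 4 floats, 40 components
# already align, so no padding is added:
print(padded_column_offset(20, 4))  # 40
# An odd component count gets rounded up to the next multiple of 4:
print(padded_column_offset(21, 4))  # 44
```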
kostrzewa commented 7 years ago

Trying a clean build.

kostrzewa commented 7 years ago

Nope.

kostrzewa commented 7 years ago

@sunpho84 if you're still interested in the A40.40 run (or was it A30.40?) you can try the master branch of sbacchio/DDalphaAMG with the master branch of Finkenrath/tmLQCD. It might work better that way. (We also had an odd-sized blocking, correct?)

sbacchio commented 7 years ago

Ah right, clear! I forgot about that: the SSE is based on the e/o. Removing both should work: e/o = 0 and a Makefile without -DSSE in OPT_VERSION_FLAGS. Since you are editing the Makefile, could you please also enable -DDEBUG in OPT_VERSION_FLAGS?

I'm really sorry to make you try things, but I've not been able to replicate your problem.

kostrzewa commented 7 years ago

I tried to disable both e/o and SSE, the result is that SIMD_LENGTH_float is undefined...

sunpho84 commented 7 years ago

@sunpho84 if you're still interested in the A40.40 run (or was it A30.40?) you can try the master branch of sbacchio/DDalphaAMG with the master branch of Finkenrath/tmLQCD. It might work better that way. (We also had an odd-sized blocking, correct?)

I thought that the TM2p1p1 was the correct one for nf=2+1+1?

kostrzewa commented 7 years ago

Well, yes, but if you don't run with DDalphaAMG in the heavy sector, then you don't need the extra stuff.

kostrzewa commented 7 years ago

@sbacchio Okay, I think I might have to give up for now. I think there might be a compiler issue on the machine that I was trying this on.

@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.

kostrzewa commented 7 years ago

@sbacchio So you tried to reproduce this on a 24c48 lattice with the 3x3x3x3 aggregation? If you can't reproduce it, then the problem is probably on my side. There are some odd things going on on the machine that I tried this on. If I get a chance, I'll compile with GCC to see if that works.

sunpho84 commented 7 years ago

@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.

Yes, in the old logs; see e.g. /marconi_work/INF17_lqcd123_0/sanfo/hmcnf2p1p1/A40.40/logs/log_mg_1490524967

Then I tried a few variations of the settings (following some of sbacchio's suggestions) and this warning disappeared; see the logs in the logs/ folder.

sbacchio commented 7 years ago

Sorry, yesterday I had to leave early.

So I've now pushed a version which can be compiled without SSE and which has a possible bug fix. I'm trying to compare the two versions, but I made so many changes that it is hard to find the right place.

@sunpho84 can you remind me what the differences are between before and after having p->H(1,0)?

@kostrzewa I didn't have exactly that configuration, but trying with what I have I've not been able to reproduce the p->H(1,0) warning.

sunpho84 commented 7 years ago

It looks to me as if it happens on a random basis. Here is a sample:

+----------------------------------------------------------+
| 2-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.937588                       |
|                     csw: +0.000000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 80  40  40  40                  |
|           local lattice: 4   10  10  10                  |
|           block lattice: 4   5   5   5                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 24                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 20  8   8   8                   |
|           local lattice: 1   2   2   2                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.012000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +nan+0.000000i
[...]
sbacchio commented 7 years ago

OK, I confirm that the construction of the coarse operator is broken when an odd size is used in the fastest-running index.

There are two solutions at the moment:

I hope to solve it today!

sbacchio commented 7 years ago

It should be fixed.

@kostrzewa can you check if now it works? :)

kostrzewa commented 7 years ago

@sbacchio I'm checking this now, thanks!

kostrzewa commented 7 years ago

It seems to work. I don't fully understand, however, what happens in the following situation:

running the 24c48 lattice using 512 MPI processes with a 3-level setup and 3^4 aggregates, with an 8x4x4x4 parallelisation. Naively I would have expected a 3-level setup to be impossible here.

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430216                       |
|                     csw: +1.740000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 48  24  24  24                  |
|           local lattice: 6   6   6   6                   |
|           block lattice: 3   3   3   3                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 5                               |
|            test vectors: 20                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 16  8   8   8                   |
|           local lattice: 16  2   2   2                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 8   4   4   4                   |
|           local lattice: 8   1   1   1                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.012000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

The corresponding 2-level setup works fine, so I think this bug has been fixed. What I don't understand is the output for the second level of the above 3-level setup.
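
For what it's worth, the banner numbers can be cross-checked with a bit of bookkeeping (plain arithmetic, not DDalphaAMG code): the coarse global lattice is the fine one divided by the block size, and the process grid implied by each level is global/local. Reading it this way, depth 1 appears to retain only one rank along t, which may be the source of the confusing 16 2 2 2 local lattice:

```python
def coarsen(global_lat, block):
    """Coarse global lattice = fine global lattice / block size, per direction."""
    return tuple(g // b for g, b in zip(global_lat, block))

def process_grid(global_lat, local_lat):
    """Process grid implied by the printed global and local lattices."""
    return tuple(g // l for g, l in zip(global_lat, local_lat))

depth0 = (48, 24, 24, 24)
depth1 = coarsen(depth0, (3, 3, 3, 3))
print(depth1)                               # (16, 8, 8, 8), as in the banner
print(process_grid(depth0, (6, 6, 6, 6)))   # (8, 4, 4, 4): the full 512 ranks
print(process_grid(depth1, (16, 2, 2, 2)))  # (1, 4, 4, 4): only 64 ranks active
```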

kostrzewa commented 7 years ago

@sbacchio The MMS solver in DDalphaAMG and the tmLQCD DDalphaAMG_nd interface seem to have diverged. I was testing the 2p1p1 branch of DDalphaAMG using the standard interface (Finkenrath/tmLQCD/DDalphaAMG) without the calls to the 1+1 functions, so I did not notice until now.

The pointer to the array of tolerances that is passed has not been implemented in the interface (see, for instance, DDalphaAMG_solve_ms_doublet_squared_odd).

kostrzewa commented 7 years ago

I've set up a pull request for this at https://github.com/Finkenrath/tmLQCD/pull/8

sbacchio commented 7 years ago

Sorry, I missed your message the other day.

kostrzewa commented 7 years ago

@sbacchio I think there may be other problems, related to what I wrote above in https://github.com/etmc/tmLQCD/issues/362#issuecomment-298159660

If you look at the output below, the lattice dimensions seem to be completely sensible for a 3-level, 4^2*3^2 setup. However, I believe that something weird happens when the blocking is done, wouldn't you say? The run crashes during setup with a segmentation fault.

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430229                       |
|                     csw: +1.740000                       |
|                      mu: +0.001200                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 96  48  48  48                  |
|           local lattice: 8   6   6   8                   |
|           block lattice: 4   3   3   4                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 5                               |
|            test vectors: 20                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 24  16  16  12                  |
|           local lattice: 24  2   2   2                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 12  8   8   6                   |
|           local lattice: 12  1   1   1                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.011400                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+
kostrzewa commented 7 years ago

Or would I need at least four blocks in any one of the lattice dimensions, such that even-odd on the third level works?

sbacchio commented 7 years ago

Oufff... No, something like that should work! At what point of the setup does the seg. fault appear?


kostrzewa commented 7 years ago

I'll check when I get to the office.

So, just to understand: if on the second level it has 2x2x2x2 local lattice points, would it automatically aggregate only in three dimensions to have 2x1x1x1 on the coarsest level? Or would it aggregate down to 1x1x1x1 on the coarsest level and simply skip every second MPI process when working on the coarsest level to do even-odd?

kostrzewa commented 7 years ago
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.196736 seconds
depth: 1, time spent for setting up next coarser operator: 0.266321 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.258096 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.169192 seconds
depth: 1, time spent for setting up next coarser operator: 0.268444 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.258697 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.171508 seconds
depth: 1, time spent for setting up next coarser operator: 0.268366 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.261473 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.262536 seconds
depth: 0, bootstrap step number 4...
depth: 0, time spent for setting up next coarser operator: 0.257923 seconds
depth: 1, time spent for setting up next coarser operator: 0.263163 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.257366 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.258957 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.168741 seconds

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 18571 RUNNING AT r076c06s04-hfi.marconi.cineca.it
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

I'm happy to debug this some more. Would you get more info with some of the debug compiler flags for DDalphaAMG?

kostrzewa commented 7 years ago

On the 24c48 lattice that I mentioned in https://github.com/etmc/tmLQCD/issues/362#issuecomment-298159660, the situation is analogous:

+----------------------------------------------------------+
| 3-level method                                           |
| postsmoothing K-cycle                                    |
| FGMRES + red-black multiplicative Schwarz                |
|          restart length: 30                              |
|                      m0: -0.430216                       |
|                     csw: +1.740000                       |
|                      mu: +0.004000                       |
+----------------------------------------------------------+
|   preconditioner cycles: 1                               |
|            inner solver: minimal residual iteration      |
|               precision: single                          |
+---------------------- depth  0 --------------------------+
|          global lattice: 48  24  24  24                  |
|           local lattice: 6   6   6   6                   |
|           block lattice: 3   3   3   3                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 5                               |
|            test vectors: 20                              |
+---------------------- depth  1 --------------------------+
|          global lattice: 16  8   8   8                   |
|           local lattice: 16  2   2   2                   |
|           block lattice: 2   2   2   2                   |
|        post smooth iter: 4                               |
|     smoother inner iter: 4                               |
|              setup iter: 3                               |
|            test vectors: 28                              |
+---------------------- depth  2 --------------------------+
|          global lattice: 8   4   4   4                   |
|           local lattice: 8   1   1   1                   |
|           block lattice: 1   1   1   1                   |
|      coarge grid solver: odd even GMRES                  |
|              iterations: 200                             |
|                  cycles: 10                              |
|               tolerance: 1e-01                           |
|                      mu: +0.012000                       |
+----------------------------------------------------------+
|          K-cycle length: 5                               |
|        K-cycle restarts: 2                               |
|       K-cycle tolerance: 1e-01                           |
+----------------------------------------------------------+

depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.101174 seconds
depth: 1, time spent for setting up next coarser operator: 0.174573 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.173764 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.100405 seconds
depth: 1, time spent for setting up next coarser operator: 0.171963 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.171607 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.101836 seconds
depth: 1, time spent for setting up next coarser operator: 0.175991 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.172342 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.171551 seconds
depth: 0, bootstrap step number 4...
depth: 0, time spent for setting up next coarser operator: 0.101485 seconds
depth: 1, time spent for setting up next coarser operator: 0.173913 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.171866 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.172153 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.099945 seconds

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 5407 RUNNING AT r079c02s02-hfi.marconi.cineca.it
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
sbacchio commented 7 years ago

So, just to understand: if on the second level it has 2x2x2x2 local lattice points, it would aggregate only in three dimensions to have 2x1x1x1 on the coarsest level? Or would it aggregate down to 1x1x1x1 on the coarsest level and simply idle one MPI process to do even-odd?

yes, since odd-even is enabled you need to have an even local volume on the coarsest grid.

Ah OK, so the bug is not in the setup phase, but in something done just after... or rather, it's during the last setup iteration. Hmm, I should try to replicate the problem in order to study it here.

I will check on it.

kostrzewa commented 7 years ago

yes, since odd-even is enabled you need to have an even local volume on the coarsest grid.

sorry, but yes to which question?

sbacchio commented 7 years ago

The rule is that you need at least a factor of 2 in the coarsest local lattice.

So yes, it should aggregate in just three directions. I understand that from tmLQCD you don't have control over the coarse block lattice, but it should be handled automatically. If it doesn't work, it is easy to fix, or to expose the coarse block lattice.
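
The rule stated above can be written down as a one-line check (a hypothetical helper mirroring the comment, not the actual DDalphaAMG logic):

```python
from math import prod

def oddeven_ok(coarsest_local):
    """Odd-even on the coarsest grid needs an even local volume,
    i.e. at least one factor of 2 in the coarsest local lattice."""
    return prod(coarsest_local) % 2 == 0

# Aggregating 2x2x2x2 local sites in only three directions leaves 2x1x1x1,
# which still satisfies the rule:
print(oddeven_ok((2, 1, 1, 1)))  # True
# Aggregating in all four directions would leave 1x1x1x1 and break it:
print(oddeven_ok((1, 1, 1, 1)))  # False
```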


kostrzewa commented 7 years ago

Okay, thanks. I think it would be quite helpful if this worked at some point.