kostrzewa opened this issue 7 years ago
I think the problem is here:
DDalphaAMG setup ran, time 15.61 sec (13.59 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
there is a big change in mu on the odd sites... somehow a wrong g_mu3 is used in the setup phase. What is your input file? Which executable are you using?
This is in the HMC, so I would expect that the problematic output is actually correct. This is for the following setup:
BeginDDalphaAMG
MGBlockX = 3
MGBlockY = 3
MGBlockZ = 3
MGBlockT = 3
MGSetupIter = 5
MGCoarseSetupIter = 3
MGNumberOfVectors = 20
MGNumberOfLevels = 3
MGCoarseMuFactor = 3
MGdtauUpdate = 0.0624
MGUpdateSetupIter = 1
MGOMPNumThreads = 1
EndDDalphaAMG
and the following monomial triggers the first solve with ddalphaamg:
BeginMonomial CLOVERDETRATIO
Timescale = 2
kappa = 0.1400645
2KappaMu = 0.001120516
# numerator shift
rho = 0.02016936
# denominator shift, should match CLOVERDET shift
rho2 = 0.10420836
CSW = 1.74
MaxSolverIterations = 60000
AcceptancePrecision = 1.e-21
ForcePrecision = 1.e-18
Name = cloverdetratio1light
solver = ddalphaamg
EndMonomial
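As a sanity check on the "mu updated" log lines quoted above, the odd-site values can be reproduced from these monomial parameters. The formula below is my own reading, not something stated in the thread: the even-site mu is 2KappaMu/(2*kappa), the odd-site mu additionally carries the rho2 shift divided by 2*kappa, and MGCoarseMuFactor = 3 scales both on the coarsest level.

```python
# Hypothetical consistency check for the "mu updated" log lines.
# Assumption (my reading, not confirmed in the thread): odd-site mu is the
# even-site mu plus rho2/(2*kappa), and MGCoarseMuFactor multiplies both
# values on the coarsest level.
kappa = 0.1400645
two_kappa_mu = 0.001120516
rho2 = 0.10420836
coarse_mu_factor = 3

mu_even = two_kappa_mu / (2 * kappa)   # bare twisted mass on even sites
mu_odd = mu_even + rho2 / (2 * kappa)  # odd sites carry the denominator shift

print(f"{mu_even:.6f}")                     # 0.004000
print(f"{mu_odd:.6f}")                      # 0.376001
print(f"{coarse_mu_factor * mu_even:.6f}")  # 0.012000
print(f"{coarse_mu_factor * mu_odd:.6f}")   # 1.128004
```

These match the logged values at depths 0/1 (0.004000 and 0.376001) and at depth 2 (0.012000 and 1.128004), which supports the statement that the "problematic" output is in fact correct.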
When I use the problematic version in invert to find optimal parameters, I get the same problems:
DDalphaAMG cnfg set, plaquette 5.432070e-01
DDalphaAMG running setup
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 0.919600 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.021723 seconds
initial coarse grid correction is defined
elapsed time: 8.644193 seconds
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 12 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 4 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 4 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 2 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.028000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +0.007809+0.000000i
[...]
depth: 1, iter: 1, p->H(1,0) = +0.009952+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009985+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009946+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.105289 seconds
depth: 1, time spent for setting up next coarser operator: 0.918843 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.019513 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.019073 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.021341 seconds
performed 4 iterative setup steps
elapsed time: 21.283853 seconds (2.872039 seconds on coarse grid)
DDalphaAMG setup ran, time 29.94 sec (9.59 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 6.601027e-02 |
| approx. rel. res. after 2 iterations: 3.342283e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009991+0.000000i
| approx. rel. res. after 3 iterations: 2.425828e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009872+0.000000i
| approx. rel. res. after 4 iterations: 1.956784e-02 |
depth: 1, iter: 1, p->H(1,0) = +0.009959+0.000000i
| approx. rel. res. after 5 iterations: 1.715145e-02 |
[...] -> no convergence before 600 iterations
while the master branch works rather better:
initial definition --- depth: 0
depth: 0, time spent for setting up next coarser operator: 1.172197 seconds
initial definition --- depth: 1
depth: 1, time spent for setting up next coarser operator: 0.116010 seconds
initial coarse grid correction is defined
elapsed time: 4.875110 seconds
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 6 6 12 12 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 4 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 2 2 4 4 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 1 1 2 2 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.028000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 1.151630 seconds
depth: 1, time spent for setting up next coarser operator: 0.109204 seconds
depth: 1, bootstrap step number 1...
[...]
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.104408 seconds
performed 4 iterative setup steps
elapsed time: 62.497544 seconds (38.668907 seconds on coarse grid)
DDalphaAMG setup ran, time 67.38 sec (57.39 % on coarse grid)
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 5.388873e-02 |
| approx. rel. res. after 2 iterations: 1.388262e-02 |
| approx. rel. res. after 3 iterations: 3.364761e-03 |
| approx. rel. res. after 4 iterations: 8.359057e-04 |
| approx. rel. res. after 5 iterations: 1.990664e-04 |
| approx. rel. res. after 6 iterations: 4.952127e-05 |
| approx. rel. res. after 7 iterations: 1.263903e-05 |
| approx. rel. res. after 8 iterations: 3.351799e-06 |
| approx. rel. res. after 9 iterations: 8.567047e-07 |
| approx. rel. res. after 10 iterations: 2.091744e-07 |
| approx. rel. res. after 11 iterations: 5.094827e-08 |
| approx. rel. res. after 12 iterations: 1.216494e-08 |
| approx. rel. res. after 13 iterations: 2.904565e-09 |
| approx. rel. res. after 14 iterations: 6.856662e-10 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 14 coarse average: 292.79 |
| exact relative residual: ||r||/||b|| = 6.856662e-10 |
| elapsed wall clock time: 6.2996 seconds |
| coarse grid time: 4.8121 seconds (76.4%) |
| consumed core minutes*: 1.34e+01 (solve only) |
| max used mem/MPIproc: 2.78e-01 GB |
+----------------------------------------------------------+
Note that between the two runs above, there is a factor of two in the number of processes. However, I see the same problems with the same number of processes; I just don't have results for this particular, exemplary set of parameters.
Hmm, I don't like it. It's something we didn't notice in the runs for the Nf=2+1+1 ensemble, and we use the same package setup.
Something you could try, but I don't know if it will work, is to link the master branch of tmLQCD to the DDalphaAMG_nd branch of DDalphaAMG, so we can check whether the problem is in the interface or in the solver.
I will check the changes I made and try to come up with some ideas.
Is it because I haven't specified MGNumberOfShifts = 4?
Something you could try, but I don't know if it will work, is to link the master branch of tmLQCD to the DDalphaAMG_nd branch of DDalphaAMG, so we can check whether the problem is in the interface or in the solver.
Will test this out.
It seems that the problem is in DDalphaAMG, rather than the interface. Using the master
branch of Finkenrath/tmLQCD together with the TM2p1p1
branch of sbacchio/DDalphaAMG has the same problems as described above:
Problematic:
depth: 1, iter: 1, p->H(1,0) = +0.009670+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009650+0.000000i
depth: 1, iter: 1, p->H(1,0) = +0.009739+0.000000i
depth: 0, time spent for setting up next coarser operator: 0.073741 seconds
depth: 1, time spent for setting up next coarser operator: 0.042813 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.048123 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036359 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.130586 seconds
performed 5 iterative setup steps
elapsed time: 13.709875 seconds (2.341967 seconds on coarse grid)
DDalphaAMG setup ran, time 15.94 sec (14.69 % on coarse grid)
depth: 0, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 1, mu updated to 0.004000 on even sites and 0.376001 on odd sites
depth: 2, mu updated to 0.012000 on even sites and 1.128004 on odd sites
+----------------------------------------------------------+
depth: 1, iter: 1, p->H(1,0) = +0.008553+0.000000i
| approx. rel. res. after 1 iterations: 2.693876e-02 |
| approx. rel. res. after 2 iterations: 9.422674e-03 |
| approx. rel. res. after 3 iterations: 3.136621e-03 |
| approx. rel. res. after 4 iterations: 1.244779e-03 |
| approx. rel. res. after 5 iterations: 4.886695e-04 |
| approx. rel. res. after 6 iterations: 1.909823e-04 |
| approx. rel. res. after 7 iterations: 7.708101e-05 |
| approx. rel. res. after 8 iterations: 3.028029e-05 |
| approx. rel. res. after 9 iterations: 1.209484e-05 |
| approx. rel. res. after 10 iterations: 4.876731e-06 |
| approx. rel. res. after 11 iterations: 1.936528e-06 |
| approx. rel. res. after 12 iterations: 7.807262e-07 |
| approx. rel. res. after 13 iterations: 3.124696e-07 |
| approx. rel. res. after 14 iterations: 1.238244e-07 |
| approx. rel. res. after 15 iterations: 4.957753e-08 |
| approx. rel. res. after 16 iterations: 1.986782e-08 |
| approx. rel. res. after 17 iterations: 7.987017e-09 |
| approx. rel. res. after 18 iterations: 3.190570e-09 |
| approx. rel. res. after 19 iterations: 1.264548e-09 |
| approx. rel. res. after 20 iterations: 5.055527e-10 |
| approx. rel. res. after 21 iterations: 2.021383e-10 |
| approx. rel. res. after 22 iterations: 8.120851e-11 |
| approx. rel. res. after 23 iterations: 3.276034e-11 |
| approx. rel. res. after 24 iterations: 1.314241e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 24 coarse average: 3.96 |
| exact relative residual: ||r||/||b|| = 1.314241e-11 |
| elapsed wall clock time: 10.9579 seconds |
| coarse grid time: 6.8740 seconds (62.7%) |
| consumed core minutes*: 4.68e+01 (solve only) |
| max used mem/MPIproc: 1.93e-01 GB |
+----------------------------------------------------------+
Unproblematic (master + master):
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.039350 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.036769 seconds
depth: 1, bootstrap step number 3...
depth: 1, time spent for setting up next coarser operator: 0.034938 seconds
performed 5 iterative setup steps
elapsed time: 26.075408 seconds (12.331877 seconds on coarse grid)
DDalphaAMG setup ran, time 28.19 sec (43.75 % on coarse grid)
depth: 0, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 1, updating mu to 0.000000 on even sites and 0.000000 on odd sites
depth: 2, updating mu to 0.000000 on even sites and 0.000000 on odd sites
+----------------------------------------------------------+
| approx. rel. res. after 1 iterations: 2.981504e-02 |
| approx. rel. res. after 2 iterations: 8.115737e-03 |
| approx. rel. res. after 3 iterations: 1.613163e-03 |
| approx. rel. res. after 4 iterations: 3.403916e-04 |
| approx. rel. res. after 5 iterations: 6.901793e-05 |
| approx. rel. res. after 6 iterations: 1.509629e-05 |
| approx. rel. res. after 7 iterations: 3.174391e-06 |
| approx. rel. res. after 8 iterations: 6.519720e-07 |
| approx. rel. res. after 9 iterations: 1.452323e-07 |
| approx. rel. res. after 10 iterations: 3.097001e-08 |
| approx. rel. res. after 11 iterations: 6.925372e-09 |
| approx. rel. res. after 12 iterations: 1.462020e-09 |
| approx. rel. res. after 13 iterations: 3.030030e-10 |
| approx. rel. res. after 14 iterations: 6.678557e-11 |
| approx. rel. res. after 15 iterations: 1.420444e-11 |
+----------------------------------------------------------+
+----------------------------------------------------------+
| FGMRES iterations: 15 coarse average: 16.67 |
| exact relative residual: ||r||/||b|| = 1.420444e-11 |
| elapsed wall clock time: 1.6075 seconds |
| coarse grid time: 0.5843 seconds (36.3%) |
| consumed core minutes*: 6.86e+00 (solve only) |
| max used mem/MPIproc: 1.29e-01 GB |
+----------------------------------------------------------+
@sunpho84 This could be the reason why your test simulation on Marconi A2 was even slower than expected and why inversions were not converging if done outside of the HMC. If I remember correctly, we set up the TM2p1p1 branch of DDalphaAMG as well as the DDalphaAMG_nd branch of tmLQCD, correct?
Yes I was using your suggestion, that is:
https://github.com/Finkenrath/tmLQCD/tree/DDalphaAMG_nd
linked against
Ok, I will work on this starting today... My guess is that I broke the e/o preconditioning for the smoother when an odd-sized block is used. The point is that everything is working fine in our runs and I've never noticed convergence issues, so the problem should be in some "special" case that I didn't check.
@kostrzewa For confirming that, could you please try to run with an even sized block? like 4 3 3 3?
Thanks!
Would 6x4x4x4 be okay too?
sorry, I meant 6x3x3x3
Yes, that should be fine! And then maybe we should try to turn off the e/o and then the SSE. You can turn off the e/o by changing line 989 of init.c in DDalphaAMG, while the SSE is turned off in the Makefile.
So with 6x3x3x3 I get the same p->H(1,0) messages, which I had not seen before.
warning: The SSE implementation is based on the odd-even preconditioned code.
Switch on odd-even preconditioning in the input file.
error: assertion "g.odd_even" failed (build/gsrc/init.c:1092)
bad choice of input parameters (please read the user manual in /doc).
So I need to disable both SSE and e/o.
And that fails:
build/gsrc/coarse_operator_float.c(47): error: identifier "SIMD_LENGTH_float" is undefined
int column_offset = 2*SIMD_LENGTH_float*((l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
^
build/gsrc/coarse_operator_float.c(55): error: identifier "SIMD_LENGTH_float" is undefined
int column_offset = SIMD_LENGTH_float*((2*l->num_parent_eig_vect+SIMD_LENGTH_float-1)/SIMD_LENGTH_float);
^
Trying a clean build.
Nope.
@sunpho84 if you're still interested in the A40.40 run (or was it A30.40 ?) you can try with the master branch of sbacchio/DDalphaAMG and the master branch of Finkenrath/tmLQCD. It might be that it works better then. (we also had an odd kind of blocking, correct?)
Ah right, clear! I forgot about that... the SSE is based on the e/o. Removing both should work: e/o = 0 and a Makefile without -DSSE in OPT_VERSION_FLAGS. Since you are editing the Makefile, could you please also enable -DDEBUG in OPT_VERSION_FLAGS?
I'm really sorry to make you try things, but I've not been able to replicate your problem.
I tried to disable both e/o and SSE; the result is that SIMD_LENGTH_float is undefined...
@sunpho84 if you're still interested in the A40.40 run (or was it A30.40 ?) you can try with the master branch of sbacchio/DDalphaAMG and the master branch of Finkenrath/tmLQCD. It might be that it works better then. (we also had an odd kind of blocking, correct?)
I thought that the TM2p1p1 was the correct one for nf=2+1+1?
Well, yes, but if you don't run with DDalphaAMG in the heavy sector, then you don't need the extra stuff.
@sbacchio Okay, I think I might have to give up for now. I think there might be a compiler issue on the machine that I was trying this on.
@sunpho84
On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.
@sbacchio So you tried to reproduce this on a 24c48 lattice with the 3x3x3x3 aggregation? If you can't reproduce it, then the problem is probably on my side. There are some odd things going on on the machine that I was using. If I get a chance, I'll compile with GCC to see if that works.
@sunpho84 On Marconi A2, did you see the p->H(1,0) ... output? I can't remember.
Yes, in the old logs; see e.g. /marconi_work/INF17_lqcd123_0/sanfo/hmcnf2p1p1/A40.40/logs/log_mg_1490524967. Then I tried a few variations of the settings (following some of sbacchio's suggestions) and this warning disappeared; see the later logs in the logs/ folder.
Sorry, yesterday I had to leave early.
I've now pushed a version which can be compiled without SSE and that has a possible bug fix... I'm trying to compare the two versions, but I made so many changes that it is hard to find the right place.
@sunpho84 can you remind me what the differences are between before and after having the p->H(1,0) output?
@kostrzewa I didn't have exactly that configuration, but trying with what I have, I've not been able to reproduce the p->H(1,0) warning.
It looks to me as if it happens on a random basis. Here is a sample:
+----------------------------------------------------------+
| 2-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.937588 |
| csw: +0.000000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 80 40 40 40 |
| local lattice: 4 10 10 10 |
| block lattice: 4 5 5 5 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 24 |
+---------------------- depth 1 --------------------------+
| global lattice: 20 8 8 8 |
| local lattice: 1 2 2 2 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 1, iter: 1, p->H(1,0) = +nan+0.000000i
[...]
Ok, I confirm that the construction of the coarse operator is broken when an odd size is used in the fastest-running index.
For the moment, a workaround is to comment out in vectorization_control.h the lines
#define INTERPOLATION_OPERATOR_LAYOUT_OPTIMIZED_float
#define INTERPOLATION_SETUP_LAYOUT_OPTIMIZED_float
I hope to solve it today!
It should be fixed.
@kostrzewa can you check if now it works? :)
@sbacchio I'm checking this now, thanks!
It seems like it works. I don't fully understand, however, what happens in the following situation: running 24c48 using 512 MPI processes with a 3-level setup and 3^4 aggregates, with an 8x4x4x4 parallelisation. Naively, I would expect that a 3-level setup would even be impossible here.
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 6 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 16 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 8 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
The corresponding 2-level setup works fine, so I think this bug has been fixed. What I don't understand is the output for the second level of the above 3-level setup.
@sbacchio The MMS solver in DDalphaAMG and the tmLQCD DDalphaAMG_nd interface seem to have diverged. I was testing the 2p1p1 branch of DDalphaAMG using the standard interface (Finkenrath/tmLQCD/DDalphaAMG) without the calls to the 1+1 functions, so I did not notice until now.
The pointer to the array of tolerances that is passed has not been implemented in the interface (see, for instance, DDalphaAMG_solve_ms_doublet_squared_odd).
I've set up a pull request for this at https://github.com/Finkenrath/tmLQCD/pull/8
Sorry I missed your message the other day
@sbacchio I think there may be other problems, related to what I wrote above in https://github.com/etmc/tmLQCD/issues/362#issuecomment-298159660
If you look at the output below, the lattice dimensions seem to be completely sensible for a 3-level, 4^2*3^2 setup. However, I believe that something weird happens when the blocking is done, wouldn't you say? The run crashes during setup with a segmentation fault.
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430229 |
| csw: +1.740000 |
| mu: +0.001200 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 96 48 48 48 |
| local lattice: 8 6 6 8 |
| block lattice: 4 3 3 4 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 24 16 16 12 |
| local lattice: 24 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 12 8 8 6 |
| local lattice: 12 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.011400 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
Or would I need at least four blocks in any one of the lattice dimensions, such that even-odd on the third level works?
Oufff... No, something like that should work! At what point of the setup does the seg fault appear?
On 4 May 2017 at 08:32, Bartosz Kostrzewa notifications@github.com wrote:
Or would I need at least four blocks in any one of the lattice dimensions, such that even-odd on the third level works?
I'll check when I get to the office.
So, just to understand: if on the second level it has 2x2x2x2 local lattice points, would it automatically aggregate only in three dimensions to have 2x1x1x1 on the coarsest level? Or would it aggregate down to 1x1x1x1 on the coarsest level and simply skip every second MPI process when working on the coarsest level to do even-odd?
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.196736 seconds
depth: 1, time spent for setting up next coarser operator: 0.266321 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.258096 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.169192 seconds
depth: 1, time spent for setting up next coarser operator: 0.268444 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.258697 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.171508 seconds
depth: 1, time spent for setting up next coarser operator: 0.268366 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.261473 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.262536 seconds
depth: 0, bootstrap step number 4...
depth: 0, time spent for setting up next coarser operator: 0.257923 seconds
depth: 1, time spent for setting up next coarser operator: 0.263163 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.257366 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.258957 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.168741 seconds
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 18571 RUNNING AT r076c06s04-hfi.marconi.cineca.it
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
I'm happy to debug this some more. Would you get more info with some of the debug compiler flags for DDalphaAMG?
On the 24c48 lattice that I mentioned in https://github.com/etmc/tmLQCD/issues/362#issuecomment-298159660, the situation is analogous:
+----------------------------------------------------------+
| 3-level method |
| postsmoothing K-cycle |
| FGMRES + red-black multiplicative Schwarz |
| restart length: 30 |
| m0: -0.430216 |
| csw: +1.740000 |
| mu: +0.004000 |
+----------------------------------------------------------+
| preconditioner cycles: 1 |
| inner solver: minimal residual iteration |
| precision: single |
+---------------------- depth 0 --------------------------+
| global lattice: 48 24 24 24 |
| local lattice: 6 6 6 6 |
| block lattice: 3 3 3 3 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 5 |
| test vectors: 20 |
+---------------------- depth 1 --------------------------+
| global lattice: 16 8 8 8 |
| local lattice: 16 2 2 2 |
| block lattice: 2 2 2 2 |
| post smooth iter: 4 |
| smoother inner iter: 4 |
| setup iter: 3 |
| test vectors: 28 |
+---------------------- depth 2 --------------------------+
| global lattice: 8 4 4 4 |
| local lattice: 8 1 1 1 |
| block lattice: 1 1 1 1 |
| coarge grid solver: odd even GMRES |
| iterations: 200 |
| cycles: 10 |
| tolerance: 1e-01 |
| mu: +0.012000 |
+----------------------------------------------------------+
| K-cycle length: 5 |
| K-cycle restarts: 2 |
| K-cycle tolerance: 1e-01 |
+----------------------------------------------------------+
depth: 0, bootstrap step number 1...
depth: 0, time spent for setting up next coarser operator: 0.101174 seconds
depth: 1, time spent for setting up next coarser operator: 0.174573 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.173764 seconds
depth: 0, bootstrap step number 2...
depth: 0, time spent for setting up next coarser operator: 0.100405 seconds
depth: 1, time spent for setting up next coarser operator: 0.171963 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.171607 seconds
depth: 0, bootstrap step number 3...
depth: 0, time spent for setting up next coarser operator: 0.101836 seconds
depth: 1, time spent for setting up next coarser operator: 0.175991 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.172342 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.171551 seconds
depth: 0, bootstrap step number 4...
depth: 0, time spent for setting up next coarser operator: 0.101485 seconds
depth: 1, time spent for setting up next coarser operator: 0.173913 seconds
depth: 1, bootstrap step number 1...
depth: 1, time spent for setting up next coarser operator: 0.171866 seconds
depth: 1, bootstrap step number 2...
depth: 1, time spent for setting up next coarser operator: 0.172153 seconds
depth: 0, bootstrap step number 5...
depth: 0, time spent for setting up next coarser operator: 0.099945 seconds
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 5407 RUNNING AT r079c02s02-hfi.marconi.cineca.it
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
So, just to understand: if on the second level it has 2x2x2x2 local lattice points, it would aggregate only in three dimensions to have 2x1x1x1 on the coarsest level? Or would it aggregate down to 1x1x1x1 on the coarsest level and simply idle one MPI process to do even-odd?
yes, since odd-even is enabled you need to have an even local volume on the coarsest grid.
Ah ok, the bug is not in the setup phase but in something done just after... or rather, it's during the last setup iteration. Hmm, I should try to replicate the problem in order to study it here.
I will check on it.
yes, since odd-even is enabled you need to have an even local volume on the coarsest grid.
sorry, but yes to which question?
The rule is that you need at least a factor of 2 in the coarsest local lattice.
So yes, it should aggregate in just three directions. I understand that from tmLQCD you don't have control over the coarse block lattice, but it should be handled automatically. If it doesn't work, it is easy to fix, or to expose the coarse block lattice.
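To make the geometry discussion concrete, here is a small sketch (my own illustration, not DDalphaAMG code) that walks the 24c48 example through the levels under a naive assumption: every level keeps the same process grid, local lattice = global lattice divided by the process grid, and each aggregation step divides the global lattice by the block lattice. The odd-even rule stated above then amounts to requiring an even local volume on the coarsest grid. Note that this naive division does not reproduce DDalphaAMG's printed depth-1 local lattice (16 2 2 2 in the logs above), which is exactly the output that was confusing; presumably processes are redistributed or idled on coarser levels.

```python
# Naive multilevel geometry sketch (illustration only, not DDalphaAMG code).
# Assumption: the same process grid is used on every level.
import math

def coarsen(global_lat, procs, blocks_per_level):
    """Return [(global, local), ...] per level for the given blockings."""
    levels = [(global_lat, [g // p for g, p in zip(global_lat, procs)])]
    for blocks in blocks_per_level:
        global_lat = [g // b for g, b in zip(global_lat, blocks)]
        levels.append((global_lat, [g // p for g, p in zip(global_lat, procs)]))
    return levels

# 24c48 lattice on an 8x4x4x4 process grid, 3^4 then 2^4 aggregation
levels = coarsen([48, 24, 24, 24], [8, 4, 4, 4], [[3] * 4, [2] * 4])
for depth, (g, l) in enumerate(levels):
    print(f"depth {depth}: global {g}, local {l}, local volume {math.prod(l)}")

# Odd-even on the coarsest grid needs an even local volume there
# ("at least a factor of 2 in the coarsest local lattice"):
coarsest_vol = math.prod(levels[-1][1])
print("odd-even possible with this naive layout:", coarsest_vol % 2 == 0)
```

Under these assumptions the coarsest local lattice comes out as 1x1x1x1 (odd volume), so the solver must aggregate in fewer directions, or redistribute processes, for odd-even to be possible, consistent with the explanation above.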
On 4 May 2017 at 12:16, Bartosz Kostrzewa notifications@github.com wrote:
yes, since odd-even is enabled you need to have an even local volume on the coarsest grid.
sorry, but yes to which question?
Okay, thanks. I think it would be quite helpful if this worked at some point.
@sbacchio @Finkenrath Over the last few days I've had some time to try to understand an issue which has been bugging me a bit, because I would like to run with the TM2p1p1 branch of sbacchio/DDalphaAMG and the corresponding head commit of the DDalphaAMG_nd branch of Finkenrath/tmLQCD to help with convergence in the heavy sector. However, I'm finding severe convergence problems and further issues. First, a comparison to a working setup: when I set up the head commit of the master branch of sbacchio/DDalphaAMG together with the head commit of the master branch of Finkenrath/tmLQCD, I get great convergence in the light sector and the expected iteration counts for the given aggregation and scale parameters. Doing the same with the aforementioned branches for 2+1+1 results in solves which do not converge and in output which I have not seen before:
To compare, the working setup looks like this:
and is significantly faster, as you can see.
Have you seen this behaviour?