Closed: davidherreroperez closed this issue 3 years ago.
I'm surprised at the results of AMG on an elasticity problem - I would expect better convergence.
> The residual in the initial iteration caught my attention: the initial solution is set to 0 in both implementations, but the initial residual is quite different (6.142953 vs 1.017959e+05).

Right, that sounds a little bit off. Does HYPRE use the L2 norm too?

> Do you have any advice for improving the convergence?

Note that your matrix is relatively small for the GPU to be effective, and it's hard to make comparisons for such GPU loads. Try using a solver on the coarsest level, either a direct solver or a Jacobi solver. Sometimes going over the grid multiple times is better than reducing the residual on an individual level, so aggressive coarsening may help: it should give a smaller grid and make each cycle faster, and it will likely increase the number of iterations, but the overall solve might still be faster. Alternatively, you might try a different smoother or more sweeps on each level with your configuration and see if more smoothing helps.
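As a concrete illustration of those knobs (an editorial sketch, not a configuration from this thread): the key names below are the ones already used in the JSON configuration posted in the original message, while the DENSE_LU_SOLVER coarse solver and MULTICOLOR_GS smoother are assumed option names that should be checked against the AMGX build at hand, and the sweep counts are purely illustrative. Only the keys to change inside the "preconditioner" block are shown.

{
  "preconditioner": {
    "coarse_solver": "DENSE_LU_SOLVER",
    "aggressive_levels": 2,
    "presweeps": 2,
    "postsweeps": 2,
    "coarsest_sweeps": 4,
    "smoother": {
      "solver": "MULTICOLOR_GS",
      "relaxation_factor": 0.8
    }
  }
}

The intent follows the advice above: a real coarse solve instead of NOSOLVER, a couple of aggressive-coarsening levels to shrink the hierarchy, and heavier smoothing per level; each of these trades per-cycle cost against iteration count.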
> The residual in the initial iteration caught my attention: the initial solution is set to 0 in both implementations, but the initial residual is quite different (6.142953 vs 1.017959e+05).

> Right, that sounds a little bit off. Does HYPRE use the L2 norm too?

Right, when using the L2 norm in hypre the residual is of the same order of magnitude, but hypre still runs fewer iterations:
Iters    ||r||_2        conv.rate   ||r||_2/||b||_2
    1    6.115007e+05   6.007122    6.007122e+00
    2    1.929486e+06   3.155330    1.895445e+01
    3    2.647599e+06   1.372178    2.600889e+01
    4    3.604505e+06   1.361424    3.540912e+01
    5    2.765375e+06   0.767200    2.716587e+01
    6    1.832435e+06   0.662635    1.800107e+01
    7    9.993769e+05   0.545382    9.817454e+00
    8    4.702297e+05   0.470523    4.619337e+00
    9    1.911626e+05   0.406530    1.877900e+00
   10    8.431356e+04   0.441057    8.282606e-01
   11    3.430956e+04   0.406928    3.370425e-01
   12    1.381077e+04   0.402534    1.356711e-01
   13    5.321191e+03   0.385293    5.227312e-02
   14    2.354299e+03   0.442438    2.312763e-02
   15    1.046448e+03   0.444484    1.027986e-02
   16    4.454500e+02   0.425678    4.375911e-03
   17    1.491778e+02   0.334892    1.465460e-03
   18    4.763654e+01   0.319327    4.679611e-04
   19    1.535605e+01   0.322359    1.508514e-04
   20    6.011378e+00   0.391466    5.905322e-05
   21    3.134943e+00   0.521502    3.079635e-05
   22    1.415178e+00   0.451421    1.390211e-05
   23    5.738508e-01   0.405497    5.637266e-06
   24    2.316464e-01   0.403670    2.275596e-06
   25    9.736563e-02   0.420320    9.564785e-07
   26    3.892466e-02   0.399778    3.823793e-07
   27    1.449054e-02   0.372272    1.423489e-07
   28    5.416901e-03   0.373823    5.321333e-08
   29    1.966882e-03   0.363101    1.932181e-08
   30    6.882711e-04   0.349930    6.761282e-09
   31    2.277901e-04   0.330960    2.237713e-09
   32    7.819669e-05   0.343284    7.681710e-10
   33    2.616625e-05   0.334621    2.570461e-10
   34    9.359429e-06   0.357691    9.194305e-11
   35    3.651801e-06   0.390173    3.587374e-11
   36    2.131717e-06   0.583744    2.094108e-11
   37    2.857893e-06   1.340653    2.807472e-11
   38    4.652273e-06   1.627868    4.570195e-11
   39    3.732831e-06   0.802367    3.666974e-11
   40    1.513505e-06   0.405458    1.486803e-11
   41    5.176708e-07   0.342034    5.085378e-12
   42    1.723921e-07   0.333015    1.693507e-12
   43    5.844971e-08   0.339051    5.741851e-13
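(Side note on why the earlier residuals looked so different: the two logs report different quantities. hypre's PCG log, when its two-norm option is off, presumably prints the preconditioned residual norm ||r||_C, with C the BoomerAMG preconditioner, which is also why it prints the <C*b,b> value in its output, whereas AMGX was configured with "norm": "L2". Roughly,

\|r\|_C = \sqrt{\langle C r, r \rangle}, \qquad
\|r\|_2 = \sqrt{\langle r, r \rangle}, \qquad
\frac{\|r\|_C}{\|b\|_C} \ \text{with}\ \|b\|_C = \sqrt{\langle C b, b \rangle}.

So the initial residuals 6.142953 and 1.017959e+05 are different norms of the same vector and are not directly comparable; once both logs use the L2 norm, as in the table above, the magnitudes match.)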
> Do you have any advice for improving the convergence?

> Note that your matrix is relatively small for the GPU to be effective, and it's hard to make comparisons for such GPU loads. Try using a solver on the coarsest level, either a direct solver or a Jacobi solver. Sometimes going over the grid multiple times is better than reducing the residual on an individual level, so aggressive coarsening may help: it should give a smaller grid and make each cycle faster, and it will likely increase the number of iterations, but the overall solve might still be faster. Alternatively, you might try a different smoother or more sweeps on each level with your configuration and see if more smoothing helps.

Thank you for your comments, I'm going to test them.
Hi marsaev,
I have run a lot of tests and I was not able to reduce the number of PCG iterations by using a solver on the coarsest level of the AMG preconditioner. Using a different smoother or aggressive coarsening doesn't help either. The number of PCG iterations can be reduced by increasing the number of iterations of the AMG preconditioner, but at the cost of a higher computational cost per PCG iteration.
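(For reference, a minimal sketch of that last variant, assuming it corresponds to raising the inner "max_iters" of the preconditioner in the configuration from the original post; the value 2 is purely illustrative:

{
  "preconditioner": {
    "max_iters": 2
  }
}

Each outer PCG iteration then applies two V-cycles instead of one, which is consistent with fewer outer iterations at a higher cost per iteration.)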
Looking at Table 3 of the paper
AMGX: A LIBRARY FOR GPU ACCELERATED ALGEBRAIC MULTIGRID AND PRECONDITIONED ITERATIVE METHODS, SIAM J. SCI. COMPUT. https://asc.ziti.uni-heidelberg.de/sites/default/files/research/papers/public/NaArCa_15AmgX.pdf
the number of iterations of CG-AMG for HYPRE and AMGX is of a similar order of magnitude and does not increase meaningfully with the problem size. However, these are the results (wall clock time and iterations) that I obtain when increasing the problem size:
scantilever2d_01_it-dof.pdf scantilever2d_01_wct-dof.pdf
Please, can you provide a similar JSON configuration file to test the CG-AMG mentioned in your paper?
Hi David,
Because this is elasticity and not just a pressure system, there are more modes in the null space of the PDE. AmgX doesn't have an efficient elasticity solver for these kinds of problems yet. We need to add in Smoothed Aggregation AMG to enable that.
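(For context, this is the standard near-null-space argument: for a scalar, pressure-like problem the near-null space of the operator is essentially the constant vector, while for 2D linear elasticity it also contains the rigid body modes, e.g.

u_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad
u_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad
u_3(x, y) = \begin{pmatrix} -y \\ x \end{pmatrix},

i.e. two translations plus a rotation, and six modes in 3D. Classical AMG interpolation captures constants well but not the rotational mode, which is consistent with the convergence seen above; smoothed aggregation AMG can take these modes as input when building the interpolation.)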
Closing, as this is out of scope for AMGX.
I'm testing AMGX using the following JSON configuration:
{ "config_version": 2, "solver": { "preconditioner": { "print_grid_stats": 1, "solver": "AMG", "algorithm": "CLASSICAL", "cycle": "V", "max_levels": 25, "max_iters": 1, "coarse_solver": "NOSOLVER", "min_coarse_rows": 6, "strength_threshold": 0.69, "interp_max_elements": 4, "interp_truncation_factor": 0.0, "max_row_sum": 0.9, "selector": "HMIS", "strength": "AHAT", "interpolator": "D2", "aggressive_levels": 0, "presweeps": 1, "postsweeps": 1, "coarsest_sweeps": 1, "norm": "L1", "smoother": { "solver": "BLOCK_JACOBI", "relaxation_factor": 0.7 } }, "monitor_residual": 1, "store_res_history": 1, "obtain_timings": 1, "print_solve_stats": 1, "solver": "PCGF", "max_iters": 50000, "convergence": "ABSOLUTE", "tolerance": 1e-12, "norm": "L2" } }
For a 2D elasticity problem, I obtain the following output:
AMGX version 2.0.0.130-opensource
Built on Jun 26 2019, 12:49:02
Compiled with CUDA Runtime 10.1, using CUDA driver 10.2

AMG Grid:
  Number of Levels: 11
  LVL   ROWS   NNZ   SPRSTY   Mem (GB)
Relative residual has reached machine precision
  137   0.9400   6.138344e-08   0.5637
Everything seems OK, but when comparing the convergence against an (approximately) similar configuration in the GPU implementation of hypre (I was not able to tune AMGX for better convergence), I obtain the following results:
BoomerAMG SETUP PARAMETERS:
Max levels = 25
Num levels = 9

Strength Threshold = 0.500000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000

Coarsening Type = HMIS
measures are determined locally
No global partition option chosen.
Interpolation = extended+i interpolation
Operator Matrix Information:

              nonzero           entries per row           row sums
lev    rows   entries  sparse    min  max    avg        min          max
  0   37442    667012   0.000      8   18   17.8   -4.038e+10   1.413e+11
  1   14304    403650   0.002      9   52   28.2   -2.917e+10   1.456e+11
  2    6170    226908   0.006     10   99   36.8   -2.514e+10   2.145e+11
  3    1904     91874   0.025     16  111   48.3   -2.250e+10   2.461e+11
  4     683     39653   0.085     19  117   58.1   -4.045e+10   3.080e+11
  5     235     11297   0.205     21  101   48.1   -2.390e+10   3.037e+11
  6      77      2633   0.444     17   64   34.2   -2.018e+10   2.053e+11
  7      23       389   0.735     10   23   16.9   -9.040e+09   2.148e+11
  8       8        58   0.906      6    8    7.2   -7.191e+09   2.340e+11
Interpolation Matrix Information:

                       entries/row      min         max           row sums
lev    rows x cols      min   max      weight      weight       min        max
  0   37442 x 14304       0     4    2.150e-02   5.000e-01   0.000e+00  1.000e+00
  1   14304 x 6170        1     4    2.122e-02   1.000e+00   2.778e-01  1.000e+00
  2    6170 x 1904        1     4    3.040e-02   1.000e+00   1.849e-01  1.000e+00
  3    1904 x 683         1     4    1.617e-02   1.000e+00   1.229e-01  1.000e+00
  4     683 x 235         1     4    1.910e-02   1.000e+00   1.203e-01  1.000e+00
  5     235 x 77          0     4    1.340e-02   1.000e+00   0.000e+00  1.000e+00
  6      77 x 23          1     4    2.200e-02   1.000e+00   3.151e-01  1.000e+00
  7      23 x 8           1     3    6.892e-02   1.000e+00   1.527e-01  1.000e+00
Complexity: grid = 1.625073 operator = 2.164090 memory = 2.375731
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles:          1
Stopping Tolerance:                0.000000e+00
Cycle type (1 = V, 2 = W, etc.):   1
Relaxation Parameters:
  Visiting Grid:                      down   up   coarse
  Number of sweeps:                      1    1      1
  Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:      18   18      9
  Point types, partial sweeps (1=C, -1=F):
    Pre-CG relaxation (down):   0
    Post-CG relaxation (up):    0
    Coarsest grid:              0
Setup phase times:
PCG Setup:
  wall clock time = 0.070000 seconds
  wall MFLOPS     = 0.000000
  cpu clock time  = 0.071389 seconds
  cpu MFLOPS      = 0.000000
<C*b,b>: 4.196588e+01
Iters    ||r||_C        conv.rate   ||r||_C/||b||_C
    1    6.142953e+00   0.948263    9.482634e-01
    2    8.091141e+00   1.317142    1.248997e+00
    3    7.651265e+00   0.945635    1.181096e+00
    4    7.749352e+00   1.012820    1.196237e+00
    5    5.510017e+00   0.711029    8.505596e-01
    6    3.324676e+00   0.603388    5.132172e-01
    7    1.725254e+00   0.518924    2.663207e-01
    8    8.005101e-01   0.463995    1.235716e-01
    9    3.372854e-01   0.421338    5.206542e-02
   10    1.404207e-01   0.416326    2.167619e-02
   11    5.562314e-02   0.396118    8.586325e-03
   12    2.158910e-02   0.388132    3.332624e-03
   13    8.530058e-03   0.395110    1.316751e-03
   14    3.749567e-03   0.439571    5.788059e-04
   15    1.667284e-03   0.444661    2.573721e-04
   16    6.628488e-04   0.397562    1.023214e-04
   17    2.191902e-04   0.330679    3.383553e-05
   18    6.956669e-05   0.317380    1.073874e-05
   19    2.326769e-05   0.334466    3.591741e-06
   20    9.531555e-06   0.409648    1.471349e-06
   21    4.723107e-06   0.495523    7.290875e-07
   22    2.076413e-06   0.439629    3.205276e-07
   23    8.300910e-07   0.399772    1.281379e-07
   24    3.342506e-07   0.402667    5.159695e-08
   25    1.406351e-07   0.420748    2.170929e-08
   26    5.553424e-08   0.394882    8.572602e-09
   27    2.060141e-08   0.370968    3.180159e-09
   28    7.628654e-09   0.370298    1.177605e-09
   29    2.737018e-09   0.358781    4.225027e-10
   30    9.508941e-10   0.347420    1.467858e-10
   31    3.180126e-10   0.334435    4.909035e-11
   32    1.082480e-10   0.340389    1.670982e-11
   33    3.679295e-11   0.339895    5.679583e-12
   34    1.349101e-11   0.366674    2.082553e-12
   35    5.658319e-12   0.419414    8.734524e-13
Looking at the timing per iteration, AMGX is faster than the GPU implementation in HYPRE 2.16.0, but it takes much more time to solve the problem overall due to its convergence profile.
The residual in the initial iteration caught my attention: the initial solution is set to 0 in both implementations, but the initial residual is quite different (6.142953 vs 1.017959e+05).
Do you have any advice for improving the convergence?