NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

High number of iterations of PCG using AMG preconditioner #70

Closed davidherreroperez closed 3 years ago

davidherreroperez commented 5 years ago

I'm testing AMGX using the following JSON configuration:

{
  "config_version": 2,
  "solver": {
    "preconditioner": {
      "print_grid_stats": 1,
      "solver": "AMG",
      "algorithm": "CLASSICAL",
      "cycle": "V",
      "max_levels": 25,
      "max_iters": 1,
      "coarse_solver": "NOSOLVER",
      "min_coarse_rows": 6,
      "strength_threshold": 0.69,
      "interp_max_elements": 4,
      "interp_truncation_factor": 0.0,
      "max_row_sum": 0.9,
      "selector": "HMIS",
      "strength": "AHAT",
      "interpolator": "D2",
      "aggressive_levels": 0,
      "presweeps": 1,
      "postsweeps": 1,
      "coarsest_sweeps": 1,
      "norm": "L1",
      "smoother": {
        "solver": "BLOCK_JACOBI",
        "relaxation_factor": 0.7
      }
    },
    "monitor_residual": 1,
    "store_res_history": 1,
    "obtain_timings": 1,
    "print_solve_stats": 1,
    "solver": "PCGF",
    "max_iters": 50000,
    "convergence": "ABSOLUTE",
    "tolerance": 1e-12,
    "norm": "L2"
  }
}

For a 2D elasticity problem, I obtain the following output:

AMGX version 2.0.0.130-opensource
Built on Jun 26 2019, 12:49:02
Compiled with CUDA Runtime 10.1, using CUDA driver 10.2
AMG Grid:
     Number of Levels: 11
        LVL         ROWS               NNZ    SPRSTY       Mem (GB)
       0(D)        37442            667012  0.000476        0.00857
       1(D)        18768            460286   0.00131         0.0111
       2(D)         9085            248323   0.00301        0.00596
       3(D)         4437            170689   0.00867        0.00401
       4(D)         2125             81971    0.0182        0.00193
       5(D)          988             42214    0.0432       0.000987
       6(D)          453             18665     0.091       0.000437
       7(D)          192              8602     0.233       0.000201
       8(D)           84              2840     0.402        6.7e-05
       9(D)           27               509     0.698       1.25e-05
      10(D)           10                92      0.92       2.37e-06
     --------------------------------------------------------------
     Grid Complexity: 1.966
     Operator Complexity: 2.55048
     Total Memory Usage: 0.0332938 GB
     --------------------------------------------------------------
       iter      Mem Usage (GB)       residual           rate
     --------------------------------------------------------------
        Ini            0.940002   1.017959e+05
          0            0.940002   2.026096e+05         1.9904
          1              0.9400   1.965676e+05         0.9702
          2              0.9400   3.044278e+05         1.5487
          3              0.9400   2.406096e+05         0.7904
          4              0.9400   1.860616e+05         0.7733
          5              0.9400   3.483283e+05         1.8721
          6              0.9400   4.396520e+05         1.2622
          7              0.9400   3.816977e+05         0.8682
          8              0.9400   5.531741e+05         1.4492
          9              0.9400   4.858742e+05         0.8783
         10              0.9400   5.429030e+05         1.1174
         11              0.9400   6.161665e+05         1.1349
         12              0.9400   5.682163e+05         0.9222
         13              0.9400   8.605111e+05         1.5144
         14              0.9400   8.649552e+05         1.0052
         15              0.9400   8.657122e+05         1.0009
         16              0.9400   7.576939e+05         0.8752
         17              0.9400   6.390614e+05         0.8434
         18              0.9400   8.154921e+05         1.2761
         19              0.9400   9.403014e+05         1.1530
         20              0.9400   1.091615e+06         1.1609
         21              0.9400   1.029720e+06         0.9433
         22              0.9400   8.169971e+05         0.7934
         23              0.9400   8.236699e+05         1.0082
         24              0.9400   1.095270e+06         1.3297
         25              0.9400   1.514124e+06         1.3824
         26              0.9400   1.884937e+06         1.2449
         27              0.9400   2.088267e+06         1.1079
         28              0.9400   2.193474e+06         1.0504
         29              0.9400   2.338554e+06         1.0661
         30              0.9400   2.475922e+06         1.0587
         31              0.9400   2.754205e+06         1.1124
         32              0.9400   2.952777e+06         1.0721
         33              0.9400   3.275251e+06         1.1092
         34              0.9400   3.238445e+06         0.9888
         35              0.9400   3.865738e+06         1.1937
         36              0.9400   5.804763e+06         1.5016
         37              0.9400   6.324366e+06         1.0895
         38              0.9400   4.985226e+06         0.7883
         39              0.9400   3.666905e+06         0.7356
         40              0.9400   2.500932e+06         0.6820
         41              0.9400   1.592519e+06         0.6368
         42              0.9400   1.012438e+06         0.6357
         43              0.9400   6.204862e+05         0.6129
         44              0.9400   3.650613e+05         0.5883
         45              0.9400   2.047202e+05         0.5608
         46              0.9400   1.143584e+05         0.5586
         47              0.9400   6.977541e+04         0.6101
         48              0.9400   5.195034e+04         0.7445
         49              0.9400   4.394363e+04         0.8459
         50              0.9400   3.453421e+04         0.7859
         51              0.9400   2.754990e+04         0.7978
         52              0.9400   1.810082e+04         0.6570
         53              0.9400   1.122531e+04         0.6202
         54              0.9400   7.026690e+03         0.6260
         55              0.9400   4.477836e+03         0.6373
         56              0.9400   3.138814e+03         0.7010
         57              0.9400   2.192449e+03         0.6985
         58              0.9400   1.438625e+03         0.6562
         59              0.9400   9.760741e+02         0.6785
         60              0.9400   6.253134e+02         0.6406
         61              0.9400   4.165686e+02         0.6662
         62              0.9400   2.673602e+02         0.6418
         63              0.9400   1.636236e+02         0.6120
         64              0.9400   1.019139e+02         0.6229
         65              0.9400   6.396497e+01         0.6276
         66              0.9400   3.913907e+01         0.6119
         67              0.9400   2.749223e+01         0.7024
         68              0.9400   2.033579e+01         0.7397
         69              0.9400   1.331817e+01         0.6549
         70              0.9400   8.211058e+00         0.6165
         71              0.9400   4.884590e+00         0.5949
         72              0.9400   2.804640e+00         0.5742
         73              0.9400   1.611881e+00         0.5747
         74              0.9400   9.454330e-01         0.5865
         75              0.9400   5.244297e-01         0.5547
         76              0.9400   2.898524e-01         0.5527
         77              0.9400   1.661665e-01         0.5733
         78              0.9400   9.825507e-02         0.5913
         79              0.9400   5.869813e-02         0.5974
         80              0.9400   3.351979e-02         0.5711
         81              0.9400   1.871421e-02         0.5583
         82              0.9400   1.032940e-02         0.5520
         83              0.9400   5.694609e-03         0.5513
         84              0.9400   3.123666e-03         0.5485
         85              0.9400   1.721449e-03         0.5511
         86              0.9400   9.461781e-04         0.5496
         87              0.9400   5.352409e-04         0.5657
         88              0.9400   2.990555e-04         0.5587
         89              0.9400   1.690378e-04         0.5652
         90              0.9400   9.826294e-05         0.5813
         91              0.9400   5.419635e-05         0.5515
         92              0.9400   2.949607e-05         0.5442
         93              0.9400   1.636941e-05         0.5550
         94              0.9400   9.352691e-06         0.5714
         95              0.9400   5.286201e-06         0.5652
         96              0.9400   3.037892e-06         0.5747
         97              0.9400   1.718972e-06         0.5658
         98              0.9400   9.272044e-07         0.5394
         99              0.9400   5.442608e-07         0.5870
        100              0.9400   5.203658e-07         0.9561
        101              0.9400   7.769338e-07         1.4931
        102              0.9400   1.212162e-06         1.5602
        103              0.9400   1.710330e-06         1.4110
        104              0.9400   2.004864e-06         1.1722
        105              0.9400   1.671733e-06         0.8338
        106              0.9400   1.122895e-06         0.6717
        107              0.9400   8.584259e-07         0.7645
        108              0.9400   1.258494e-06         1.4660
        109              0.9400   2.107670e-06         1.6748
        110              0.9400   3.378877e-06         1.6031
        111              0.9400   5.256111e-06         1.5556
        112              0.9400   6.812740e-06         1.2962
        113              0.9400   6.536627e-06         0.9595
        114              0.9400   4.766515e-06         0.7292
        115              0.9400   3.032003e-06         0.6361
        116              0.9400   1.903851e-06         0.6279
        117              0.9400   2.027909e-06         1.0652
        118              0.9400   3.117985e-06         1.5375
        119              0.9400   4.244961e-06         1.3614
        120              0.9400   4.975859e-06         1.1722
        121              0.9400   4.435136e-06         0.8913
        122              0.9400   3.259874e-06         0.7350
        123              0.9400   2.413191e-06         0.7403
        124              0.9400   2.924511e-06         1.2119
        125              0.9400   4.998405e-06         1.7091
        126              0.9400   7.475973e-06         1.4957
        127              0.9400   8.760996e-06         1.1719
        128              0.9400   7.681829e-06         0.8768
        129              0.9400   5.135122e-06         0.6685
        130              0.9400   3.088680e-06         0.6015
        131              0.9400   1.811603e-06         0.5865
        132              0.9400   1.063862e-06         0.5872
        133              0.9400   5.944577e-07         0.5588
        134              0.9400   3.333102e-07         0.5607
        135              0.9400   1.871863e-07         0.5616
        136              0.9400   1.088997e-07         0.5818
Relative residual has reached machine precision
        137              0.9400   6.138344e-08         0.5637

     Total Iterations: 138
     Avg Convergence Rate:               0.8156
     Final Residual:           6.138344e-08
     Total Reduction in Residual:      6.030048e-13
     Maximum Memory Usage:                0.940 GB
     --------------------------------------------------------------

Everything seems OK, but when I compare the convergence with an approximately similar configuration in the GPU implementation of hypre (I was not able to tune AMGX for better convergence), I obtain the following results:

BoomerAMG SETUP PARAMETERS:

Max levels = 25
Num levels = 9

Strength Threshold = 0.500000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000

Coarsening Type = HMIS
measures are determined locally

No global partition option chosen.

Interpolation = extended+i interpolation

Operator Matrix Information:

lev   rows  entries  sparse  min  max   avg         min         max
  0  37442   667012   0.000    8   18  17.8  -4.038e+10   1.413e+11
  1  14304   403650   0.002    9   52  28.2  -2.917e+10   1.456e+11
  2   6170   226908   0.006   10   99  36.8  -2.514e+10   2.145e+11
  3   1904    91874   0.025   16  111  48.3  -2.250e+10   2.461e+11
  4    683    39653   0.085   19  117  58.1  -4.045e+10   3.080e+11
  5    235    11297   0.205   21  101  48.1  -2.390e+10   3.037e+11
  6     77     2633   0.444   17   64  34.2  -2.018e+10   2.053e+11
  7     23      389   0.735   10   23  16.9  -9.040e+09   2.148e+11
  8      8       58   0.906    6    8   7.2  -7.191e+09   2.340e+11

Interpolation Matrix Information:

                     entries/row     min        max           row sums
lev   rows x cols      min  max     weight     weight       min        max
  0  37442 x 14304       0    4  2.150e-02  5.000e-01  0.000e+00  1.000e+00
  1  14304 x 6170        1    4  2.122e-02  1.000e+00  2.778e-01  1.000e+00
  2   6170 x 1904        1    4  3.040e-02  1.000e+00  1.849e-01  1.000e+00
  3   1904 x 683         1    4  1.617e-02  1.000e+00  1.229e-01  1.000e+00
  4    683 x 235         1    4  1.910e-02  1.000e+00  1.203e-01  1.000e+00
  5    235 x 77          0    4  1.340e-02  1.000e+00  0.000e+00  1.000e+00
  6     77 x 23          1    4  2.200e-02  1.000e+00  3.151e-01  1.000e+00
  7     23 x 8           1    3  6.892e-02  1.000e+00  1.527e-01  1.000e+00

Complexity: grid = 1.625073 operator = 2.164090 memory = 2.375731

BoomerAMG SOLVER PARAMETERS:

Maximum number of cycles: 1
Stopping Tolerance: 0.000000e+00
Cycle type (1 = V, 2 = W, etc.): 1

Relaxation Parameters:
   Visiting Grid:                     down   up  coarse
            Number of sweeps:            1    1     1
   Type 0=Jac, 3=hGS, 6=hSGS, 9=GE:     18   18     9
   Point types, partial sweeps (1=C, -1=F):
                  Pre-CG relaxation (down):   0
                   Post-CG relaxation (up):   0
                             Coarsest grid:   0

Setup phase times:

PCG Setup:
  wall clock time = 0.070000 seconds
  wall MFLOPS     = 0.000000
  cpu clock time  = 0.071389 seconds
  cpu MFLOPS      = 0.000000

<C*b,b>: 4.196588e+01

Iters       ||r||_C    conv.rate  ||r||_C/||b||_C
    1  6.142953e+00     0.948263     9.482634e-01
    2  8.091141e+00     1.317142     1.248997e+00
    3  7.651265e+00     0.945635     1.181096e+00
    4  7.749352e+00     1.012820     1.196237e+00
    5  5.510017e+00     0.711029     8.505596e-01
    6  3.324676e+00     0.603388     5.132172e-01
    7  1.725254e+00     0.518924     2.663207e-01
    8  8.005101e-01     0.463995     1.235716e-01
    9  3.372854e-01     0.421338     5.206542e-02
   10  1.404207e-01     0.416326     2.167619e-02
   11  5.562314e-02     0.396118     8.586325e-03
   12  2.158910e-02     0.388132     3.332624e-03
   13  8.530058e-03     0.395110     1.316751e-03
   14  3.749567e-03     0.439571     5.788059e-04
   15  1.667284e-03     0.444661     2.573721e-04
   16  6.628488e-04     0.397562     1.023214e-04
   17  2.191902e-04     0.330679     3.383553e-05
   18  6.956669e-05     0.317380     1.073874e-05
   19  2.326769e-05     0.334466     3.591741e-06
   20  9.531555e-06     0.409648     1.471349e-06
   21  4.723107e-06     0.495523     7.290875e-07
   22  2.076413e-06     0.439629     3.205276e-07
   23  8.300910e-07     0.399772     1.281379e-07
   24  3.342506e-07     0.402667     5.159695e-08
   25  1.406351e-07     0.420748     2.170929e-08
   26  5.553424e-08     0.394882     8.572602e-09
   27  2.060141e-08     0.370968     3.180159e-09
   28  7.628654e-09     0.370298     1.177605e-09
   29  2.737018e-09     0.358781     4.225027e-10
   30  9.508941e-10     0.347420     1.467858e-10
   31  3.180126e-10     0.334435     4.909035e-11
   32  1.082480e-10     0.340389     1.670982e-11
   33  3.679295e-11     0.339895     5.679583e-12
   34  1.349101e-11     0.366674     2.082553e-12
   35  5.658319e-12     0.419414     8.734524e-13

Looking at the timing per iteration, AMGX is faster than the hypre 2.16.0 GPU implementation, but it takes much more time to solve the problem because of its convergence profile.
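To put a number on the difference in convergence profiles, the short Python sketch below computes the geometric-mean residual reduction per iteration from the totals printed in the two logs. The helper name `avg_rate` is hypothetical, and the formula is an assumption about how the average rate is defined; it does reproduce the 0.8156 that AMGX reports. Note that the hypre figure is taken in the `||r||_C` norm, so the comparison is only indicative.

```python
import math

def avg_rate(total_reduction: float, iterations: int) -> float:
    """Geometric-mean residual reduction factor per iteration."""
    return total_reduction ** (1.0 / iterations)

# AMGX log: "Total Reduction in Residual" over "Total Iterations".
amgx_rate = avg_rate(6.030048e-13, 138)

# hypre log: last ||r||_C/||b||_C entry and its iteration count.
hypre_rate = avg_rate(8.734524e-13, 35)

print(f"AMGX  average rate: {amgx_rate:.4f}")
print(f"hypre average rate: {hypre_rate:.4f}")

# Iterations each rate would need for a 1e-12 relative reduction.
for name, rate in (("AMGX", amgx_rate), ("hypre", hypre_rate)):
    needed = math.ceil(math.log(1e-12) / math.log(rate))
    print(f"{name}: about {needed} iterations")
```

The AMGX average is dominated by the first 40 or so iterations, where the residual grows, and by the stagnation between iterations 100 and 130; outside those stretches its per-iteration rate is roughly 0.55 to 0.65.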

The residual in the initial iteration caught my attention: the initial solution is set to 0 in both implementations, but the initial residual is quite different (6.142953 vs 1.017959e+05).

Do you have any advice for improving the convergence?

marsaev commented 5 years ago

I'm surprised at the results of AMG on an elasticity problem; I would expect better convergence.

The residual in the initial iteration caught my attention: the initial solution is set to 0 in both implementations, but the initial residual is quite different (6.142953 vs 1.017959e+05).

Right, that sounds a little bit off. Does HYPRE use the L2 norm too?

Do you have any advice for improving the convergence?

Note that your matrix is relatively small for a GPU to be effective, and it's hard to make comparisons for such GPU loads. Try using a solver on the coarsest level, either a direct solver or a Jacobi solver. Sometimes going over the grid multiple times is better than reducing the residual on each individual level, so aggressive coarsening may help: it should give smaller grids and make each cycle faster. It will likely increase the number of iterations, but the overall solve might still be faster. Alternatively, you might try a different smoother or more sweeps on each level with your configuration and see if more smoothing helps.
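As a rough sketch of how these suggestions map onto the posted configuration, the Python snippet below generates a few preconditioner variants to try one at a time. The keys are the ones already present in the original JSON; the specific values (`DENSE_LU_SOLVER` as the coarse solver, one aggressive level, two sweeps) are assumptions chosen for illustration and should be checked against the AMGX reference before use. The helper `variant` is hypothetical.

```python
import copy
import json

# Subset of the "preconditioner" block from the configuration posted above.
base_precond = {
    "solver": "AMG",
    "algorithm": "CLASSICAL",
    "cycle": "V",
    "coarse_solver": "NOSOLVER",
    "aggressive_levels": 0,
    "presweeps": 1,
    "postsweeps": 1,
    "smoother": {"solver": "BLOCK_JACOBI", "relaxation_factor": 0.7},
}

def variant(**overrides):
    """Return a copy of the base block with selected keys overridden."""
    cfg = copy.deepcopy(base_precond)
    cfg.update(overrides)
    return cfg

# One variant per suggestion; the values are assumptions, not tuned settings.
candidates = {
    "coarse_direct": variant(coarse_solver="DENSE_LU_SOLVER"),
    "aggressive": variant(aggressive_levels=1),
    "more_sweeps": variant(presweeps=2, postsweeps=2),
}

for name, cfg in candidates.items():
    print(name)
    print(json.dumps(cfg, indent=2))
```

Changing one setting per run keeps it clear which change, if any, affects the PCG iteration count.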

davidherreroperez commented 5 years ago

The residual in the initial iteration caught my attention: the initial solution is set to 0 in both implementations, but the initial residual is quite different (6.142953 vs 1.017959e+05).

Right, that sounds a little bit off. Does HYPRE use the L2 norm too?

Right: using the L2 norm in hypre, the residual is of the same order of magnitude as in AMGX, but hypre still runs fewer iterations.

Iters       ||r||_2    conv.rate  ||r||_2/||b||_2
    1  6.115007e+05     6.007122     6.007122e+00
    2  1.929486e+06     3.155330     1.895445e+01
    3  2.647599e+06     1.372178     2.600889e+01
    4  3.604505e+06     1.361424     3.540912e+01
    5  2.765375e+06     0.767200     2.716587e+01
    6  1.832435e+06     0.662635     1.800107e+01
    7  9.993769e+05     0.545382     9.817454e+00
    8  4.702297e+05     0.470523     4.619337e+00
    9  1.911626e+05     0.406530     1.877900e+00
   10  8.431356e+04     0.441057     8.282606e-01
   11  3.430956e+04     0.406928     3.370425e-01
   12  1.381077e+04     0.402534     1.356711e-01
   13  5.321191e+03     0.385293     5.227312e-02
   14  2.354299e+03     0.442438     2.312763e-02
   15  1.046448e+03     0.444484     1.027986e-02
   16  4.454500e+02     0.425678     4.375911e-03
   17  1.491778e+02     0.334892     1.465460e-03
   18  4.763654e+01     0.319327     4.679611e-04
   19  1.535605e+01     0.322359     1.508514e-04
   20  6.011378e+00     0.391466     5.905322e-05
   21  3.134943e+00     0.521502     3.079635e-05
   22  1.415178e+00     0.451421     1.390211e-05
   23  5.738508e-01     0.405497     5.637266e-06
   24  2.316464e-01     0.403670     2.275596e-06
   25  9.736563e-02     0.420320     9.564785e-07
   26  3.892466e-02     0.399778     3.823793e-07
   27  1.449054e-02     0.372272     1.423489e-07
   28  5.416901e-03     0.373823     5.321333e-08
   29  1.966882e-03     0.363101     1.932181e-08
   30  6.882711e-04     0.349930     6.761282e-09
   31  2.277901e-04     0.330960     2.237713e-09
   32  7.819669e-05     0.343284     7.681710e-10
   33  2.616625e-05     0.334621     2.570461e-10
   34  9.359429e-06     0.357691     9.194305e-11
   35  3.651801e-06     0.390173     3.587374e-11
   36  2.131717e-06     0.583744     2.094108e-11
   37  2.857893e-06     1.340653     2.807472e-11
   38  4.652273e-06     1.627868     4.570195e-11
   39  3.732831e-06     0.802367     3.666974e-11
   40  1.513505e-06     0.405458     1.486803e-11
   41  5.176708e-07     0.342034     5.085378e-12
   42  1.723921e-07     0.333015     1.693507e-12
   43  5.844971e-08     0.339051     5.741851e-13
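The gap between the two initial residuals appears to come down to which norm each log prints. The following Python check uses only figures from the hypre and AMGX logs in this thread. It assumes that `<C*b,b>` is the squared preconditioner-weighted norm of the right-hand side, which is an interpretation of the column labels rather than something stated in the logs.

```python
import math

# First hypre run: residuals are printed in the preconditioned norm ||.||_C.
cb_b = 4.196588e+01                 # printed as <C*b,b>
b_norm_C = math.sqrt(cb_b)          # assumed to be ||b||_C
rel_C_iter1 = 6.142953e+00 / b_norm_C
print(f"||r_1||_C / ||b||_C = {rel_C_iter1:.6f}  (log prints 9.482634e-01)")

# Second hypre run: residuals are printed in the L2 norm.
# Recover ||b||_2 from the first row: ||r_1||_2 / (||r_1||_2 / ||b||_2).
b_norm_2 = 6.115007e+05 / 6.007122e+00
print(f"||b||_2 = {b_norm_2:.6e}  (AMGX 'Ini' residual is 1.017959e+05)")
```

With a zero initial guess the initial residual equals the right-hand side, so both libraries start from the same `||b||_2`; the value 6.142953 in the first hypre log is measured in a different norm and is already the residual after one iteration.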

Do you have any advice for improving the convergence?

Note that your matrix is relatively small for a GPU to be effective, and it's hard to make comparisons for such GPU loads. Try using a solver on the coarsest level, either a direct solver or a Jacobi solver. Sometimes going over the grid multiple times is better than reducing the residual on each individual level, so aggressive coarsening may help: it should give smaller grids and make each cycle faster. It will likely increase the number of iterations, but the overall solve might still be faster. Alternatively, you might try a different smoother or more sweeps on each level with your configuration and see if more smoothing helps.

Thank you for your comments. I'm going to test these suggestions.

davidherreroperez commented 5 years ago

Hi marsaev,

I have run a lot of tests and was not able to reduce the number of PCG iterations by using a solver on the coarsest level of the AMG preconditioner. Using a different smoother or aggressive coarsening doesn't help either. The number of PCG iterations can be reduced by increasing the number of iterations of the AMG preconditioner, but only at a higher computational cost.

Looking at Table 3 of the paper

AMGX: A LIBRARY FOR GPU ACCELERATED ALGEBRAIC MULTIGRID AND PRECONDITIONED ITERATIVE METHODS, SIAM J. SCI. COMPUT. https://asc.ziti.uni-heidelberg.de/sites/default/files/research/papers/public/NaArCa_15AmgX.pdf

the number of CG-AMG iterations for HYPRE and AMGX is of a similar order of magnitude and doesn't increase meaningfully with the problem size. However, these are the results (wall clock time and iterations) that I obtain when increasing the problem size:

scantilever2d_01_it-dof.pdf scantilever2d_01_wct-dof.pdf

Could you please provide a similar JSON configuration file to test the CG-AMG mentioned in your paper?

jeaton32 commented 5 years ago

Hi David, because this is elasticity and not just a pressure system, there are more modes in the null space of the PDE. AmgX doesn't have an efficient elasticity solver for these kinds of problems yet. We need to add in Smoothed Aggregation AMG to enable that.
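To illustrate what "more modes in the null space" means, here is a small NumPy sketch that builds the three rigid body modes of 2D elasticity (two translations and one in-plane rotation) for a set of node coordinates. A scalar pressure (Poisson-type) problem has only the constant vector. The function name, the interleaved degree-of-freedom ordering, and the sample coordinates are assumptions for illustration; this is not an AMGX API.

```python
import numpy as np

def rigid_body_modes_2d(coords: np.ndarray) -> np.ndarray:
    """Rigid body modes for 2D elasticity.

    coords has shape (n, 2). The result has shape (2n, 3), with degrees of
    freedom interleaved as (ux0, uy0, ux1, uy1, ...).
    """
    n = coords.shape[0]
    x, y = coords[:, 0], coords[:, 1]
    modes = np.zeros((2 * n, 3))
    modes[0::2, 0] = 1.0   # translation in x: (ux, uy) = (1, 0)
    modes[1::2, 1] = 1.0   # translation in y: (ux, uy) = (0, 1)
    modes[0::2, 2] = -y    # infinitesimal rotation: ux = -y
    modes[1::2, 2] = x     #                         uy =  x
    return modes

# Four corner nodes of a unit square (hypothetical mesh).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = rigid_body_modes_2d(coords)
print("shape:", B.shape)
print("rank:", np.linalg.matrix_rank(B))
```

The rotation mode has zero small-strain tensor, since the shear component (d ux/dy + d uy/dx)/2 = (-1 + 1)/2 vanishes. A multigrid hierarchy whose interpolation reproduces only constants per component will therefore miss it, which is why smoothed aggregation methods are usually given these vectors as input.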

marsaev commented 3 years ago

Closing, as this is out of scope for AMGX.