ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

Using null-space vectors with block relaxation #215

Closed dokotor closed 2 years ago

dokotor commented 2 years ago

Hi, I use amgcl for large linear elasticity models. The problems only use translational degrees of freedom (3 degrees of freedom per node). Models tend to solve fastest with 3x3 block aggregation and ILU relaxation. These settings fail to converge for some models, and providing null-space vectors generally seems to help with convergence.

When using null-space vectors, the overall calculation time increases by a lot. Looking at the profiler, the relaxation step seems to take much more time. Is there any possibility of adjusting (or editing in the code) the ILU relaxation so that it behaves similarly to the case when the block solver is run? If I understand correctly, that would give me the best of both routes: coarsening with null-space vectors (better convergence) and block relaxation (faster calculation).
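
For context, this is roughly how I configure the two routes. This is only a sketch following the amgcl tutorials; the matrix assembly and the node-coordinate array `coo` are assumptions on my side, and I use rigid-body modes as the null-space vectors:

```cpp
#include <cstddef>
#include <vector>
#include <tuple>

#include <amgcl/backend/builtin.hpp>
#include <amgcl/value_type/static_matrix.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/adapter/block_matrix.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/aggregation.hpp>
#include <amgcl/coarsening/rigid_body_modes.hpp>
#include <amgcl/relaxation/ilu0.hpp>
#include <amgcl/solver/cg.hpp>

// Route 1: scalar backend + null-space vectors (better convergence, slower).
typedef amgcl::backend::builtin<double> SBackend;
typedef amgcl::make_solver<
    amgcl::amg<SBackend, amgcl::coarsening::aggregation, amgcl::relaxation::ilu0>,
    amgcl::solver::cg<SBackend>
    > ScalarSolver;

// Route 2: 3x3 block backend, no null-space vectors (faster, sometimes stalls).
typedef amgcl::static_matrix<double, 3, 3> dmat_type;
typedef amgcl::backend::builtin<dmat_type> BBackend;
typedef amgcl::make_solver<
    amgcl::amg<BBackend, amgcl::coarsening::aggregation, amgcl::relaxation::ilu0>,
    amgcl::solver::cg<BBackend>
    > BlockSolver;

void setup(std::size_t n,
        const std::vector<std::ptrdiff_t> &ptr, const std::vector<std::ptrdiff_t> &col,
        const std::vector<double> &val,
        const std::vector<double> &coo /* node coordinates, 3 per node */)
{
    auto A = std::tie(n, ptr, col, val);

    // Route 1: pass the rigid-body modes as near null-space vectors.
    ScalarSolver::params sprm;
    sprm.precond.coarsening.nullspace.cols = amgcl::coarsening::rigid_body_modes(
            3, coo, sprm.precond.coarsening.nullspace.B);
    ScalarSolver scalar_solve(A, sprm);

    // Route 2: view the same CSR arrays as a matrix of 3x3 blocks.
    BlockSolver block_solve(amgcl::adapter::block_matrix<dmat_type>(A));
}
```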

Here is the output from both runs.

With null-space vectors (scalar backend):

Solver: CG, preconditioner: AMG, coarsener: Aggregation, smoother: ILU0
Solver
======
Type:             CG
Unknowns:         1772994
Memory footprint: 54.11 M
Preconditioner
==============
Number of levels:    4
Operator complexity: 1.11
Grid complexity:     1.08
Memory footprint:    3.70 G
level     unknowns       nonzeros      memory
---------------------------------------------
    0      1772994      136267422      3.36 G (90.43%)
    1       127518       13873716    340.45 M ( 9.21%)
    2         6288         539856     13.49 M ( 0.36%)
    3          294          11700    122.57 K ( 0.01%)
Convergence status: true
Number of iterations run: 59
Residual error: 7.16936e-07
[Profile:                 18.801 s] (100.00%)
[ self:                    8.342 s] ( 44.37%)
[  CSR copy:               0.468 s] (  2.49%)
[  axpby:                  0.011 s] (  0.06%)
[  clear:                  0.021 s] (  0.11%)
[  coarse:                 1.530 s] (  8.14%)
[  coarse operator:        0.699 s] (  3.72%)
[  coarsest level:         0.001 s] (  0.01%)
[  copy:                   0.009 s] (  0.05%)
[  inner_product:          1.639 s] (  8.72%)
[  move to backend:        1.124 s] (  5.98%)
[  relax:                  0.114 s] (  0.61%)
[    axpby:                0.046 s] (  0.24%)
[    residual:             0.033 s] (  0.18%)
[    vmul:                 0.035 s] (  0.18%)
[  relaxation:             3.681 s] ( 19.58%)
[  residual:               0.023 s] (  0.12%)
[  spmv:                   0.049 s] (  0.26%)
[  transfer operators:     1.089 s] (  5.79%)
[   self:                  0.104 s] (  0.55%)
[    aggregates:           0.405 s] (  2.15%)
[    interpolation:        0.580 s] (  3.09%)
[      tentative:            0.571 s] (  3.04%)
With the 3x3 block backend:

Solver: CG, preconditioner: AMG, coarsener: Aggregation, smoother: ILU0
Solver
======
Type:             CG
Unknowns:         590998
Memory footprint: 54.11 M
Preconditioner
==============
Number of levels:    4
Operator complexity: 1.03
Grid complexity:     1.04
Memory footprint:    1.37 G
level     unknowns       nonzeros      memory
---------------------------------------------
    0       590998       15142730      1.34 G (97.35%)
    1        22107         385465     35.99 M ( 2.48%)
    2         1774          25390      2.42 M ( 0.16%)
    3          192           1974    339.36 K ( 0.01%)
Convergence status: true
Number of iterations run: 118
Residual error: 8.33269e-07
[Profile:                 12.139 s] (100.00%)
[ self:                    5.816 s] ( 47.91%)
[  CSR copy:               0.574 s] (  4.73%)
[  axpby:                  0.011 s] (  0.09%)
[  clear:                  0.006 s] (  0.05%)
[  coarse:                 1.623 s] ( 13.37%)
[  coarse operator:        0.146 s] (  1.20%)
[  coarsest level:         0.004 s] (  0.03%)
[  copy:                   0.006 s] (  0.05%)
[  inner_product:          1.691 s] ( 13.93%)
[  move to backend:        0.506 s] (  4.17%)
[  relax:                  0.135 s] (  1.11%)
[    axpby:                0.034 s] (  0.28%)
[    residual:             0.058 s] (  0.48%)
[    vmul:                 0.042 s] (  0.34%)
[  relaxation:             1.307 s] ( 10.77%)
[  residual:               0.023 s] (  0.19%)
[  spmv:                   0.065 s] (  0.54%)
[  transfer operators:     0.225 s] (  1.85%)
[    aggregates:           0.208 s] (  1.71%)
[    interpolation:        0.005 s] (  0.04%)
[      tentative:          0.004 s] (  0.03%)

On an unrelated note, I am also struggling with smoothed aggregation with null-space vectors. It takes much longer to compute than standard aggregation, which seems counter-intuitive.

ddemidov commented 2 years ago

Using null-space vectors usually results in a heavier hierarchy (the matrices contain more unknowns and more non-zero elements). It looks like you are working with 3D models, so you are probably passing 6 null-space vectors. This means that each aggregate on the fine grid results in 6 coarse-grid unknowns. Using 3x3 blocks should be equivalent to 3 null-space vectors (only displacements, without rotations), so the resulting coarse grid is half as heavy.

This is confirmed by your logs: when null-space vectors are used, the first coarse grid contains 127518 DOFs, as opposed to 22107x3 = 66321 DOFs in the case when 3x3 blocks are used.

Also, the block-wise CSR format is simply more efficient storage-wise than the scalar one: your top-level matrix takes 3.36 GB with the scalar backend and 1.34 GB with the block-wise one. Since most operations in amgcl are memory-bound, this should be the largest performance hit: the sparse matrix-vector product (SpMV) is the most important operation, and it takes proportionally more time with a heavier matrix. ILU0 relaxation is also based on a series of SpMVs.
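
As a back-of-the-envelope illustration of the memory-bound argument (assuming 8-byte values and 8-byte column indices; amgcl's real accounting also includes row pointers, the ILU factors, and so on):

```cpp
#include <cstdio>

int main() {
    const double value_bytes = 8, index_bytes = 8; // sizeof(double), sizeof(ptrdiff_t)
    // Scalar CSR stores one column index per scalar value; block CSR stores
    // one column index per 3x3 block, i.e. per 9 scalar values.
    double scalar_csr = value_bytes + index_bytes;
    double block_csr  = value_bytes + index_bytes / 9;
    std::printf("scalar CSR : %.1f bytes per scalar nonzero\n", scalar_csr);
    std::printf("3x3 blocks : %.1f bytes per scalar nonzero (%.1fx less traffic)\n",
            block_csr, scalar_csr / block_csr);
}
```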

In other words, the block-wise backend is simply more efficient, but it lacks some of the information contained in the null-space vectors, so it sometimes fails to converge. I guess the resulting matrices still have a block-wise structure, so it should be possible to represent them using block CSR, but I am not sure yet how hard it would be to implement.

Re smoothed aggregation with null-space vectors: smoothed aggregation is standard aggregation followed by smoothing of the tentative prolongation (basically a matrix-matrix product), so it does take more time to compute and results in heavier transfer operators. If you mean it takes more iterations to converge, and you are using the block-wise backend, it sometimes helps to set the precond.coarsening.aggr.eps_strong parameter to 0.
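
For reference, a minimal sketch of setting that parameter with the runtime (property-tree) interface; the helper function name below is just for illustration, and with compile-time parameters the same knob is precond.coarsening.aggr.eps_strong on the solver's params struct:

```cpp
#include <boost/property_tree/ptree.hpp>

// Disable the strength-of-connection filtering in the aggregation,
// as suggested above.
inline void disable_strong_coupling_filter(boost::property_tree::ptree &prm) {
    prm.put("precond.coarsening.aggr.eps_strong", 0.0f);
}
```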

ddemidov commented 2 years ago

Would you be able to share the system matrix, RHS, and the null-space vectors for your problem? It would be interesting to try to allow using null-space vectors with the block-wise backends.

ddemidov commented 2 years ago

There is an initial test of the builtin-hybrid backend in the backend-builtin-hybrid branch. There, the AMG hierarchy is constructed using scalar value types, and the matrices are converted to the block-wise format when moved to the backend. This should improve the solve efficiency when null-space vectors are used.

It is assumed that both the initial matrix and all the matrices at the coarser levels can be represented with a single block size. This should be the case for 3D elasticity problems: the initial matrix has a 3x3 block structure, and using 6 null-space vectors means that the coarser-level matrices have a 6x6 block structure, which can still be represented by 3x3 blocks.

For now, the backend does not provide a row iterator for the matrices (because the matrices are block-valued and the vectors are scalar-valued), so Gauss-Seidel relaxation is not supported, and ILU-type relaxations are approximated with Jacobi iterations, similarly to the GPU backends.
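
Roughly, the backend from the branch is used like this (a sketch; the header and template names reflect the current state of the branch and may still change):

```cpp
#include <amgcl/backend/builtin_hybrid.hpp>
#include <amgcl/value_type/static_matrix.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/aggregation.hpp>
#include <amgcl/relaxation/ilu0.hpp>
#include <amgcl/solver/cg.hpp>

// The hierarchy is built with scalar values (so the null-space vectors are
// handled as usual), and the matrices are converted to 3x3 block format when
// they are moved to the backend.
typedef amgcl::static_matrix<double, 3, 3> dmat_type;
typedef amgcl::backend::builtin_hybrid<dmat_type> Backend;

typedef amgcl::make_solver<
    amgcl::amg<Backend, amgcl::coarsening::aggregation, amgcl::relaxation::ilu0>,
    amgcl::solver::cg<Backend>
    > Solver;
```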

Consider the following example from Tutorial/Using near null-space vectors. The reduced memory footprint and solve time are the main things to note:

Standard scalar backend

solver -B -A A.bin -f b.bin -C C.bin precond.coarsening.aggr.eps_strong=0
Solver
======
Type:             BiCGStab
Unknowns:         81657
Memory footprint: 4.36 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.52
Grid complexity:     1.10
Memory footprint:    132.15 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        81657        3171111    102.70 M (65.77%)
    1         7704        1640736     29.33 M (34.03%)
    2          144           9576    122.07 K ( 0.20%)

Iterations: 38
Error:      9.56397e-09

[Profile:       1.818 s] (100.00%)
[  reading:     0.035 s] (  1.95%)
[  setup:       0.348 s] ( 19.15%)
[  solve:       1.433 s] ( 78.84%)

Hybrid backend

solver_hybrid -B -A A.bin -f b.bin -C C.bin precond.coarsening.aggr.eps_strong=0 -b3
Solver
======
Type:             BiCGStab
Unknowns:         81657
Memory footprint: 4.36 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.52
Grid complexity:     1.10
Memory footprint:    74.37 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        81657        3171111     57.88 M (65.77%)
    1         7704        1640736     16.37 M (34.03%)
    2          144           9576    122.07 K ( 0.20%)

Iterations: 38
Error:      9.46914e-09

[Profile:       1.353 s] (100.00%)
[  reading:     0.037 s] (  2.72%)
[  setup:       0.367 s] ( 27.15%)
[  solve:       0.948 s] ( 70.09%)
dokotor commented 2 years ago

Wow, you are so fast! I was supposed to gather the models and post them for you today, but you beat me to it.

First let me thank you for your in-depth explanation. It's well put and I understand it much better now. Your latest response with the description of the implementation is also very clear.

That is exactly the kind of thing I was thinking of. I will check out your branch and post the updated results this week. As I understand from your description, this will also work with the vexcl backend?

Also, the minimal changes you had to introduce to implement such a feature demonstrate the extensibility and customizability of the AMGCL library. Great job.

ddemidov commented 2 years ago

So far this is only an extension of the builtin backend, but a similar approach should work for vexcl.

ddemidov commented 2 years ago

99b037f084dbda7b0fcf9be46313fe1f9bb88d70 adds support for the hybrid vexcl backend. The same example as above:

Scalar VexCL

solver_vexcl_cuda -B -A A.bin -f b.bin -C C.bin precond.coarsening.aggr.eps_strong=0
1. NVIDIA GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         81657
Memory footprint: 4.36 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.52
Grid complexity:     1.10
Memory footprint:    132.16 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        81657        3171111    102.70 M (65.77%)
    1         7704        1640736     29.33 M (34.03%)
    2          144           9576    124.32 K ( 0.20%)

Iterations: 38
Error:      9.40462e-09

[Profile:       0.967 s] (100.00%)
[ self:         0.083 s] (  8.62%)
[  reading:     0.031 s] (  3.21%)
[  setup:       0.407 s] ( 42.08%)
[  solve:       0.446 s] ( 46.10%)

Hybrid VexCL

solver_hybrid_vexcl_cuda -B -A A.bin -f b.bin -C C.bin precond.coarsening.aggr.eps_strong=0 -b3
1. NVIDIA GeForce GTX 1050 Ti

Solver
======
Type:             BiCGStab
Unknowns:         81657
Memory footprint: 4.36 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.52
Grid complexity:     1.10
Memory footprint:    74.37 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        81657        3171111     57.88 M (65.77%)
    1         7704        1640736     16.37 M (34.03%)
    2          144           9576    124.32 K ( 0.20%)

Iterations: 38
Error:      9.43085e-09

[Profile:       0.809 s] (100.00%)
[ self:         0.089 s] ( 10.97%)
[  reading:     0.035 s] (  4.39%)
[  setup:       0.407 s] ( 50.29%)
[  solve:       0.278 s] ( 34.35%)
dokotor commented 2 years ago

Your hybrid backend works perfectly. I have tested both the CPU and the VexCL backends, and both worked out of the box. Below are the CPU results from another model.

Scalar CPU

Method: AMGCL Scalar OpenMP Iterative
Solver: CG, preconditioner: AMG
Coarsening: Aggregation (with null-space vectors), relaxation: Chebyshev
Solver
======
Type:             CG
Unknowns:         763776
Memory footprint: 23.31 M
Preconditioner
==============
Number of levels:    3
Operator complexity: 1.15
Grid complexity:     1.08
Memory footprint:    1.22 G
level     unknowns       nonzeros      memory
---------------------------------------------
    0       763776       59541004      1.06 G (87.10%)
    1        59430        8551332    144.56 M (12.51%)
    2         2328         267912     15.88 M ( 0.39%)
Convergence status: true
Number of iterations run: 56
Residual error: 7.88856e-07
[Profile:                 22.396 s] (100.00%)
[ self:                    6.334 s] ( 28.28%)
[  CSR copy:               0.104 s] (  0.47%)
[  axpby:                  0.095 s] (  0.42%)
[  clear:                  0.005 s] (  0.02%)
[  coarse:                 0.162 s] (  0.72%)
[  coarse operator:        0.404 s] (  1.80%)
[  coarsest level:         0.579 s] (  2.59%)
[  copy:                   0.000 s] (  0.00%)
[  inner_product:          0.108 s] (  0.48%)
[  move to backend:        0.002 s] (  0.01%)
[  relax:                  9.393 s] ( 41.94%)
[    axpby:                0.259 s] (  1.16%)
[    residual:             9.133 s] ( 40.78%)
[  relaxation:             0.039 s] (  0.17%)
[    spectral radius:      0.038 s] (  0.17%)
[  residual:               2.320 s] ( 10.36%)
[  spmv:                   2.358 s] ( 10.53%)
[  transfer operators:     0.494 s] (  2.21%)
[   self:                  0.055 s] (  0.25%)
[    aggregates:           0.231 s] (  1.03%)
[    interpolation:        0.208 s] (  0.93%)
[      tentative:          0.206 s] (  0.92%)

Hybrid CPU

Method: AMGCL Hybrid OpenMP Iterative
Solver: CG, preconditioner: AMG
Coarsening: Aggregation (with null-space vectors), relaxation: Chebyshev
Solver
======
Type:             CG
Unknowns:         763776
Memory footprint: 23.31 M
Preconditioner
==============
Number of levels:    3
Operator complexity: 1.15
Grid complexity:     1.08
Memory footprint:    716.45 M
level     unknowns       nonzeros      memory
---------------------------------------------
    0       763776       59541004    619.38 M (87.10%)
    1        59430        8551332     81.19 M (12.51%)
    2         2328         267912     15.88 M ( 0.39%)
Convergence status: true
Number of iterations run: 56
Residual error: 8.06027e-07
[Profile:                 18.560 s] (100.00%)
[ self:                    6.211 s] ( 33.46%)
[  CSR copy:               0.102 s] (  0.55%)
[  axpby:                  0.093 s] (  0.50%)
[  clear:                  0.004 s] (  0.02%)
[  coarse:                 0.167 s] (  0.90%)
[  coarse operator:        0.376 s] (  2.03%)
[  coarsest level:         0.564 s] (  3.04%)
[  copy:                   0.000 s] (  0.00%)
[  inner_product:          0.109 s] (  0.59%)
[  move to backend:        0.201 s] (  1.08%)
[    CSR copy:             0.199 s] (  1.07%)
[  relax:                  6.784 s] ( 36.55%)
[    axpby:                0.254 s] (  1.37%)
[    residual:             6.529 s] ( 35.18%)
[      residual:           6.528 s] ( 35.17%)
[  relaxation:             0.035 s] (  0.19%)
[    spectral radius:      0.034 s] (  0.18%)
[  residual:               1.709 s] (  9.21%)
[    residual:             1.709 s] (  9.21%)
[  spmv:                   1.730 s] (  9.32%)
[    spmv:                 1.730 s] (  9.32%)
[  transfer operators:     0.474 s] (  2.55%)
[   self:                  0.055 s] (  0.30%)
[    aggregates:           0.219 s] (  1.18%)
[    interpolation:        0.200 s] (  1.08%)
[      tentative:          0.199 s] (  1.07%)

For this model the solve time dropped from 14 s to 10 s, while the setup time is unchanged. It's really impressive.

Thanks a lot for your help. I don't know what you think, but perhaps after extending this to other backends, this could become a permanent feature in AMGCL?

ddemidov commented 2 years ago

I've merged the branch into master. The builtin and vexcl backends are the only ones where this will work anyway.

ddemidov commented 2 years ago

I have added the corresponding section to the Using near null-space vectors tutorial, referencing this issue.

dokotor commented 2 years ago

Cool, thanks for mentioning me in the documentation and referencing this issue.

ddemidov commented 2 years ago

Another possible approach is to use the new coarsening::as_scalar wrapper from fe47a96. This way, the complete hierarchy is built using the block value type, but the coarsening algorithm converts its input matrix to the scalar format before processing. The advantage is that something like an ILU relaxation is now built in the block space, which means it is faster and uses less memory. A usage example: https://gist.github.com/ddemidov/79939b5db87957cdffb63f1deb0c75ee
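
Schematically, the solver definition looks like this (a sketch along the lines of the gist; the exact code there may differ slightly):

```cpp
#include <amgcl/backend/builtin.hpp>
#include <amgcl/value_type/static_matrix.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/aggregation.hpp>
#include <amgcl/coarsening/as_scalar.hpp>
#include <amgcl/relaxation/ilu0.hpp>
#include <amgcl/solver/cg.hpp>

// The whole hierarchy (including the ILU factors) uses the 3x3 block value
// type; only the coarsening sees a scalar view of the matrix, so the
// null-space vectors can still be used.
typedef amgcl::static_matrix<double, 3, 3> dmat_type;
typedef amgcl::backend::builtin<dmat_type> Backend;

typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::as_scalar<amgcl::coarsening::aggregation>::type,
        amgcl::relaxation::ilu0
        >,
    amgcl::solver::cg<Backend>
    > Solver;
```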

The results for the builtin backend (same system as above):

$ ./block 
Matrix: 81657x81657
RHS: 81657x1
Coords: 27219x3
Solver
======
Type:             CG
Unknowns:         27219
Memory footprint: 2.49 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.52
Grid complexity:     1.10
Memory footprint:    114.19 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        27219         352371     83.75 M (65.77%)
    1         2568         182304     30.32 M (34.03%)
    2           48           1064    121.32 K ( 0.20%)

Iters: 29
Error: 5.92119e-09

[Nullspace:     3.052 s] (100.00%)
[  read:        2.213 s] ( 72.51%)
[  setup:       0.312 s] ( 10.22%)
[  solve:       0.524 s] ( 17.18%)
dokotor commented 2 years ago

Sorry, but I didn't have time earlier to try your newest changes. I am actually having problems compiling the latest amgcl. I am using MSVC 2022, and there is an error in the ilu_solve class specialized for the hybrid backend:

amgcl/relaxation/detail/ilu_solve.hpp(464,9): error C3210: 'ilu_solve<amgcl::backend::builtin<double,__int64,__int64> >': a member using-declaration can only be applied to a base class member
amgcl/relaxation/ilu0.hpp(68): message : see reference to class template instantiation 'amgcl::relaxation::detail::ilu_solve<Backend>' being compiled
ddemidov commented 2 years ago

Does 4efd828147165715aad830129d1027e9a8c67785 help?

ddemidov commented 2 years ago

See the comparison of the two methods (hybrid backend vs block backend with coarsening wrapped into as_scalar) here: https://arxiv.org/pdf/2202.09056.pdf.

dokotor commented 2 years ago

Wow, you have already published a paper on the subject; that was fast. I am getting some other errors now:

amgcl/backend/builtin.hpp(1070,1): error C2446: ':': no conversion from 'amgcl::static_matrix<double,3,3>' to 'int' 
amgcl/backend/builtin.hpp(1070,48): message : No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
amgcl/backend/builtin.hpp(1069): message : while compiling class template member function 'size_t amgcl::backend::nonzeros_impl<Matrix,void>::get(const amgcl::backend::crs<double,amgcl::static_matrix<double,3,3>,amgcl::static_matrix<double,3,3>> &)'
amgcl/backend/interface.hpp(292): message : see reference to function template instantiation 'size_t amgcl::backend::nonzeros_impl<Matrix,void>::get(const amgcl::backend::crs<double,amgcl::static_matrix<double,3,3>,amgcl::static_matrix<double,3,3>> &)' being compiled
amgcl/backend/interface.hpp(292): message : see reference to class template instantiation 'amgcl::backend::nonzeros_impl<Matrix,void>' being compiled
amgcl/preconditioner/dummy.hpp(97): message : see reference to function template instantiation 'size_t amgcl::backend::nonzeros<amgcl::backend::crs<double,amgcl::static_matrix<double,3,3>,amgcl::static_matrix<double,3,3>>>(const Matrix &)' being compiled
ddemidov commented 2 years ago

The hybrid backend used to require both the block and the scalar value types. Later I realized that the scalar type can be derived from the block one, so I removed the scalar value type template parameter. It looks like you are still specifying both. See the working example here: https://github.com/ddemidov/block_nullspace_benchmarks/blob/master/ns_hybrid2.cpp#L38
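
In other words, the backend typedef should change roughly like this (a sketch; the remaining template parameters are left at their defaults here):

```cpp
#include <amgcl/backend/vexcl.hpp>            // requires VexCL
#include <amgcl/value_type/static_matrix.hpp>

typedef amgcl::static_matrix<double, 3, 3> dmat_type;

// Old form, with both the scalar and the block value type -- this is what
// triggers the error above:
//   typedef amgcl::backend::vexcl_hybrid<double, dmat_type> Backend;

// Current form: only the block value type is given, the scalar type is
// derived from it:
typedef amgcl::backend::vexcl_hybrid<dmat_type> Backend;
```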

dokotor commented 2 years ago

I am definitely getting closer...

amgcl/backend/vexcl.hpp(288,1): error C2440: 'initializing': cannot convert from 'amgcl::static_matrix<double,3,3>' to 'size_t'
amgcl/backend/vexcl.hpp(288,35): message : No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
amgcl/backend/vexcl.hpp(281): message : while compiling class template member function 'std::shared_ptr<vex::sparse::distributed<vex::sparse::matrix<double,amgcl::solver::vexcl_skyline_lu<double>,amgcl::solver::vexcl_skyline_lu<double>>,double>> amgcl::backend::vexcl_hybrid<double,amgcl::static_matrix<double,3,3>,ColumnType,amgcl::solver::vexcl_skyline_lu<double>>::copy_matrix(std::shared_ptr<amgcl::backend::crs<double,amgcl::static_matrix<double,3,3>,amgcl::static_matrix<double,3,3>>>,const amgcl::backend::vexcl<double,DirectSolver,amgcl::solver::vexcl_skyline_lu<double>,DirectSolver>::params &)'
amgcl/amg.hpp(416): message : see reference to function template instantiation 'std::shared_ptr<vex::sparse::distributed<vex::sparse::matrix<double,amgcl::solver::vexcl_skyline_lu<double>,amgcl::solver::vexcl_skyline_lu<double>>,double>> amgcl::backend::vexcl_hybrid<double,amgcl::static_matrix<double,3,3>,ColumnType,amgcl::solver::vexcl_skyline_lu<double>>::copy_matrix(std::shared_ptr<amgcl::backend::crs<double,amgcl::static_matrix<double,3,3>,amgcl::static_matrix<double,3,3>>>,const amgcl::backend::vexcl<double,DirectSolver,amgcl::solver::vexcl_skyline_lu<double>,DirectSolver>::params &)' being compiled
amgcl/preconditioner/runtime.hpp(102): message : see reference to class template instantiation 'amgcl::backend::vexcl_hybrid<double,amgcl::static_matrix<double,3,3>,ColumnType,amgcl::solver::vexcl_skyline_lu<double>>' being compiled

...and also starting to feel I am unlucky.

ddemidov commented 2 years ago

This looks to be the same error, but for the vexcl_hybrid backend. Can you check the template parameters there?

dokotor commented 2 years ago

You are right, but I am still getting a ton of other errors... I think Friday afternoon is not a good time to try out new things. I will get back to the topic next week and update you on the results :) I really want to test this new as_scalar coarsening.

ddemidov commented 2 years ago

Can you extract your code into a small compilable example (or rather almost compilable)?

dokotor commented 2 years ago

Sure, I will do it next week.

ddemidov commented 2 years ago

I've added the vexcl examples in https://github.com/ddemidov/block_nullspace_benchmarks/commit/27676874d0985e1bceb43157fe12f2c91815aecf and fixed a couple of hybrid vexcl backend related errors in 26d71e5884d552cdf588234ac72b8ba7b33ebb2e. Hope it works now.