ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

Got NaN on AMG-CG #181

Closed · MakotoChiba closed this issue 3 years ago

MakotoChiba commented 3 years ago

I get NaN in some situations when I use the cuda backend with an amg preconditioner (spai0 relaxation) + cg solver. Even if I use the builtin backend with gauss_seidel or other relaxations, I hit the same problem. (The builtin backend and the cuda backend are almost the same speed in my environment, by the way...) Only spai1 can solve the problem, but it is really slow.

The cuda backend solver is:

```cpp
typedef amgcl::backend::cuda<double> Backend;
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::cg<Backend>
    > Solver;
```

And the builtin backend solver is:

```cpp
typedef amgcl::backend::builtin<double> Backend;
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::gauss_seidel
        >,
    amgcl::solver::cg<Backend>
    > Solver;
```

Any good ideas for solving this problem? I have attached the matrix and rhs written out with mm_write(). Thanks.

matrix.zip

ddemidov commented 3 years ago

Did you try bicgstab or gmres instead of cg?
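
For reference, switching the iterative solver is just a change in the typedef above; a minimal sketch of the bicgstab variant (assuming the builtin backend with double precision, as in the first comment):

```cpp
#include <amgcl/backend/builtin.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstab.hpp>

typedef amgcl::backend::builtin<double> Backend;
typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<Backend> // was amgcl::solver::cg<Backend>
    > Solver;
```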

MakotoChiba commented 3 years ago

Yes, I tried both solvers, but they also produce NaN. I also tried Eigen (standalone, not through AMGCL) with ConjugateGradient + IncompleteCholesky, and it could solve the system.

ddemidov commented 3 years ago

Looks like the system matrix is singular:

```
solver -A matrix_sparse.txt -f matrix_dense.txt -p precond.coarse_enough=100000
terminate called after throwing an instance of 'std::runtime_error'
  what():  Zero sum in skyline_lu factorization
Aborted (core dumped)
```

(precond.coarse_enough=100000 makes the coarsest level big enough that the direct solver is applied to the whole system; it fails with an error that usually means the matrix is singular.)
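
For reference, the same option can be set programmatically; a minimal sketch, assuming the make_solver-based Solver typedef from the first comment and an assembled system matrix A:

```cpp
Solver::params prm;
// Stop coarsening once a level has <= 100000 unknowns, so here the
// direct solver (skyline_lu) is applied to the whole 60579-unknown system:
prm.precond.coarse_enough = 100000;
Solver solve(A, prm);
```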

ddemidov commented 3 years ago

I am able to solve this with chebyshev relaxation though:

```
solver -A matrix_sparse.txt -f matrix_dense.txt precond.relax.type=chebyshev
Solver
======
Type:             BiCGStab
Unknowns:         60579
Memory footprint: 3.24 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    22.04 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        60579         409271     16.00 M (61.61%)
    1         7202         216384      4.58 M (32.57%)
    2          713          38613      1.47 M ( 5.81%)

Iterations: 14
Error:      8.49983e-09

[Profile:       0.583 s] (100.00%)
[  reading:     0.327 s] ( 56.07%)
[  setup:       0.034 s] (  5.86%)
[  solve:       0.221 s] ( 37.98%)
```

ddemidov commented 3 years ago

damped_jacobi works even better:

```
solver -A matrix_sparse.txt -f matrix_dense.txt precond.relax.type=damped_jacobi
Solver
======
Type:             BiCGStab
Unknowns:         60579
Memory footprint: 3.24 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    21.52 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        60579         409271     15.53 M (61.61%)
    1         7202         216384      4.52 M (32.57%)
    2          713          38613      1.47 M ( 5.81%)

Iterations: 9
Error:      3.57144e-09

[Profile:       0.411 s] (100.00%)
[ self:         0.001 s] (  0.13%)
[  reading:     0.314 s] ( 76.36%)
[  setup:       0.036 s] (  8.86%)
[  solve:       0.060 s] ( 14.65%)
```
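
For reference, selecting damped_jacobi at compile time is a one-line change in the typedef from the first comment (a sketch; chebyshev works the same way via <amgcl/relaxation/chebyshev.hpp>):

```cpp
#include <amgcl/relaxation/damped_jacobi.hpp>

typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::damped_jacobi // was spai0 / gauss_seidel
        >,
    amgcl::solver::cg<Backend>
    > Solver;
```
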
MakotoChiba commented 3 years ago

Thank you very much. I can solve it with damped_jacobi. But I have a question:

> Looks like the system matrix is singular

What does this mean, and how can I avoid (or fix) it?

Also, the builtin and cuda backends run at almost the same speed (on a Quadro P2000), and Eigen's CG solver is about 20% faster than amgcl (spai0 was slightly faster than Eigen's CG). Do you have any ideas for optimizing or improving the speed?

ddemidov commented 3 years ago

> Looks like the system matrix is singular: What does this mean, and how can I avoid (or fix) it?

That means the matrix does not have an inverse, although I am not sure if this is the case here. This may be unavoidable, depending on the problem you are trying to solve. For example, discretization of a Poisson problem with Neumann boundary conditions usually results in a singular matrix.
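
To make this concrete: for a 1D Poisson problem with pure Neumann boundary conditions, every row of the discretized matrix sums to zero, so the constant vector is in the null space and the matrix cannot be inverted. For three unknowns:

```latex
A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix},
\qquad
A \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.
```

The solution is then only determined up to an additive constant, which is why a direct factorization (such as skyline_lu above) can break down.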

> Also, the builtin and cuda backends run at almost the same speed (on a Quadro P2000), and Eigen's CG solver is about 20% faster than amgcl. Do you have any ideas for optimizing or improving the speed?

Make sure you are compiling with full optimization (Release mode in VS). Also, the system is probably too small to hide the overhead of using a GPU and thus get a reasonable acceleration from the cuda backend; you may see better results with larger systems. Finally, make sure you are not measuring the first run with the cuda backend: cuda caches the compiled kernels, and subsequent runs are usually much faster without the compilation overhead.

MakotoChiba commented 3 years ago

Thank you for the explanation. Of course, my build for the speed test is in Release mode. Here are the benchmark results. My target problem is Navier-Stokes, evolved for 300 time steps in the benchmark. Only the linear solver part differs between the two builds; both use amg + damped_jacobi + cg.

Total evolution time:

- amg + cg, cuda backend: 792.39 sec
- amg + cg, builtin backend: 855.11 sec

Linear solver time for the last evolution step (matrix size 10,031,868 x 10,031,868):

- amg + cg, cuda backend: 2.14 sec
- amg + cg, builtin backend: 2.97 sec

When the system size is small, both backends run at almost the same speed; this is because memory transfer to the GPU is the bottleneck for cuda. But the last system should be large enough, I think, and the difference is still only about 38%. Is this expected with amgcl?

ddemidov commented 3 years ago

Can you provide separate times for the method setup (the amgcl constructor) and the solve? The setup for the cuda backend is more expensive than the setup for the builtin backend, and may dominate the total time if the solve itself is fast enough.

Below are results on my system. Note that the setup for the cuda backend is twice as expensive for such a small matrix, while the solve is only about twice as fast (it should be closer to 4x on my machine). Overall, the total time is approximately the same.
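
For example, a minimal way to split the two phases (a sketch using std::chrono; amgcl's own profiler, mentioned later in the thread, works as well), assuming the Solver typedef and the inputs A, rhs, x are already set up for the chosen backend:

```cpp
#include <chrono>
#include <iostream>
#include <tuple>

// For the cuda backend, discard the very first run to exclude
// the one-time kernel compilation overhead.

auto t0 = std::chrono::steady_clock::now();
Solver solve(A);                        // setup: builds the AMG hierarchy
                                        // (and transfers it to the GPU for cuda)
auto t1 = std::chrono::steady_clock::now();

size_t iters;
double error;
std::tie(iters, error) = solve(rhs, x); // solve: the iterative part
auto t2 = std::chrono::steady_clock::now();

std::cout << "setup: " << std::chrono::duration<double>(t1 - t0).count() << " s, "
          << "solve: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
```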

builtin

```
solver -A matrix_sparse.txt -f matrix_dense.txt solver.type=cg precond.relax.type=damped_jacobi
Solver
======
Type:             CG
Unknowns:         60579
Memory footprint: 1.85 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    21.52 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        60579         409271     15.53 M (61.61%)
    1         7202         216384      4.52 M (32.57%)
    2          713          38613      1.47 M ( 5.81%)

Iterations: 14
Error:      5.87806e-09

[Profile:       0.412 s] (100.00%)
[ self:         0.001 s] (  0.13%)
[  reading:     0.326 s] ( 79.23%)
[  setup:       0.030 s] (  7.27%)
[  solve:       0.055 s] ( 13.37%)
```

cuda

```
solver_cuda -A matrix_sparse.txt -f matrix_dense.txt solver.type=cg precond.relax.type=damped_jacobi
GeForce GTX 1050 Ti

Solver
======
Type:             CG
Unknowns:         60579
Memory footprint: 1.85 M

Preconditioner
==============
Number of levels:    3
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    16.76 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0        60579         409271     11.87 M (61.61%)
    1         7202         216384      3.42 M (32.57%)
    2          713          38613      1.48 M ( 5.81%)

Iterations: 14
Error:      5.87806e-09

[Profile:       0.649 s] (100.00%)
[ self:         0.230 s] ( 35.41%)
[  reading:     0.335 s] ( 51.68%)
[  setup:       0.060 s] (  9.28%)
[  solve:       0.024 s] (  3.63%)
```

ddemidov commented 3 years ago

You can try to switch to a single-level preconditioner to reduce the setup cost:

builtin

```
$ solver -A matrix_sparse.txt -f matrix_dense.txt \
    -1 solver.type=cg solver.maxiter=500 precond.type=damped_jacobi
Solver
======
Type:             CG
Unknowns:         60579
Memory footprint: 1.85 M

Preconditioner
==============
Relaxation as preconditioner
  Unknowns: 60579
  Nonzeros: 409271
  Memory:   7.17 M

Iterations: 166
Error:      9.3816e-09

[Profile:       0.463 s] (100.00%)
[ self:         0.001 s] (  0.11%)
[  reading:     0.354 s] ( 76.40%)
[  setup:       0.002 s] (  0.44%)
[  solve:       0.107 s] ( 23.04%)
```

cuda

```
$ solver_cuda -A matrix_sparse.txt -f matrix_dense.txt \
    -1 solver.type=cg solver.maxiter=500 precond.type=damped_jacobi
GeForce GTX 1050 Ti

Solver
======
Type:             CG
Unknowns:         60579
Memory footprint: 1.85 M

Preconditioner
==============
Relaxation as preconditioner
  Unknowns: 60579
  Nonzeros: 409271
  Memory:   5.38 M

Iterations: 166
Error:      9.3816e-09

[Profile:       0.629 s] (100.00%)
[ self:         0.235 s] ( 37.36%)
[  reading:     0.327 s] ( 52.07%)
[  setup:       0.007 s] (  1.03%)
[  solve:       0.060 s] (  9.53%)
```
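
In code, the single-level variant replaces the amg class with amgcl::relaxation::as_preconditioner; a sketch, assuming the same Backend typedef and system matrix A as before:

```cpp
#include <amgcl/make_solver.hpp>
#include <amgcl/relaxation/as_preconditioner.hpp>
#include <amgcl/relaxation/damped_jacobi.hpp>
#include <amgcl/solver/cg.hpp>

typedef amgcl::make_solver<
    amgcl::relaxation::as_preconditioner<
        Backend,
        amgcl::relaxation::damped_jacobi
        >,
    amgcl::solver::cg<Backend>
    > SingleLevelSolver;

SingleLevelSolver::params prm;
prm.solver.maxiter = 500; // equivalent of solver.maxiter=500 above
SingleLevelSolver solve(A, prm);
```
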
MakotoChiba commented 3 years ago

Can I output these profiles to a text file through a function, such as mm_write()? My test application's command line is hidden, so I can't access it right now.

One more question, about the same solver (amg + damped_jacobi + cg): GPU memory usage with the cuda backend is much larger than I expected. A 40,704,300^2 matrix system uses about 2.2 GB of memory with the single-precision (float) backend and solver; I checked this in the system monitor. On the other hand, a simple cg solver on cuda built with cublas and cusparse uses only about 1 GB, which is easy to estimate from the matrix size. Of course I understand the multigrid preconditioner needs extra memory in exchange for speed, but is more than 2x expected? I would like to estimate amgcl's memory usage from the matrix size, because I want the application to switch from the cuda backend to the builtin backend when the matrix is too large for the GPU. How can I calculate that, approximately?

ddemidov commented 3 years ago

You can use amgcl::profiler<>, see here for an example: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db.cpp#L23
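
Based on that example, a minimal sketch: amgcl::profiler<> has tic()/toc() and supports operator<<, so its report can be written to any std::ostream, including a file (the file name here is just an illustration):

```cpp
#include <amgcl/profiler.hpp>
#include <fstream>
#include <tuple>

amgcl::profiler<> prof("benchmark");

prof.tic("setup");
Solver solve(A);          // assuming the Solver typedef and matrix A from above
prof.toc("setup");

size_t iters;
double error;
prof.tic("solve");
std::tie(iters, error) = solve(rhs, x);
prof.toc("solve");

std::ofstream log("profile.txt"); // hypothetical output file
log << prof << std::endl;
```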

ddemidov commented 3 years ago

Re memory: amgcl::backend::bytes(solver) should return (roughly) the amount of memory allocated by the solver. That should be proportional to the system size, but the ratio will be specific to your systems. The memory structure for the multigrid hierarchy is more complex (for example, you have at least three matrices on each level instead of one), so you are essentially paying with memory for faster convergence.
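
A sketch of the query (assuming the Solver typedef from above); measuring this for a few representative sizes should give the per-unknown ratio needed for the cuda-vs-builtin switch:

```cpp
#include <iostream>

Solver solve(A);
size_t footprint = amgcl::backend::bytes(solve);
std::cout << "solver footprint: "
          << footprint / (1024.0 * 1024.0) << " MB\n";
```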

MakotoChiba commented 3 years ago

I checked the profiles, and you are right: the GPU solve is almost 7x faster than the CPU, but the setup time was too long. Then I tried the single-level preconditioner; it gives good speed and memory footprint. Now I see that the AMG preconditioner takes a lot of memory. Thank you very much.