Closed: MakotoChiba closed this issue 3 years ago.
Did you try bicgstab or gmres instead of cg?
Yes, I tried both solvers, but they produce NaN. I also tried Eigen's own ConjugateGradient + IncompleteCholesky (not through AMGCL), and it could solve the system.
Looks like the system matrix is singular:
solver -A matrix_sparse.txt -f matrix_dense.txt -p precond.coarse_enough=100000
terminate called after throwing an instance of 'std::runtime_error'
what(): Zero sum in skyline_lu factorization
Aborted (core dumped)
(precond.coarse_enough=100000 is big enough for the direct solver to be used, but that fails with the error that usually means the matrix is singular.)
I am able to solve this with chebyshev relaxation though:
solver -A matrix_sparse.txt -f matrix_dense.txt precond.relax.type=chebyshev
Solver
======
Type: BiCGStab
Unknowns: 60579
Memory footprint: 3.24 M
Preconditioner
==============
Number of levels: 3
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 22.04 M
level unknowns nonzeros memory
---------------------------------------------
0 60579 409271 16.00 M (61.61%)
1 7202 216384 4.58 M (32.57%)
2 713 38613 1.47 M ( 5.81%)
Iterations: 14
Error: 8.49983e-09
[Profile: 0.583 s] (100.00%)
[ reading: 0.327 s] ( 56.07%)
[ setup: 0.034 s] ( 5.86%)
[ solve: 0.221 s] ( 37.98%)
damped_jacobi works even better:
solver -A matrix_sparse.txt -f matrix_dense.txt precond.relax.type=damped_jacobi
Solver
======
Type: BiCGStab
Unknowns: 60579
Memory footprint: 3.24 M
Preconditioner
==============
Number of levels: 3
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 21.52 M
level unknowns nonzeros memory
---------------------------------------------
0 60579 409271 15.53 M (61.61%)
1 7202 216384 4.52 M (32.57%)
2 713 38613 1.47 M ( 5.81%)
Iterations: 9
Error: 3.57144e-09
[Profile: 0.411 s] (100.00%)
[ self: 0.001 s] ( 0.13%)
[ reading: 0.314 s] ( 76.36%)
[ setup: 0.036 s] ( 8.86%)
[ solve: 0.060 s] ( 14.65%)
Thank you very much. I can solve it with damped_jacobi. But I have a question.
"Looks like the system matrix is singular" -- what does this mean, and how can I avoid (or fix) it?
Also, the builtin and cuda backends run at almost the same speed (on a Quadro P2000), and Eigen's CG solver is about 20% faster than amgcl. Do you have any ideas for optimizing and improving the speed? spai0 was slightly faster than Eigen's CG.
"Looks like the system matrix is singular" -- what does this mean, and how can I avoid (or fix) it?
That means the matrix does not have an inverse, although I am not sure if this is the case here. This may be unavoidable in some cases, depending on the problem you are trying to solve. For example, discretization of a Poisson problem with Neumann boundary conditions usually results in a singular matrix.
Also, the builtin and cuda backends run at almost the same speed (on a Quadro P2000), and Eigen's CG solver is about 20% faster than amgcl. Do you have any ideas for optimizing and improving the speed? spai0 was slightly faster than Eigen's CG.
Make sure you are compiling with full optimization (Release mode in VS). Also, the system is probably too small to hide the overhead of using a GPU and thus get a reasonable acceleration from using a cuda backend. You may see better results with larger systems. Also, make sure you are not measuring the first run with cuda backend: cuda caches the compiled kernels, and subsequent runs are usually much faster without the compilation overhead.
Thank you for the explanation. Of course my build for the speed test is in Release mode. Here is the benchmark result. My target problem is Navier-Stokes, evolved for 300 time steps in the benchmark. Only the linear solver part differs between the two codes, and both solvers use AMG + damped Jacobi + CG.
Total evolution time:
amgcl cuda backend: 792.39 sec
amgcl builtin backend: 855.11 sec
Linear solver time in the last evolve (matrix size is 10,031,868 x 10,031,868):
amgcl cuda backend: 2.14 sec
amgcl builtin backend: 2.97 sec
When the system size is small, both backends are almost the same speed; I assume this is because the memory transfer to the GPU is the bottleneck for cuda. But the last system is large enough, I think, and the difference is still only about 38%. Is this expected with amgcl?
Can you provide separate times for the method setup (amgcl constructor) and solve? The setup for the cuda backend is more expensive than the setup for the builtin backend, and may dominate the total time in case the solve is fast enough.
Below are results on my system. Note that the setup for the cuda backend is twice as expensive for such a small matrix, while the solve is only about twice as fast (it should be closer to 4x on my machine). Overall, the total time is approximately the same.
solver -A matrix_sparse.txt -f matrix_dense.txt solver.type=cg precond.relax.type=damped_jacobi
Solver
======
Type: CG
Unknowns: 60579
Memory footprint: 1.85 M
Preconditioner
==============
Number of levels: 3
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 21.52 M
level unknowns nonzeros memory
---------------------------------------------
0 60579 409271 15.53 M (61.61%)
1 7202 216384 4.52 M (32.57%)
2 713 38613 1.47 M ( 5.81%)
Iterations: 14
Error: 5.87806e-09
[Profile: 0.412 s] (100.00%)
[ self: 0.001 s] ( 0.13%)
[ reading: 0.326 s] ( 79.23%)
[ setup: 0.030 s] ( 7.27%)
[ solve: 0.055 s] ( 13.37%)
solver_cuda -A matrix_sparse.txt -f matrix_dense.txt solver.type=cg precond.relax.type=damped_jacobi
GeForce GTX 1050 Ti
Solver
======
Type: CG
Unknowns: 60579
Memory footprint: 1.85 M
Preconditioner
==============
Number of levels: 3
Operator complexity: 1.62
Grid complexity: 1.13
Memory footprint: 16.76 M
level unknowns nonzeros memory
---------------------------------------------
0 60579 409271 11.87 M (61.61%)
1 7202 216384 3.42 M (32.57%)
2 713 38613 1.48 M ( 5.81%)
Iterations: 14
Error: 5.87806e-09
[Profile: 0.649 s] (100.00%)
[ self: 0.230 s] ( 35.41%)
[ reading: 0.335 s] ( 51.68%)
[ setup: 0.060 s] ( 9.28%)
[ solve: 0.024 s] ( 3.63%)
You can try to switch to a single-level preconditioner to reduce the setup cost:
$ solver -A matrix_sparse.txt -f matrix_dense.txt \
-1 solver.type=cg solver.maxiter=500 precond.type=damped_jacobi
Solver
======
Type: CG
Unknowns: 60579
Memory footprint: 1.85 M
Preconditioner
==============
Relaxation as preconditioner
Unknowns: 60579
Nonzeros: 409271
Memory: 7.17 M
Iterations: 166
Error: 9.3816e-09
[Profile: 0.463 s] (100.00%)
[ self: 0.001 s] ( 0.11%)
[ reading: 0.354 s] ( 76.40%)
[ setup: 0.002 s] ( 0.44%)
[ solve: 0.107 s] ( 23.04%)
$ solver_cuda -A matrix_sparse.txt -f matrix_dense.txt \
-1 solver.type=cg solver.maxiter=500 precond.type=damped_jacobi
GeForce GTX 1050 Ti
Solver
======
Type: CG
Unknowns: 60579
Memory footprint: 1.85 M
Preconditioner
==============
Relaxation as preconditioner
Unknowns: 60579
Nonzeros: 409271
Memory: 5.38 M
Iterations: 166
Error: 9.3816e-09
[Profile: 0.629 s] (100.00%)
[ self: 0.235 s] ( 37.36%)
[ reading: 0.327 s] ( 52.07%)
[ setup: 0.007 s] ( 1.03%)
[ solve: 0.060 s] ( 9.53%)
Can I output these profiles to a text file through a function such as mm_write()? My test application hides the command-line console, so I can't see the output right now.
One more question, about the same solver (AMG + damped Jacobi + CG). The GPU memory usage of the cuda backend is much larger than I expected. A 40,704,300 x 40,704,300 matrix system uses about 2.2 GB of memory with the floating-point backend and solver; I checked this in the system monitor. On the other hand, a simple CG solver on cuda built from cublas and cusparse uses only about 1 GB, which is easy to estimate from the matrix size. Of course I understand the multigrid preconditioner needs extra memory in exchange for speed, but is more than 2x expected? I want to estimate amgcl's memory usage from the matrix size, because in the application I would like to switch from the cuda backend to the builtin backend when the matrix is too large for the GPU. Can I calculate that, approximately?
You can use amgcl::profiler<>, see here for an example: https://github.com/ddemidov/amgcl/blob/master/tutorial/1.poisson3Db/poisson3Db.cpp#L23
Regarding memory, amgcl::backend::bytes(solver) should return (roughly) the amount of memory allocated by the solver. That should be proportional to the system size, but the ratio will be specific to your systems. The memory structure for the multigrid is more complex (for example, you have at least three matrices on each level instead of one), so you are basically trading off convergence speed for memory footprint.
I checked the profiles, and you are right: the GPU solve is almost 7x faster than the CPU, but the setup time was too long. Then I tried the single-level preconditioner; it has good speed and memory footprint. I see now that the AMG preconditioner takes a lot of memory. Thank you very much.
I got NaN in some situations when I use the cuda backend with the AMG preconditioner (spai0 relaxation) + CG solver. Even if I use the builtin backend with gauss_seidel or the other relaxations, I hit the same situation. (By the way, the builtin and cuda backends are almost the same speed in my environment...) Only spai1 can solve the problem, but it is really slow.
The cuda backend solver is:

    // value type: float, per the floating-point backend mentioned above
    typedef amgcl::backend::cuda<float> Backend;
    typedef amgcl::make_solver<
        amgcl::amg<
            Backend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::spai0
            >,
        amgcl::solver::cg<Backend>
        > Solver;
And the builtin backend solver is:

    typedef amgcl::backend::builtin<float> Backend;
    typedef amgcl::make_solver<
        amgcl::amg<
            Backend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::gauss_seidel
            >,
        amgcl::solver::cg<Backend>
        > Solver;
Any good ideas for solving this problem? I attached the matrix and rhs written by mm_write(). Thanks.
matrix.zip