[Core] - AMGCL Solution differs with different number of iterations/threads

sunethwarna commented 10 months ago

Description Recently, AMGCL started to give totally different solutions (refer sol_1, and sol_2) for

Different number of max_iteration
Different number of threads (solution for threads = 10 differs from threads = 15)

It throws a warning saying it is not converged to 1e-12 (but it is converged to 1e-11, which should be close enough to have a small difference in the solution). The difference in the solution is very large which was causing the one of the tests to fail in CI (refer #11760).

If you reduce the tolerance to 1e-10, then the difference between two solutions (sol_1 and sol_2) is 0.0.

Following is the script to replicate the bug. I am attaching the A.mm, b.mm.rhs and python script in the zip file. data.zip

import scipy
import numpy
import KratosMultiphysics as Kratos
from KratosMultiphysics.python_linear_solver_factory import ConstructSolver

def Solve(lhs: Kratos.CompressedMatrix, rhs: Kratos.Vector, number_of_iterations: int) -> Kratos.Vector:
    amgcl_solver_settings = Kratos.Parameters("""
        {
            "preconditioner_type"            : "amg",
            "solver_type"                    : "amgcl",
            "smoother_type"                  : "ilu0",
            "krylov_type"                    : "gmres",
            "coarsening_type"                : "aggregation",
            "max_iteration"                  : 100,
            "provide_coordinates"            : false,
            "gmres_krylov_space_dimension"   : 100,
            "verbosity"                      : 1,
            "tolerance"                      : 1e-12,
            "scaling"                        : false,
            "block_size"                     : 1,
            "use_block_matrices_if_possible" : true,
            "coarse_enough"                  : 1000,
            "max_levels"                     : -1,
            "pre_sweeps"                     : 1,
            "post_sweeps"                    : 1,
            "use_gpgpu"                      : false
        }""")
    amgcl_solver_settings["max_iteration"].SetInt(number_of_iterations)

    sol = Kratos.Vector(rhs.Size(), 0.0)
    linear_solver = ConstructSolver(amgcl_solver_settings)
    linear_solver.Solve(lhs, sol, rhs)
    return sol

if __name__ == "__main__":

    scipy_sparse_lhs: numpy.ndarray = numpy.array(scipy.io.mmread("A.mm").toarray())
    scipy_sparse_rhs = numpy.array(scipy.io.mmread("b.mm.rhs"))

    # now add values to the sparse matrix
    kratos_lhs = Kratos.CompressedMatrix(scipy_sparse_lhs.shape[0], scipy_sparse_lhs.shape[1])
    kratos_rhs = Kratos.Vector(scipy_sparse_lhs.shape[0], 0.0)
    for i in range(scipy_sparse_lhs.shape[0]):
        kratos_rhs[i] = scipy_sparse_rhs[i, 0]
        for j in range(scipy_sparse_lhs.shape[1]):
            if scipy_sparse_lhs[i, j] != 0.0:
                kratos_lhs[i, j] = scipy_sparse_lhs[i,j]

    sol_1 = Solve(kratos_lhs, kratos_rhs, 100) # solve for 100 iterations
    sol_2 = Solve(kratos_lhs, kratos_rhs, 200) # solve for 200 iterations

    print(numpy.linalg.norm(sol_1 - sol_2))

Scope

KratosCore

To Reproduce Unzip the contents of the attached zip file and run test_linear_solver.py

Expected behavior To print 0.0.

Environment

OS: Manjaro Linux (latest), Ubuntu
Branch: master
Python 3.11.5

@roigcarlo @matekelemen @Igarizza

ddemidov commented 10 months ago

How recently? I dont think that amgcl code in Kratos has been updated recently.

roigcarlo commented 10 months ago

Weren't there some PR last week / two weeks ago about the AMGCL wrappers? #10691, #10390, #11687, but not sure if they can be related in any way

sunethwarna commented 10 months ago

@ddemidov we found out this when we were debugging the test failures in optimization App in CI. So I think this started to fail like 1 or two weeks back (not sure when)

ddemidov commented 10 months ago

This one is the most recent: #11687 (two weeks ago), others are from january and 2022. The only thing that changed in #11687 was that PRESSURE variable stopped being hard-coded. Could that be the reason?

ddiezrod commented 10 months ago

My change affects only amgcl_ns so I do no think that could change your result...

sunethwarna commented 10 months ago

@ddemidov @ddiezrod Can you try the script (in data.zip) to see whether this is re-producible in your PCs?

my master commit hash is: edcb5f605aed4cedd498ef306025b12343e4ec2b

ddemidov commented 10 months ago

I am sorry, I don't have a working Kratos environment on my machine

matekelemen commented 10 months ago

It's not too surprising that the max iteration count has an effect on the solutions if the solver fails to converge. What's more worrysome is that the number of threads has an impact as well.

sunethwarna commented 10 months ago

@matekelemen In this case it makes a difference since the warning says it is converged upto 1e-11, and it cannot reach 1e-12. So the solution should be the same for both iterations since both number of iterations reach 1e-11 convergence.

matekelemen commented 10 months ago

I don't think at all that the results should be the same.

I assume the solver continues iterating until one of the following is satisfied: 1) the residual is reduced below the target tolerance 2) the max iteration limit is reached

Both cases terminate on condition 2, and since the max iteration limit is different, the solver performs different number of iterations on the result and the results will obviously be different. Why do you think otherwise?

ddiezrod commented 10 months ago

@sunethwarna I am running your case with current master.

If the solution has not converged its normal that you get different values using different max_iterations as @matekelemen says
I think the main problem here is that you are using a very low tolerance (1e-12) so you are very close to machine precision. A linear solver simply cannot go that far. It surprised me a little that result changes when changing the number of threads (it also happened to me in my machine), but again, as long as this is happening when comparing such small values, I do no think there is a reson to worry too much.
According to @ddemidov there has not been any change in amgcl in the last months, if you want to make sure nothing got broken you can recompile an older version, but from my experience, this is all normal.

sunethwarna commented 10 months ago

If the solution has not converged its normal that you get different values using different max_iterations as @matekelemen says

The issue is, the residual it prints from AMGCL says, it is converged around 1e-11, so the solution should not be so much different if I run it for couple of more iterations. Shoud it be? Am i missing something?

I think the main problem here is that you are using a very low tolerance (1e-12) so you are very close to machine precision. A linear solver simply cannot go that far. It surprised me a little that result changes when changing the number of threads (it also happened to me in my machine), but again, as long as this is happening when comparing such small values, I do no think there is a reson to worry too much.

This started to happen around 1e-6, I put 1e-12 to get rid of the tolerance issues. The change of the norms are aroung 6000.0 (the printed value) which is not a small change in two solutions as far as i see.

According to @ddemidov there has not been any change in amgcl in the last months, if you want to make sure nothing got broken you can recompile an older version, but from my experience, this is all normal.

Worrying part is, we see different solutions in MacOS than the Ubuntu and Manjaro. I will get an aold version of Kratos and check again.

rubenzorrilla commented 10 months ago

Worrying part is, we see different solutions in MacOS than the Ubuntu and Manjaro. I will get an aold version of Kratos and check again.

Could this be a slight difference in the underlying linear algebra implementation of each system?

ddiezrod commented 10 months ago

Ok, I took a look to the actual result vector and that is worrying. This is the output printing sol_1 and sol_2.

 |  /           |
 ' /   __| _` | __|  _ \   __|
 . \  |   (   | |   (   |\__ \
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 9.4."1"-core/adding-vars-from-processes-4cba1d1b42-RelWithDebInfo-x86_64
           Compiled for Windows and Python3.9 with MSVC-1928
Compiled with threading support.
Maximum number of threads: 10.
Process Id: 31208
Linear-Solver-Factory: Constructing a regular (non-complex) linear-solver
[WARNING] AMGCL Linear Solver: Non converged linear solution. [5.34367e-11 > 1e-12]
Linear-Solver-Factory: Constructing a regular (non-complex) linear-solver
[WARNING] AMGCL Linear Solver: Non converged linear solution. [5.20678e-11 > 1e-12]
[27](2.00165e+06,2.00003e+06,2.00003e+06,-1.00051e+06,-1.00036e+06,-1.00036e+06,-1.00114e+06,-999662,-999662,-1.00114e+06,-999662,-999662,2.00165e+06,2.00003e+06,2.00003e+06,-1.00051e+06,-1.00036e+06,-1.00036e+06,-1.00051e+06,-1.00036e+06,-1.00036e+06,-1.00114e+06,-999662,-999662,2.00165e+06,2.00003e+06,2.00003e+06)
[27](2.001e+06,2.00213e+06,2.00213e+06,-1.0003e+06,-1.00105e+06,-1.00105e+06,-1.00069e+06,-1.00108e+06,-1.00108e+06,-1.00069e+06,-1.00108e+06,-1.00108e+06,2.001e+06,2.00213e+06,2.00213e+06,-1.0003e+06,-1.00105e+06,-1.00105e+06,-1.0003e+06,-1.00105e+06,-1.00105e+06,-1.00069e+06,-1.00108e+06,-1.00108e+06,2.001e+06,2.00213e+06,2.00213e+06)
6598.907873980675

There is a big difference in the results, even if the residual norm is small. I wonder where this is coming from....

ddemidov commented 10 months ago

The convergence criteria is the relative error, so could it be that the rhs norm is huge, which makes the differences between solutions relatively small?

sunethwarna commented 10 months ago

@ddemidov The RHS values are between 0 - 0.5, the lhs values are around 1e-1, 1e-2

matekelemen commented 10 months ago

I think the matrix' conditioning is more to blame (condition number is ~5e13), so I think oscillations on the magnitude of what @ddiezrod shows are to be expected.

ddiezrod commented 10 months ago

@matekelemen Yes, I was thinking about that. Does this system come from a real problem?

sunethwarna commented 10 months ago

@ddiezrod This is a matrix which I took from the failing test of OptApp. When I switch to a different linear solver, the test started to Pass (Now it is failing in MeshMovingApplication where it is again using AMGCL)

loumalouomega commented 10 months ago

I think the matrix' conditioning is more to blame (condition number is ~5e13), so I think oscillations on the magnitude of what @ddiezrod shows are to be expected.

Is the matrix scaled?, there is an option in the builder and solver

sunethwarna commented 10 months ago

Following is our observation The tests which was failing in OptApp was not changed for last 5 months. And they started failing recently. [It is testing an area which is isolated even in OptApp]. Now tests are passing with "skyline_lu" solver.

Following are my concerns:

Why it started to fail recently? (even if it already had a high condition number already)
Why it fails in some operating systems when AMGCL is used?

We are also lost in here to identify the problem :/

matekelemen commented 10 months ago

When I switch to a different linear solver, the test started to Pass (Now it is failing in MeshMovingApplication where it is again using AMGCL)

AMGCL also manages to solve the system if you change the subsolver to conjugate gradients. I guess fine-tuning the solver is unavoidable with ill-conditioned systems like this.

Why it fails in some operating systems when AMGCL is used?

I used to work on another python project which also used hard-coded values as references to system tests. I also noticed small deviations across machines (even between different machines running the same Linux distro and same updates). I never found out what the issue was :/

loumalouomega commented 10 months ago

Following is our observation The tests which was failing in OptApp was not changed for last 5 months. And they started failing recently. [It is testing an area which is isolated even in OptApp]. Now tests are passing with "skyline_lu" solver.

Following are my concerns:
1. Why it started to fail recently? (even if it already had a high condition number already)

2. Why it fails in some operating systems when AMGCL is used?
We are also lost in here to identify the problem :/

Can you check it is related with this default change?: https://github.com/KratosMultiphysics/Kratos/pull/11138

sunethwarna commented 10 months ago

@loumalouomega I will try it in the coming days and update here :)

RiccardoRossi commented 10 months ago

guys, on one side i agree about the comment about the system conditioning. A system condition of 1e13 implies the system is essentially undefined.

aside, please consider that when you do floating point operations in paraellel you are loosing the predictability.

just think of adding

1e-4 + 1e10 + 1e-1

the result will be different depending on the order at which you do the sum ... which is not guaranteed when you are in parallel

matekelemen commented 10 months ago

when you do floating point operations in parallel you are losing predictability

If that really is the issue here, I'm not sure how to deal with it.

Increasing the tolerance is not ideal because it's pretty much a hack, and we can't really produce an upper bound that guarantees to cover these kinds of deviations.
Restricting the thread count to 1 is also a no-go because we don't check for race conditions specifically (e.g.: with thread sanitizer), but hope that we catch them by checking the results of standard tests.

I assume the solution will be to run these tests with a predefined set of threads (same as we do with MPI processes), but I don't know how to come up with robust tolerances.

KratosMultiphysics / Kratos

[Core] - AMGCL Solution differs with different number of iterations/threads #11763