Open sunethwarna opened 10 months ago
How recently? I don't think the amgcl code in Kratos has been updated recently.
Weren't there some PRs last week / two weeks ago about the AMGCL wrappers? #10691, #10390, #11687, but I am not sure whether they can be related in any way.
@ddemidov we found this out while debugging the test failures in the optimization app in CI. So I think this started to fail one or two weeks back (not sure exactly when).
This one is the most recent: #11687 (two weeks ago); the others are from January and 2022. The only thing that changed in #11687 was that the PRESSURE variable stopped being hard-coded. Could that be the reason?
My change affects only amgcl_ns, so I do not think it could change your result...
@ddemidov @ddiezrod Can you try the script (in data.zip) to see whether this is reproducible on your PCs?
my master commit hash is: edcb5f605aed4cedd498ef306025b12343e4ec2b
I am sorry, I don't have a working Kratos environment on my machine
It's not too surprising that the max iteration count has an effect on the solutions if the solver fails to converge. What's more worrisome is that the number of threads has an impact as well.
@matekelemen In this case it makes a difference, since the warning says it converged up to 1e-11 and it cannot reach 1e-12. So the solution should be the same for both iteration limits, since both runs reach 1e-11 convergence.
I don't think at all that the results should be the same.
I assume the solver continues iterating until one of the following is satisfied: 1) the residual is reduced below the target tolerance 2) the max iteration limit is reached
Both cases terminate on condition 2, and since the max iteration limits are different, the solver performs a different number of iterations and the results will obviously differ. Why do you think otherwise?
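The termination logic described above can be sketched with a toy iterative solver (a minimal Jacobi sweep, not AMGCL's actual code; all names here are illustrative). When neither run converges, each stops at whatever iterate the limit allows, so different limits yield different answers:

```python
import numpy as np

# Sketch of the two stopping conditions: iterate until either the relative
# residual drops below `tol` (condition 1) or `max_iter` is hit (condition 2).
def jacobi_solve(A, b, tol=1e-12, max_iter=100):
    x = np.zeros_like(b)
    D = np.diag(A)
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        r = b - A @ x
        rel_res = np.linalg.norm(r) / b_norm
        if rel_res < tol:            # condition 1: converged
            return x, it, rel_res
        x = x + r / D                # one Jacobi sweep
    return x, max_iter, rel_res      # condition 2: iteration limit reached

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
# An unreachable tolerance forces both runs onto condition 2:
x1, it1, _ = jacobi_solve(A, b, tol=1e-30, max_iter=5)
x2, it2, _ = jacobi_solve(A, b, tol=1e-30, max_iter=10)
# x1 and x2 are different iterates of the same sequence, hence different.
```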
@sunethwarna I am running your case with current master.
If the solution has not converged, it's normal that you get different values using different max_iterations, as @matekelemen says.
The issue is, the residual AMGCL prints says it converged to around 1e-11, so the solution should not be so different if I run it for a couple more iterations. Should it be? Am I missing something?
I think the main problem here is that you are using a very low tolerance (1e-12), so you are very close to machine precision. A linear solver simply cannot go that far. It surprised me a little that the result changes with the number of threads (it also happened on my machine), but as long as this only happens when comparing such small values, I do not think there is a reason to worry too much.
This started to happen around 1e-6; I put 1e-12 to get rid of the tolerance issues. The change in the norms is around 6000.0 (the printed value), which is not a small difference between two solutions as far as I can see.
According to @ddemidov there has not been any change in amgcl in the last months, if you want to make sure nothing got broken you can recompile an older version, but from my experience, this is all normal.
The worrying part is that we see different solutions on macOS than on Ubuntu and Manjaro. I will get an old version of Kratos and check again.
Could this be a slight difference in the underlying linear algebra implementation of each system?
Ok, I took a look at the actual result vector, and that is worrying. This is the output printing sol_1 and sol_2.

```
 |  /           |
 ' /   __| _` | __|  _ \   __|
 . \  |   (   | |   (   |\__ \
_|\_\_|  \__,_|\__|\___/ ____/
        Multi-Physics 9.4."1"-core/adding-vars-from-processes-4cba1d1b42-RelWithDebInfo-x86_64
        Compiled for Windows and Python3.9 with MSVC-1928
        Compiled with threading support.
        Maximum number of threads: 10.
        Process Id: 31208
Linear-Solver-Factory: Constructing a regular (non-complex) linear-solver
[WARNING] AMGCL Linear Solver: Non converged linear solution. [5.34367e-11 > 1e-12]
Linear-Solver-Factory: Constructing a regular (non-complex) linear-solver
[WARNING] AMGCL Linear Solver: Non converged linear solution. [5.20678e-11 > 1e-12]
[27](2.00165e+06,2.00003e+06,2.00003e+06,-1.00051e+06,-1.00036e+06,-1.00036e+06,-1.00114e+06,-999662,-999662,-1.00114e+06,-999662,-999662,2.00165e+06,2.00003e+06,2.00003e+06,-1.00051e+06,-1.00036e+06,-1.00036e+06,-1.00051e+06,-1.00036e+06,-1.00036e+06,-1.00114e+06,-999662,-999662,2.00165e+06,2.00003e+06,2.00003e+06)
[27](2.001e+06,2.00213e+06,2.00213e+06,-1.0003e+06,-1.00105e+06,-1.00105e+06,-1.00069e+06,-1.00108e+06,-1.00108e+06,-1.00069e+06,-1.00108e+06,-1.00108e+06,2.001e+06,2.00213e+06,2.00213e+06,-1.0003e+06,-1.00105e+06,-1.00105e+06,-1.0003e+06,-1.00105e+06,-1.00105e+06,-1.00069e+06,-1.00108e+06,-1.00108e+06,2.001e+06,2.00213e+06,2.00213e+06)
6598.907873980675
```
There is a big difference in the results, even if the residual norm is small. I wonder where this is coming from....
The convergence criterion is the relative error, so could it be that the RHS norm is huge, which makes the differences between the solutions relatively small?
@ddemidov The RHS values are between 0 and 0.5, and the LHS values are around 1e-1, 1e-2.
I think the matrix's conditioning is more to blame (the condition number is ~5e13), so I think oscillations of the magnitude @ddiezrod shows are to be expected.
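The mechanism behind this: the relative error of a computed solution is bounded only by cond(A) times the relative residual, so with cond(A) ~ 5e13 two "solutions" with equally tiny residuals can be far apart. A toy demonstration with a deliberately ill-conditioned 2x2 system (numbers chosen here for illustration, not taken from the Kratos matrix):

```python
import numpy as np

# Toy system with cond(A) ~ 4e13, comparable to the ~5e13 reported above.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-13]])
b = np.array([2.0, 2.0])

x_true = np.array([2.0, 0.0])    # exact solution of A x = b
x_other = np.array([0.0, 2.0])   # a very different vector...

# ...yet both have relative residuals at or below ~1e-13:
res_true = np.linalg.norm(b - A @ x_true) / np.linalg.norm(b)
res_other = np.linalg.norm(b - A @ x_other) / np.linalg.norm(b)

cond = np.linalg.cond(A)                       # ~4e13
gap = np.linalg.norm(x_true - x_other)         # ~2.83: huge solution gap
```

Both candidates would pass a 1e-11 residual check, which is why a tiny printed residual does not guarantee agreement between two runs on such a matrix.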
@matekelemen Yes, I was thinking about that. Does this system come from a real problem?
@ddiezrod This is a matrix which I took from the failing test of OptApp. When I switch to a different linear solver, the test starts to pass (now it is failing in MeshMovingApplication, where AMGCL is used again).
> I think the matrix's conditioning is more to blame (the condition number is ~5e13), so I think oscillations of the magnitude @ddiezrod shows are to be expected.
Is the matrix scaled? There is an option in the builder and solver.
Following is our observation: the tests which were failing in OptApp had not been changed for the last 5 months, and they started failing recently. (They test an area which is isolated even within OptApp.) Now the tests pass with the "skyline_lu" solver.
Following are my concerns:
1. Why did it start to fail recently (even though it already had a high condition number)?
2. Why does it fail on some operating systems when AMGCL is used?
We are also lost here trying to identify the problem :/
> When I switch to a different linear solver, the test starts to pass (now it is failing in MeshMovingApplication, where AMGCL is used again)
AMGCL also manages to solve the system if you change the subsolver to conjugate gradients. I guess fine-tuning the solver is unavoidable with ill-conditioned systems like this.
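For reference, switching the subsolver in Kratos' AMGCL wrapper is a settings change. The sketch below shows what such settings might look like; the exact parameter keys depend on the Kratos version, so treat every key and value here as an assumption to be checked against your version's amgcl_solver, not a verified configuration:

```python
# Hypothetical sketch of AMGCL wrapper settings selecting a CG subsolver.
# Key names are assumptions; verify them against your Kratos version.
amgcl_settings = {
    "solver_type": "amgcl",
    "krylov_type": "cg",           # subsolver: conjugate gradients
    "smoother_type": "ilu0",
    "coarsening_type": "aggregation",
    "tolerance": 1e-10,
    "max_iteration": 500,
}
```

In Kratos this kind of dictionary is typically wrapped in a Parameters object before being handed to the linear solver factory.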
> Why does it fail on some operating systems when AMGCL is used?
I used to work on another python project which also used hard-coded values as references for system tests. I also noticed small deviations across machines (even between different machines running the same Linux distro with the same updates). I never found out what the issue was :/
Can you check whether it is related to this default change?: https://github.com/KratosMultiphysics/Kratos/pull/11138
@loumalouomega I will try it in the coming days and update here :)
Guys, on one hand I agree with the comment about the system conditioning. A condition number of 1e13 implies the system is essentially ill-defined numerically.
As an aside, please consider that when you do floating-point operations in parallel you lose predictability.
just think of adding
1e-4 + 1e10 + 1e-1
the result will differ depending on the order in which you do the sum ... and that order is not guaranteed when you are running in parallel
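This non-associativity is easy to demonstrate (using the classic 0.1/0.2/0.3 example, where the inequality is guaranteed in IEEE-754 doubles):

```python
import math

# Floating-point addition is not associative, so a parallel reduction whose
# operand order varies between runs can produce slightly different sums.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
# left != right: same numbers, different summation order.

# math.fsum computes a correctly rounded sum independent of operand order,
# which is one way to get a reproducible reduction (at a performance cost).
order_free = math.fsum([0.1, 0.2, 0.3]) == math.fsum([0.3, 0.2, 0.1])
```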
> when you do floating point operations in parallel you are losing predictability
If that really is the issue here, I'm not sure how to deal with it.
Increasing the tolerance is not ideal because it's pretty much a hack, and we can't really produce an upper bound that is guaranteed to cover these kinds of deviations.
Restricting the thread count to 1 is also a no-go, because we don't check for race conditions specifically (e.g. with a thread sanitizer) but hope to catch them by checking the results of standard tests.
I assume the solution will be to run these tests with a predefined set of thread counts (same as we do with MPI processes), but I don't know how to come up with robust tolerances.
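One way to pin the thread count per test run, assuming the solver is parallelised with OpenMP (OMP_NUM_THREADS is the standard OpenMP environment variable; the import shown is hypothetical usage, and this only fixes the partitioning, it does not guarantee bitwise reproducibility across OSes or BLAS builds):

```python
import os

# Pin the OpenMP thread count to a fixed, test-defined value. This must
# happen before the native library initialises its OpenMP runtime, i.e.
# before the solver module is imported.
os.environ["OMP_NUM_THREADS"] = "4"

# import KratosMultiphysics  # hypothetical: must come after the env is set
```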
Description
Recently, AMGCL started to give totally different solutions (refer `sol_1` and `sol_2`) for different `max_iteration` values. It throws a warning saying it is not converged to 1e-12 (but it is converged to 1e-11, which should be close enough to give only a small difference in the solution). The difference in the solutions is very large, which was causing one of the tests to fail in CI (refer #11760).
If you reduce the tolerance to 1e-10, then the difference between the two solutions (`sol_1` and `sol_2`) is 0.0.
Following is the script to replicate the bug. I am attaching `A.mm`, `b.mm.rhs` and the python script in the zip file: data.zip

Scope

To Reproduce
Unzip the contents of the attached zip file and run `test_linear_solver.py`.

Expected behavior
To print 0.0.

Environment

@roigcarlo @matekelemen @Igarizza