KratosMultiphysics / Kratos

Kratos Multiphysics (A.K.A Kratos) is a framework for building parallel multi-disciplinary simulation software. Modularity, extensibility and HPC are the main objectives. Kratos has BSD license and is written in C++ with extensive Python interface.
https://kratosmultiphysics.github.io/Kratos/

Unpredictable result of ResidualCriteria when used with constraints #6148

Open RiccardoRossi opened 4 years ago

RiccardoRossi commented 4 years ago

I am posting this issue because, while working on #6147, I realized that there is some parallelization issue with the ResidualCriteria when constraints are employed.

I took a look at the code and could not find an obvious source of error.

I'll take the occasion, however, to make two comments:

1 - this: https://github.com/KratosMultiphysics/Kratos/blob/master/kratos/solving_strategies/convergencecriterias/residual_criteria.h#L343

is extremely costly in MPI. So costly that I would prefer throwing an "unsupported" error rather than using it this way...

2 - minor: we do not need residual_dof_value to be firstprivate. To be purist, it is better to define it within the loop; on the contrary, it_dof_begin should be firstprivate.

Neither of these explains the parallelization problem; nevertheless, at least 1 is important.

3 -

loumalouomega commented 4 years ago

I noticed this a long time ago; I have already mentioned it several times (always in person, not as an issue).

philbucher commented 4 years ago

> 1 - this: https://github.com/KratosMultiphysics/Kratos/blob/master/kratos/solving_strategies/convergencecriterias/residual_criteria.h#L343
>
> is extremely costly in MPI. So costly that I would prefer throwing an "unsupported" error rather than using it this way...

This was a hotfix after the MPI version was completely broken (it didn't even compile) due to changes that were done in a hurry before the previous release.

I have had this on my todo list forever already...

I think by using GatherValues we could make it work both with and without MPI. Or do you know a better way of accessing only the local entries in the vector, @RiccardoRossi?

RiccardoRossi commented 4 years ago

Yes, GatherValues would be way better (it does one single communication instead of very many small ones). Even so, I would say that the current code is not correct, since the residual associated with some dofs would be computed multiple times if a dof belongs to multiple processors. A better fix is to compute the ActiveDofs in a way that non-local dofs are not added in the computation of the residual. That would require no communication at all...

The OpenMP error, on the contrary, is mysterious to me.

loumalouomega commented 4 years ago

> A better fix is to compute the ActiveDofs in a way that non-local dofs are not added in the computation of the residual. That would require no communication at all...

What do you mean by non-local dofs? The mActiveDofs are computed outside the OMP loop.

RiccardoRossi commented 4 years ago

I am referring to MPI, to answer Philipp.

loumalouomega commented 4 years ago

OK, but what is failing then in OMP?

RiccardoRossi commented 4 years ago

Two different issues, really:

1 - it detects convergence at different moments with OpenMP
2 - the implementation is incorrect (I think) in MPI

philbucher commented 4 years ago

What makes you sure the omp problem is in the criteria? I mean, there is not much code in there

loumalouomega commented 4 years ago

> What makes you sure the omp problem is in the criteria? I mean, there is not much code in there

I agree. I have a test in the contact application with MPC that fails 1/6 of the time (TBeamAxilTetraContactTest, BTW), and commenting out this OMP loop does not solve the problem.

RiccardoRossi commented 4 years ago

I tried the test in #6147 (which has one element and two constraints) with 1000 repetitions.

If I use the DisplacementCriteria it never fails. If I use the ResidualCriteria it fails many times.

When using the ResidualCriteria, the reason for the failure is that convergence is SOMETIMES detected earlier, when the nonlinear problem has not yet fully converged. I checked with valgrind and we have no leaks. Also, the problem disappears with OMP_NUM_THREADS=1.

loumalouomega commented 4 years ago

Yes, but I commented out the OMP loops in the criteria and the problem does not disappear. Maybe the problem is in the RHS computation?

loumalouomega commented 4 years ago

Did you try commenting out the OMP loops in your super simple test?

RiccardoRossi commented 4 years ago

OK, I found the source of the error: it is in the ActiveDofs.

I am printing mActiveDofs around line 333 of residual_criteria, together with rb:

CORRECT RUN:

 |  /           |
 ' /   __| _` | __|  _ \   __|
 . \  |   (   | |   (   |\__ \
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 7.0.0-62673851f3-FullDebug
Compiled with OpenMP support.
Maximum OpenMP threads: 4.
Importing    KratosStructuralMechanicsApplication 
    KRATOS   ___|  |                   |                   |
           \___ \  __|  __| |   |  __| __| |   |  __| _` | |
                 | |   |    |   | (    |   |   | |   (   | |
           _____/ \__|_|   \__,_|\___|\__|\__,_|_|  \__,_|_| MECHANICS
Initializing KratosStructuralMechanicsApplication...
Setup Dofs Time: 0.00355026
Setup System Time: 0.00517196
System Matrix Resize Time: 0.0174384
System Construction Time: 0.0262975
mActiveDofs : [0, 0, 0, 0, 1, 1, 0, 0, 0]
rb : [9](0,0,-6.47151e-16,-10,-1.59465e-15,-10,-6.47151e-16,-10,0)
rDofNum : 16386624
rResidualSolutionNorm : 0
SolvingStrategy:  MESH MOVED 
ResidualBasedBlockBuilderAndSolver: Build time: 0.00445519
ResidualBasedBlockBuilderAndSolver: System solve time: 0.00699715
SolvingStrategy:  MESH MOVED 
mActiveDofs : [0, 0, 0, 0, 1, 1, 0, 0, 0]
rb : [9](0,0,4.90753,-10,4.10783e-15,-0.184948,-4.90753,-10,0)
rDofNum : 16386624
rResidualSolutionNorm : 0
ResidualBasedBlockBuilderAndSolver: Build time: 0.00750696
ResidualBasedBlockBuilderAndSolver: System solve time: 0.00693493
SolvingStrategy:  MESH MOVED 
mActiveDofs : [0, 0, 0, 0, 1, 1, 0, 0, 0]
rb : [9](0,0,4.99997,-10,8.88178e-16,-6.76344e-05,-4.99997,-10,0)
rDofNum : 16386624
rResidualSolutionNorm : 0.184948
ResidualBasedBlockBuilderAndSolver: Build time: 0.00550778
ResidualBasedBlockBuilderAndSolver: System solve time: 0.00543896
SolvingStrategy:  MESH MOVED 
mActiveDofs : [0, 0, 0, 0, 1, 1, 0, 0, 0]
rb : [9](0,0,5,-10,-8.77076e-15,-9.06342e-12,-5,-10,0)
rDofNum : 16386624
rResidualSolutionNorm : 6.76344e-05
ResidualBasedBlockBuilderAndSolver: Build time: 0.00885934
ResidualBasedBlockBuilderAndSolver: System solve time: 0.00709531
SolvingStrategy:  MESH MOVED 
mActiveDofs : [0, 0, 0, 0, 1, 1, 0, 0, 0]
rb : [9](0,0,5,-10,9.88098e-15,1.11022e-14,-5,-10,0)
rDofNum : 16386624
rResidualSolutionNorm : 9.06342e-12
NR-Strategy: Convergence achieved after 4 / 20 iterations
BuilderAndSolver: Clear Function called
BuilderAndSolver: Clear Function called
.
----------------------------------------------------------------------
Ran 1 test in 0.799s

OK

FAILING RUN:

rrossi@Latitude-E5450:~/kratos_git/applications/StructuralMechanicsApplication/tests$ python3 test_linear_constraints.py 
 |  /           |
 ' /   __| _` | __|  _ \   __|
 . \  |   (   | |   (   |\__ \
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 7.0.0-62673851f3-FullDebug
Compiled with OpenMP support.
Maximum OpenMP threads: 4.
Importing    KratosStructuralMechanicsApplication 
    KRATOS   ___|  |                   |                   |
           \___ \  __|  __| |   |  __| __| |   |  __| _` | |
                 | |   |    |   | (    |   |   | |   (   | |
           _____/ \__|_|   \__,_|\___|\__|\__,_|_|  \__,_|_| MECHANICS
Initializing KratosStructuralMechanicsApplication...
Setup Dofs Time: 0.00329241
Setup System Time: 0.00341786
System Matrix Resize Time: 0.0199876
System Construction Time: 0.0269566
mActiveDofs : [0, 0, 0, 0, 0, 0, 0, 0, 0]
rb : [9](0,0,-6.47151e-16,-10,-1.59465e-15,-10,-6.47151e-16,-10,0)
rDofNum : 46598720
rResidualSolutionNorm : 0
SolvingStrategy:  MESH MOVED 
ResidualBasedBlockBuilderAndSolver: Build time: 0.00351175
ResidualBasedBlockBuilderAndSolver: System solve time: 0.00676763
SolvingStrategy:  MESH MOVED 
mActiveDofs : [0, 0, 0, 0, 0, 0, 0, 0, 0]
rb : [9](0,0,4.90753,-10,4.10783e-15,-0.184948,-4.90753,-10,0)
rDofNum : 46598720
rResidualSolutionNorm : 0
NR-Strategy: Convergence achieved after 1 / 20 iterations
BuilderAndSolver: Clear Function called
FBuilderAndSolver: Clear Function called

======================================================================
FAIL: test_constraints (__main__.TestLinearConstraints)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_linear_constraints.py", line 185, in test_constraints
    self.assertTrue(tang_2.norm_2() < 1e-12)
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.277s

FAILED (failures=1)
RiccardoRossi commented 4 years ago

OK, I think I understand what is happening.

ActiveDofs is a std::vector<bool>. That is an evil container, in that it packs multiple bools into the same machine word. I think that in this context the loops

    #pragma omp parallel for
    for(int i=0; i<static_cast<int>(rActiveDofs.size()); ++i) {
        rActiveDofs[i] = true;
    }

    #pragma omp parallel for
    for (int i = 0; i<static_cast<int>(rDofSet.size()); ++i) {
        const auto it_dof = rDofSet.begin() + i;
        if (it_dof->IsFixed()) {
            rActiveDofs[it_dof->EquationId()] = false;
        }
    }

are not safe to run in parallel, even though they would be if we used, for example, a std::vector<int>.

RiccardoRossi commented 4 years ago

I think we have a fix in #6160.

...nasty one...

RiccardoRossi commented 4 years ago

@KratosMultiphysics/technical-committee I think we should backport this one to the release...

roigcarlo commented 4 years ago

Added to the milestone

RiccardoRossi commented 4 years ago

So the parallelization behaviour is OK now.

We still have the MPI inefficiency of the ResidualCriteria.

philbucher commented 4 years ago

@RiccardoRossi I think this is solved, isn't it?

RiccardoRossi commented 4 years ago

Yes. I think it is kept open because the implementation is VERY INEFFICIENT in MPI.