KratosMultiphysics / Kratos

Kratos Multiphysics (A.K.A Kratos) is a framework for building parallel multi-disciplinary simulation software. Modularity, extensibility and HPC are the main objectives. Kratos has BSD license and is written in C++ with extensive Python interface.
https://kratosmultiphysics.github.io/Kratos/
Other
1.04k stars 246 forks source link

[Core] Race condition in SetValue operation during geometry lock #8916

Closed marcnunezc closed 3 years ago

marcnunezc commented 3 years ago

Description A few days ago a potential flow test was failing in the CI. The error trace was something like this:

Thread 5 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f46bf7fc900 (LWP 4903)]
Kratos::VariableData::SourceKey (this=0x0) at /root/Kratos/kratos/containers/variable_data.h:194
194             return mpSourceVariable->mKey;
(gdb) up
#1  0x00007f46cfa63a14 in Kratos::DataValueContainer::IndexCheck::operator() (this=0x7f46bf7fb800, I={...}) at /root/Kratos/kratos/containers/data_value_container.h:354
354                 return I.first->SourceKey() == mI;
(gdb) up
#2  0x00007f46cf7af620 in __gnu_cxx::__ops::_Iter_pred<Kratos::DataValueContainer::IndexCheck>::operator()<__gnu_cxx::__normal_iterator<std::pair<Kratos::VariableData const*, void*> const*, std::vector<std::pair<Kratos::VariableData const*, void*>, std::allocator<std::pair<Kratos::VariableData const*, void*> > > > > (this=0x7f46bf7fb800, __it={first = 0x0, second = 0x3667580})
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/predefined_ops.h:283
283             { return bool(_M_pred(*__it)); }
(gdb) up
#3  0x00007f46cf7af4f3 in std::__find_if<__gnu_cxx::__normal_iterator<std::pair<Kratos::VariableData const*, void*> const*, std::vector<std::pair<Kratos::VariableData const*, void*>, std::allocator<std::pair<Kratos::VariableData const*, void*> > > >, __gnu_cxx::__ops::_Iter_pred<Kratos::DataValueContainer::IndexCheck> > (__first={first = 0x0, second = 0x3667580}, __last={first = 0x40, second = 0x61}, __pred=...)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_algo.h:140
140               if (__pred(__first))
(gdb) up
#4  0x00007f46cf7af356 in std::__find_if<__gnu_cxx::__normal_iterator<std::pair<Kratos::VariableData const*, void*> const*, std::vector<std::pair<Kratos::VariableData const*, void*>, std::allocator<std::pair<Kratos::VariableData const*, void*> > > >, __gnu_cxx::__ops::_Iter_pred<Kratos::DataValueContainer::IndexCheck> > (__first={first = 0x0, second = 0x3667580}, __last={first = 0x40, second = 0x61}, __pred=...)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_algo.h:161
161           return __find_if(__first, __last, __pred,
(gdb) up
#5  0x00007f46cf7af2da in std::find_if<__gnu_cxx::__normal_iterator<std::pair<Kratos::VariableData const*, void*> const*, std::vector<std::pair<Kratos::VariableData const*, void*>, std::allocator<std::pair<Kratos::VariableData const*, void*> > > >, Kratos::DataValueContainer::IndexCheck> (__first={first = 0x0, second = 0x3667580}, __last={first = 0x40, second = 0x61}, __pred=...)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_algo.h:3963
3963          return std::__find_if(__first, __last,
(gdb) up
#6  0x00007f46cf7b3fb6 in Kratos::DataValueContainer::GetValue<bool> (this=0x343ad98, rThisVariable=...) at /root/Kratos/kratos/containers/data_value_container.h:210
210             if ((i = std::find_if(mData.begin(), mData.end(), IndexCheck(rThisVariable.SourceKey())))  != mData.end())
(gdb) up
#7  0x00007f46cf536716 in Kratos::Node<3ul, Kratos::Dof<double> >::GetValue<Kratos::Variable<bool> > (this=0x343ad20, rThisVariable=...) at /root/Kratos/kratos/includes/node.h:507
507             return mData.GetValue(rThisVariable);
(gdb) up
#8  0x00007f46cf532904 in Kratos::PotentialFlowUtilities::GetPotentialOnNormalElement<2, 3> (rElement=...)
    at /root/Kratos/applications/CompressiblePotentialFlowApplication/custom_utilities/potential_flow_utilities.cpp:41
41                  if (!r_geometry[i].GetValue(TRAILING_EDGE)) {
(gdb) up

https://github.com/KratosMultiphysics/Kratos/blob/7c86ef788ebd32a25591f8a364be9eb57a974593/applications/CompressiblePotentialFlowApplication/custom_utilities/potential_flow_utilities.cpp#L41

which is called in here in ComputeVelocity():

https://github.com/KratosMultiphysics/Kratos/blob/7c86ef788ebd32a25591f8a364be9eb57a974593/applications/CompressiblePotentialFlowApplication/custom_elements/compressible_potential_flow_element.cpp#L852-L858

Which is called just after ComputePotentialJump()

https://github.com/KratosMultiphysics/Kratos/blob/7c86ef788ebd32a25591f8a364be9eb57a974593/applications/CompressiblePotentialFlowApplication/custom_elements/compressible_potential_flow_element.cpp#L154-L158 which does some operation doing a Lock(): https://github.com/KratosMultiphysics/Kratos/blob/7c86ef788ebd32a25591f8a364be9eb57a974593/applications/CompressiblePotentialFlowApplication/custom_elements/compressible_potential_flow_element.cpp#L836-L848

This has been solved in #8904 by removing the conflicting code from the element (which was needed to be done anyway due to code duplication and the fact that it is a postprocess quantity).

After further debugging with @roigcarlo it turns out that the issue would have been solved as well by properly initializing the POTENTIAL_JUMP value before starting the analysis, e.g:

VariableUtils().SetNonHistoricalVariableToZero(POTENTIAL_JUMP, root_model_part.Nodes());

By adding this line at some point before the start of the analysis, the test works fine.

The point to discuss is:

Is it expected that a race condition occurs with SetValue in a SetLock/UnsetLock operation? We would expect that the SetLock/UnsetLock would prevent the race condition from happening even if the variable was not set beforehand. Any thoughts?

Scope Where in Kratos is this bug happening. E.g.

To Reproduce Compile cps/ci-test-1 branch and run test_CompressiblePotentialFlowApplication.py tests with as many threads as possible.

Expected behavior To not get a race condition if the locks are used.

Environment Please specify your operating system, kratos branch, python version or any other library versions that you consider relevant. E.g.:

marcnunezc commented 3 years ago

FYI @inigolcanalejo

inigolcanalejo commented 3 years ago

Ah true... I should have thought about that. It is not the first time that I had this problem. As far as I understand if one initializes the variable, the memory gets allocated and the result is not random when accessing its values in parallel. If one doesn't initialize the variable, accessing its value gives random behavior in parallel. @jcotela and @sunethwarna explained it to me a few years back.

philbucher commented 3 years ago

no clue TBH why this is happing. I checked the code several times in detail, it seems fine to me.

As you wrote this should be fine since you lock it properly Also I don't understand why the initialization changes it, this should not matter IMO (and it is designed to work without)

Anyway, the @KratosMultiphysics/technical-committee is aware, thx for reporting

philbucher commented 3 years ago

When did you add this code? It looks quite standard to me, I am pretty sure there are many more places where this is used.

marcnunezc commented 3 years ago

When did you add this code? It looks quite standard to me, I am pretty sure there are many more places where this is used.

This was added in https://github.com/KratosMultiphysics/Kratos/pull/3641 when first creating the element.

RiccardoRossi commented 3 years ago

hi Guys, is this problem solved once you initialize the variables outside of the parallel loop? looking into the discussion we have the impression it is now solved, however if it is not it would be important to create a small mockup test (self contained, no elements etc) which reproduces the same type of behaviour

marcnunezc commented 3 years ago

The problem is completely solved if the variables are initialized outside of the parallel loop.

I left the issue open due to the open question:

Is it expected that a race condition occurs with SetValue in a SetLock/UnsetLock operation? (inside a parallel loop)

If it is expected then we can close this.

RiccardoRossi commented 3 years ago

@marcnunezc you mean to initialize them in a separate parallel loop, correct? We @KratosMultiphysics/technical-committee would give as a feedback that it is always a good practice to do that before as miximing initialization and parallel operations is normally a very bad idea.

In any case we will close as there is no way to have that level of threadsafety without adding a lock before every get

inigolcanalejo commented 3 years ago

@marcnunezc you mean to initialize them in a separate parallel loop, correct?

Yes, for me at least initializing them in a separate previous parallel loop solves the problem.

We @KratosMultiphysics/technical-committee would give as a feedback that it is always a good practice to do that before as miximing initialization and parallel operations is normally a very bad idea.

In any case we will close as there is no way to have that level of threadsafety without adding a lock before every get

:thumbsup: Is this written somewhere in the wiki? If not, it might be a good idea to add it so that future users and developers are aware of this. This gave me a lot of problems in the past, I think know I finally got that the variables need to be initialized before doing parallel operations :sweat_smile:

Thanks @roigcarlo also for his detailed explanation! https://github.com/KratosMultiphysics/Kratos/issues/8924#issuecomment-878903178