Use StackBased Vectors and Matrices in ConstitutiveLaw interface

RiccardoRossi commented 3 years ago

Description In @KratosMultiphysics/technical-committee we are in principle favorable in changing the interface of ConstitutiveLaw to use StackBased vectors (of variable size) for Strain,Stress and F instead of heap allocated vector.

this would imply a API breaker change in the ConstitutiveLawParameters

to be discussed further for "details"

loumalouomega commented 3 years ago

I am in favour. BTW I suggested something similar some time ago using bounding vectors and matrices using a template. Which class is StackBased vector of variable class?

pooyan-dadvand commented 2 years ago

@KratosMultiphysics/technical-committee forwards this to the @KratosMultiphysics/implementation-committee

AlejandroCornejo commented 2 years ago

I'm pushing this issue now. I am working in the assessing-bounded-vector-in-cl branch. My idea is to initially modify the ElasticIsotropic3DLaw and see the performance change for building a case. I do this representing the @KratosMultiphysics/implementation-committee

In the meantime, I kindly ask @RiccardoRossi ,@philbucher and @loumalouomega to discuss a bit which should be the performance test to effectively see if it is worth the change or not.

Thank you!

loumalouomega commented 2 years ago

I'm pushing this issue now. I am working in the assessing-bounded-vector-in-cl branch. My idea is to initially modify the ElasticIsotropic3DLaw and see the performance change for building a case. I do this representing the @KratosMultiphysics/implementation-committee

In the meantime, I kindly ask @RiccardoRossi ,@philbucher and @loumalouomega to discuss a bit which should be the performance test to effectively see if it is worth the change or not.

Thank you!

I am in favor of using the simplest possible, but yet complete. A patch test should be fine. And measure times with vtune (now it's free)

AlejandroCornejo commented 2 years ago

Do you remember any exmaple that I can run? Like a test? I suppose that in c++ is better to asses performance

loumalouomega commented 2 years ago

Do you remember any exmaple that I can run? Like a test? I suppose that in c++ is better to asses performance

I think there is already a c++ patch test, and if not it is relatively easy to create. The point what I said of doing a complete simulation is because during the short time I worked in explicit I discovered many things are called many times, that in the implicit case the penalty is small, but for explicit is a lot, and you can see that only if you do a whole simulation.

Short answer: Python will be fine, as long as we use Vtune to measure times and check differences

RiccardoRossi commented 2 years ago

The proposal i made to Alejandro is to generate a few millinos of simple structural elements, and to call in parallel the function CalculateSystemContributions onto them.

i expect a nonnegligible performance uplift...

loumalouomega commented 2 years ago

The proposal i made to Alejandro is to generate a few millinos of simple structural elements, and to call in parallel the function CalculateSystemContributions onto them.

i expect a nonnegligible performance uplift...

Yes, that will give us a hint. But for a real performance impact study we should do a complete simulation (there all the calls that are supposed to be done will be done, and we will see real performance effect). Again, this is from the perspective I acquired when working in explicit

AlejandroCornejo commented 2 years ago

This is the test:

// KRATOS  ___|  |                   |                   |
//       \___ \  __|  __| |   |  __| __| |   |  __| _` | |
//             | |   |    |   | (    |   |   | |   (   | |
//       _____/ \__|_|   \__,_|\___|\__|\__,_|_|  \__,_|_| MECHANICS
//
//  License:         BSD License
//                   license: structural_mechanics_application/license.txt
//
//  Main authors:    Riccardo Rossi
//

// Project includes
#include "containers/model.h"
#include "testing/testing.h"
#include "structural_mechanics_application_variables.h"
#include "custom_elements/small_displacement.h"
#include "utilities/parallel_utilities.h"
#include "utilities/builtin_timer.h"

namespace Kratos
{
namespace Testing
{

    KRATOS_TEST_CASE_IN_SUITE(TotalLagrangian2D3N, KratosStructuralMechanicsFastSuite)
    {
        Model current_model;
        auto &r_model_part = current_model.CreateModelPart("ModelPart",1);
        r_model_part.GetProcessInfo().SetValue(DOMAIN_SIZE, 2);

        r_model_part.AddNodalSolutionStepVariable(DISPLACEMENT);
        r_model_part.AddNodalSolutionStepVariable(VOLUME_ACCELERATION);

        // Set the element properties
        auto p_elem_prop = r_model_part.CreateNewProperties(0);
        p_elem_prop->SetValue(YOUNG_MODULUS, 2.0e+06);
        p_elem_prop->SetValue(POISSON_RATIO, 0.3);
        p_elem_prop->SetValue(THICKNESS, 0.01);
        const auto &r_clone_cl = KratosComponents<ConstitutiveLaw>::Get("LinearElastic3DLaw");
        p_elem_prop->SetValue(CONSTITUTIVE_LAW, r_clone_cl.Clone());

        // Constants for the computation of the stress
        const double E = p_elem_prop->GetValue(YOUNG_MODULUS);
        const double NU = p_elem_prop->GetValue(POISSON_RATIO);

        // Create the test element
        auto p_node_1 = r_model_part.CreateNewNode(1, 0.0000000000, 0.0100000000, 0.0100000000);
        auto p_node_2 = r_model_part.CreateNewNode(2, 0.0000000000, 0.0000000000, 0.0100000000);
        auto p_node_3 = r_model_part.CreateNewNode(3, 0.0000000000, 0.0100000000, 0.0000000000);
        auto p_node_4 = r_model_part.CreateNewNode(4, 0.0100000000, 0.0100000000, 0.0100000000);
        auto p_node_5 = r_model_part.CreateNewNode(5, 0.0100000000, 0.0000000000, 0.0100000000);
        auto p_node_6 = r_model_part.CreateNewNode(6, 0.0100000000, 0.0100000000, 0.0000000000);
        auto p_node_7 = r_model_part.CreateNewNode(7, 0.0000000000, 0.0000000000, 0.0000000000);
        auto p_node_8 = r_model_part.CreateNewNode(8, 0.0100000000, 0.0000000000, 0.0000000000);

        for (auto& r_node : r_model_part.Nodes()){
            r_node.AddDof(DISPLACEMENT_X);
            r_node.AddDof(DISPLACEMENT_Y);
            r_node.AddDof(DISPLACEMENT_Z);
        }

        std::vector<ModelPart::IndexType> element_nodes {4,1,3,6,5,2,7,8};
        for (int i = 1; i < 1e7; i++) // we create 1M elements
            auto p_element = r_model_part.CreateNewElement("SmallDisplacementElement3D8N", i, element_nodes, p_elem_prop);

        for (auto& r_elem : r_model_part.Elements()){
            r_elem.Initialize(r_model_part.GetProcessInfo());
        }

        struct my_tls {
            Vector mVec;
            Matrix mMat;
        };
        const auto& const_procinfo_ref = r_model_part.GetProcessInfo();

        BuiltinTimer setup_system_time;
        block_for_each(r_model_part.Elements(), my_tls(),  [&const_procinfo_ref](Element& r_elem, my_tls & MyTls) {
            r_elem.CalculateLocalSystem(MyTls.mMat, MyTls.mVec, const_procinfo_ref);
        });
        std::cout << "Build Time: " << setup_system_time.ElapsedSeconds() << std::endl;
    }
}
}

loumalouomega commented 2 years ago

This is the test:

As I said, in this test we don't fully see the influence. It will be nice to also measure differences with a full simulation, considering numerous elements

loumalouomega commented 2 years ago

This is the test:

As I said, in this test we don't fully see the influence. It will be nice to also measure differences with a full simulation, considering numerous elements

BTW, this is using a quite simple CL, maybe in more complex laws the difference is more significant. (5% in a comment you removed)

loumalouomega commented 1 year ago

@KratosMultiphysics/implementation-committee decided that in order to properly track the performance of the changes, we should in first place implement a proper benchmark utility (@matekelemen has some works in this regard). Then @AlejandroCornejo could test the performance changes in the simplest linear CL.

AlejandroCornejo commented 1 year ago

Just to complement, I did the test but with Visual Studio compiler... I should redo in Linux

loumalouomega commented 1 year ago

@KratosMultiphysics/implementation-committee decided that in order to properly track the performance of the changes, we should in first place implement a proper benchmark utility (@matekelemen has some works in this regard). Then @AlejandroCornejo could test the performance changes in the simplest linear CL.

We also spoke about declaring some member variables as static in order to reduce allocation/deallocation time. Not 100% related, but interesting.

AlejandroCornejo commented 1 year ago

I have been working on this today (assessing-bounded-vector-in-cl branch) and I could draw the following conclusions:

Compiling in linux by means of WSL Ubuntu

My test consists in creating in c++ 1M linear hexas with elastic 3D law and call the CalculateLocalSystem and check the build time.

Using Matrix and Vector, the build time is on average 6.4 s. (noalias( rLeftHandSideMatrix ) += IntegrationWeight * prod( trans( rB ), Matrix(prod(rD, rB)));)
Using Bounded Vector and Bounded Matrix (predefined size for this particular case) 2.9 s

With the Bounded I did the multiplications as:

    BoundedMatrix<double, 6, 24> aux;
    BoundedMatrix<double, 24, 6> aux2;
    noalias(aux2) = trans(rB);
    noalias(aux) = prod(IntegrationWeight *rD, rB);
    noalias( rLeftHandSideMatrix ) += prod(aux2, aux);

This means that we can reduce build times in about a 50%!!

AlejandroCornejo commented 1 year ago

I have also seen that the matrix.clear() is faster than noalias(matrix) = ZeroMAtrix(); Can this be?

loumalouomega commented 1 year ago

I have been working on this today (assessing-bounded-vector-in-cl branch) and I could draw the following conclusions:

Compiling in linux by means of WSL Ubuntu

My test consists in creating in c++ 1M linear hexas with elastic 3D law and call the CalculateLocalSystem and check the build time.
* Using `Matrix` and `Vector`, the build time is on average **6.4 s.** `(noalias( rLeftHandSideMatrix ) += IntegrationWeight * prod( trans( rB ), Matrix(prod(rD, rB)));)`

* Using `Bounded Vector` and `Bounded Matrix` (predefined size for this particular case) **2.9 s**
With the Bounded I did the multiplications as:
    BoundedMatrix<double, 6, 24> aux;
    BoundedMatrix<double, 24, 6> aux2;
    noalias(aux2) = trans(rB);
    noalias(aux) = prod(IntegrationWeight *rD, rB);
    noalias( rLeftHandSideMatrix ) += prod(aux2, aux);
This means that we can reduce build times in about a 50%!!

This would be even faster using the MathUtils operation

AlejandroCornejo commented 1 year ago

This was my idea too, reality is that is by far slower! :S

loumalouomega commented 1 year ago

I have also seen that the matrix.clear() is faster than noalias(matrix) = ZeroMAtrix(); Can this be?

Yes, because clear is an internal operation and in the other case you are creating a new matrix and copying values

loumalouomega commented 1 year ago

This was my idea too, reality is that is by far slower! :S

This is strange...

AlejandroCornejo commented 1 year ago

With sympy generation of the code (BtDB) is also slightly slower than the bounded matrix multiplication

loumalouomega commented 1 year ago

With sympy generation of the code (BtDB) is also slightly slower than the bounded matrix multiplication

May be related with some extension optimization of Ublas...

pooyan-dadvand commented 1 year ago

Because in both cases you are not using the vectorization but with prod you may use it depending on size of the matrices

rubenzorrilla commented 1 year ago

@KratosMultiphysics/implementation-committee from the @KratosMultiphysics/technical-committee we would like to know how this is evolving. Thanks.

AlejandroCornejo commented 1 year ago

Well I am mostly in charge of this task within the @KratosMultiphysics/implementation-committee . I draw some conclusions and I would be happy to have a discussion regarding this in a meeting to know if this is something to be done or not.

KratosMultiphysics / Kratos

Use StackBased Vectors and Matrices in ConstitutiveLaw interface #8171