KratosMultiphysics / Kratos

Kratos Multiphysics (A.K.A Kratos) is a framework for building parallel multi-disciplinary simulation software. Modularity, extensibility and HPC are the main objectives. Kratos has BSD license and is written in C++ with extensive Python interface.
https://kratosmultiphysics.github.io/Kratos/

[ShapeOpt] Gradient calculation much slower in Windows compared to Linux #3082

Closed dbaumgaertner closed 5 years ago

dbaumgaertner commented 6 years ago

ToDo:

philbucher commented 6 years ago

100th issue :tada:

armingeiser commented 6 years ago

I did some tests where I simply use the same operations we perform in the gradient calculation, without computing a real stiffness matrix. Windows is indeed a little slower.

#############################################
#############################################
#############################################
# UBUNTU
#############################################
#############################################
#############################################

# test_RHS_operations | n 1000000 | size 20
# 1.37613e-263
# Time 0.7232062816619873

# test_RHS_operations_expensive | n 1000000 | size 20
# 2.57576e+07
# Time 0.36026930809020996

# test_RHS_operations | n 1000000 | size 20
# 2.57576e+07
# Time 0.08381533622741699

# test_RHS_operations_expensive | n 1000000 | size 20
# 2.57576e+07
# Time 0.37224411964416504

# test_RHS_inside_loop | n 1000000 | size 20
# 1e+06
# Time 0.015364885330200195

# test_RHS_outside_loop | n 1000000 | size 20
# 1e+06
# Time 0.005693912506103516

#############################################
#############################################
#############################################

# test_RHS_operations | n 1000000 | size 100
# -nan
# Time 0.4674644470214844

# test_RHS_operations_expensive | n 1000000 | size 100
# -nan
# Time 8.594015836715698

# test_RHS_operations | n 1000000 | size 100
# -nan
# Time 0.39067959785461426

# test_RHS_operations_expensive | n 1000000 | size 100
# -nan
# Time 8.604081869125366

# test_RHS_inside_loop | n 1000000 | size 100
# 1e+06
# Time 0.016034603118896484

# test_RHS_outside_loop | n 1000000 | size 100
# 1e+06
# Time 0.01217198371887207

#############################################
#############################################
#############################################
# WINDOWS
#############################################
#############################################
#############################################

# test_RHS_operations | n 1000000 | size 20
# -2.31527e+263
# Time 0.19014501571655273

# test_RHS_operations_expensive | n 1000000 | size 20
# -2.37987e+263
# Time 0.6268339157104492

# test_RHS_operations | n 1000000 | size 20
# 5.5777e+261
# Time 0.1262662410736084

# test_RHS_operations_expensive | n 1000000 | size 20
# 0
# Time 0.5714337825775146

# test_RHS_inside_loop | n 1000000 | size 20
# 1e+06
# Time 0.0663764476776123

# test_RHS_outside_loop | n 1000000 | size 20
# 1e+06
# Time 0.012975215911865234

#############################################
#############################################
#############################################

# test_RHS_operations | n 1000000 | size 100
# -nan
# Time 0.6218540668487549

# test_RHS_operations_expensive | n 1000000 | size 100
# 9.79798e+07
# Time 11.166205167770386

# test_RHS_operations | n 1000000 | size 100
# 4.89899e+07
# Time 0.451171875

# test_RHS_operations_expensive | n 1000000 | size 100
# 9.79798e+07
# Time 11.172195196151733

# test_RHS_inside_loop | n 1000000 | size 100
# 1e+06
# Time 0.06638073921203613

# test_RHS_outside_loop | n 1000000 | size 100
# 1e+06
# Time 0.02345442771911621
dbaumgaertner commented 5 years ago

Thanks for that detailed study! I figured out that the recent performance drop has nothing to do with Windows or Linux. It comes from the new implementation of the gradient calculation using ElementFiniteDifferenceUtility::CalculateRightHandSideDerivative.

Reverting StrainEnergyResponseFunctionUtility::CalculateStateDerivativePartByFiniteDifferencing to our original implementation leads to a drastic performance improvement. See the following code and note the commented-out / exchanged lines in the first for loop over the nodes.

void CalculateStateDerivativePartByFiniteDifferencing()
{
KRATOS_TRY;

// Working variables
ProcessInfo &CurrentProcessInfo = mrModelPart.GetProcessInfo();

// Computation of: \frac{1}{2} u^T \cdot ( - \frac{\partial K}{\partial x} )
for (auto& elem_i : mrModelPart.Elements())
{
    Vector u;
    Vector lambda;
    Vector RHS;

    // Get state solution
    elem_i.GetValuesVector(u,0);

    // Get adjoint variables (Corresponds to 1/2*u)
    lambda = 0.5*u;

    // Semi-analytic computation of partial derivative of state equation w.r.t. node coordinates
    elem_i.CalculateRightHandSide(RHS, CurrentProcessInfo);
    for (auto& node_i : elem_i.GetGeometry())
    {
        // array_3d gradient_contribution(3, 0.0);
        // Vector derived_RHS = Vector(0);

        // // x-direction
        // ElementFiniteDifferenceUtility::CalculateRightHandSideDerivative(elem_i, SHAPE_X, node_i, mDelta, derived_RHS, CurrentProcessInfo);
        // gradient_contribution[0] = inner_prod(lambda, derived_RHS);

        // // y-direction
        // ElementFiniteDifferenceUtility::CalculateRightHandSideDerivative(elem_i, SHAPE_Y, node_i, mDelta, derived_RHS, CurrentProcessInfo);
        // gradient_contribution[1] = inner_prod(lambda, derived_RHS);

        // // z-direction
        // ElementFiniteDifferenceUtility::CalculateRightHandSideDerivative(elem_i, SHAPE_Z, node_i, mDelta, derived_RHS, CurrentProcessInfo);
        // gradient_contribution[2] = inner_prod(lambda, derived_RHS);

        // // Assemble sensitivity to node
        // noalias(node_i.FastGetSolutionStepValue(SHAPE_SENSITIVITY)) += gradient_contribution;

        array_3d gradient_contribution(3, 0.0);
        Vector perturbed_RHS = Vector(0);

        // Perturbation, gradient analysis and recovery of x
        node_i.X0() += mDelta;
        elem_i.CalculateRightHandSide(perturbed_RHS, CurrentProcessInfo);
        gradient_contribution[0] = inner_prod(lambda, (perturbed_RHS - RHS) / mDelta);
        node_i.X0() -= mDelta;

        // Reset perturbed vector
        perturbed_RHS = Vector(0);

        // Perturbation, gradient analysis and recovery of y
        node_i.Y0() += mDelta;
        elem_i.CalculateRightHandSide(perturbed_RHS, CurrentProcessInfo);
        gradient_contribution[1] = inner_prod(lambda, (perturbed_RHS - RHS) / mDelta);
        node_i.Y0() -= mDelta;

        // Reset perturbed vector
        perturbed_RHS = Vector(0);

        // Perturbation, gradient analysis and recovery of z
        node_i.Z0() += mDelta;
        elem_i.CalculateRightHandSide(perturbed_RHS, CurrentProcessInfo);
        gradient_contribution[2] = inner_prod(lambda, (perturbed_RHS - RHS) / mDelta);
        node_i.Z0() -= mDelta;

        // Assemble sensitivity to node
        noalias(node_i.FastGetSolutionStepValue(SHAPE_SENSITIVITY)) += gradient_contribution;
    }
}

// Computation of \frac{1}{2} u^T \cdot ( \frac{\partial f_{ext}}{\partial x} )
for (auto& cond_i : mrModelPart.Conditions())
{
  ....
}

KRATOS_CATCH("");

}

@armingeiser Since you implemented this new approach, can you please have a look into this? We certainly cannot sacrifice performance here for a better code structure. Nevertheless, I hope there is a way to fix this while maintaining your new structuring.

dbaumgaertner commented 5 years ago

Another thing I noticed: whenever CalculateRHS or similar functions are called, the base_shell_element class internally computes both the stiffness matrix and the RHS. The corresponding flags are internally set to true, regardless of which specific function is called.

Changing that as follows leads again to 25% performance gain in the gradient calculation:

void BaseShellElement::CalculateLeftHandSide(MatrixType& rLeftHandSideMatrix,
                                       ProcessInfo& rCurrentProcessInfo)
{
    // Calculation flags
    const bool calculate_stiffness_matrix_flag = true;
    const bool calculate_residual_vector_flag = false; // TODO check if this can be false => see solids

     Vector dummy;
     CalculateAll(rLeftHandSideMatrix, dummy, rCurrentProcessInfo,
     calculate_stiffness_matrix_flag, calculate_residual_vector_flag);
}

void BaseShellElement::CalculateRightHandSide(VectorType& rRightHandSideVector,
    ProcessInfo& rCurrentProcessInfo)
{
    // Calculation flags
    const bool calculate_stiffness_matrix_flag = false; // TODO check if this can be false => see solids
    const bool calculate_residual_vector_flag = true;

    Matrix dummy;
    CalculateAll(dummy, rRightHandSideVector, rCurrentProcessInfo,
    calculate_stiffness_matrix_flag, calculate_residual_vector_flag);
}

I don't know whether this has other implications, but I think this inconsistent behavior should really be changed, because in the current master both functions above do the same as "CalculateLocalSystem".

philbucher commented 5 years ago

@dbaumgaertner I see the problem with the shells; there is even a TODO in master :) I just haven't had the time yet to test it properly.

Could you try for your case if the results are different or if there are any other differences?

armingeiser commented 5 years ago

@dbaumgaertner wow, that's interesting, thanks for figuring it out. Do you have numbers for the first issue with the ElementFiniteDifferenceUtility, i.e. how much slower it is? I will have a closer look soon.

dbaumgaertner commented 5 years ago

@philbucher Well, all I can say is that all optimization tests pass, so the gradients I compute are the same. But I can't say whether there is a difference elsewhere. I think it makes sense for whoever programmed the shell to quickly check it; for them it should be easy to tell.

@armingeiser Running the shell example (test example 02) for one optimization step currently takes around 4.5 s. Changing the implementation back to the old one without ElementFiniteDifferenceUtility reduces it to 2 s. Using the correct flag settings as specified above reduces it further to 1.5 s. And using the Intel solver instead of SuperLU cuts it to 1.3 s. And that is just a small example.

armingeiser commented 5 years ago

@dbaumgaertner thanks, now I know what to look for.

armingeiser commented 5 years ago

  1. In the ElementFiniteDifferenceUtility we do the differentiation for each component independently; each differentiation requires a call to CalculateRightHandSide for both the unperturbed and the perturbed node. This leads to 6 calls per shape derivative.
  2. In the current master there is a TODO next to an additional final call of CalculateRightHandSide, which does not seem to be needed and increases the number of function calls to 9.
  3. In the old version we called CalculateRightHandSide only once for the unperturbed system and once for each coordinate direction: 4 calls per node.

End of the story:

  1. I need to find a way to make the ElementFiniteDifferenceUtility more efficient by reducing the number of calls to CalculateRightHandSide.
  2. It would be beneficial if the dummy LHS did not need to be calculated for the shells, but that is not something I can do.

philbucher commented 5 years ago

> But I can't say whether there is a difference elsewhere. I think it makes sense for whoever programmed the shell to quickly check it.

Thing is that they are no longer around. I'll add it to my list of TODOs for cleaning the shell ...

dbaumgaertner commented 5 years ago

@armingeiser This is just a reminder to make sure that unnecessary Calculate calls are also avoided for the other response functions (eigenfrequency)!

armingeiser commented 5 years ago

None of the currently available responses call "Calculate" more often than necessary.