When running cases in 3D, ~55%-60% of the time is spent on the diagonal computation, which is even more than the linear solver (32% + 15% for GMG) requires. The reason is probably that we call the local_vmult action per dof on each cell instead of each cell, as usual.
Replacing the MatrixFreeTools implementation (I don't have any hanging nodes) by the simplest possible solution I am aware of boils it down to 45% of the total time, which is still too much much in my opinion
data->template cell_loop<VectorType, int>(
[this](const auto &data, auto &dst, const auto &, const auto &cell_range) {
// Initialize data structures
const VectorizedArrayType one = make_vectorized_array<Number>(1.);
const VectorizedArrayType zero = make_vectorized_array<Number>(0.);
FECellIntegrator phi(data);
AlignedVector<VectorizedArrayType> local_diagonal_vector(
phi.dofs_cell;
for (unsigned int cell = cell_range.first; cell < cell_range.second;
++cell)
{
phi.reinit(cell);
// Loop over all DoFs and set dof values to zero everywhere but i-th
// DoF. With this input (instead of read_dof_values()) we do the
// action and store the result in a diagonal vector
for (unsigned int i = 0; i < phi.dofs_per_cell; ++i)
{
for (unsigned int j = 0; j < phi.dofs_per_cell; ++j)
phi.begin_dof_values()[j] = zero;
phi.begin_dof_values()[i] = one;
do_operation_on_cell(phi);
local_diagonal_vector[i] = phi.begin_dof_values()[i];
}
// Cannot handle hanging nodes
for (unsigned int i = 0; i < phi.dofs_per_cell; ++i)
phi.begin_dof_values()[i] = local_diagonal_vector[i];
phi.distribute_local_to_global(dst);
}
},
inverse_diagonal_vector,
dummy);
Maybe I can get rid of the most inner call to do_operation_on_cell, i.e., pull it out of the loop here. This requires some investigations.
When running cases in 3D,
~55%-60%
of the time is spent on the diagonal computation, which is even more than the linear solver (32%
+15%
for GMG) requires. The reason is probably that we call thelocal_vmult
action per dof on each cell instead of each cell, as usual.Replacing the
MatrixFreeTools
implementation (I don't have any hanging nodes) by the simplest possible solution I am aware of boils it down to45%
of the total time, which is still too much much in my opinionMaybe I can get rid of the most inner call to
do_operation_on_cell
, i.e., pull it out of the loop here. This requires some investigations.