Diagonal computation is too inefficient

When running cases in 3D, ~55%-60% of the time is spent on the diagonal computation, which is even more than the linear solver (32% + 15% for GMG) requires. The reason is probably that we call the local_vmult action per dof on each cell instead of each cell, as usual.

Replacing the MatrixFreeTools implementation (I don't have any hanging nodes) by the simplest possible solution I am aware of boils it down to 45% of the total time, which is still too much much in my opinion

  data->template cell_loop<VectorType, int>(
    [this](const auto &data, auto &dst, const auto &, const auto &cell_range) {
      // Initialize data structures
      const VectorizedArrayType one  = make_vectorized_array<Number>(1.);
      const VectorizedArrayType zero = make_vectorized_array<Number>(0.);
      FECellIntegrator          phi(data);
      AlignedVector<VectorizedArrayType> local_diagonal_vector(
        phi.dofs_cell;
      for (unsigned int cell = cell_range.first; cell < cell_range.second;
           ++cell)
        {
          phi.reinit(cell);
          // Loop over all DoFs and set dof values to zero everywhere but i-th
          // DoF. With this input (instead of read_dof_values()) we do the
          // action and store the result in a diagonal vector
          for (unsigned int i = 0; i < phi.dofs_per_cell; ++i)
              {
                for (unsigned int j = 0; j < phi.dofs_per_cell; ++j)
                  phi.begin_dof_values()[j] = zero;

                phi.begin_dof_values()[i] = one;
                do_operation_on_cell(phi);
                local_diagonal_vector[i] = phi.begin_dof_values()[i];
              }

          // Cannot handle hanging nodes
          for (unsigned int i = 0; i < phi.dofs_per_cell; ++i)
            phi.begin_dof_values()[i] = local_diagonal_vector[i];
          phi.distribute_local_to_global(dst);
        }
    },
    inverse_diagonal_vector,
    dummy);

Maybe I can get rid of the most inner call to do_operation_on_cell, i.e., pull it out of the loop here. This requires some investigations.

davidscn / matrix-free-dealii-precice

Diagonal computation is too inefficient #46