Improve matrix-free performance marginally.

@tjhei This is one for you, pointed out to me by @marcfehling .

In stokes_matrix_free.cc, we apply the block preconditioner in the following way (abridged):

    void
    BlockSchurGMGPreconditioner<StokesMatrixType, ABlockMatrixType, SchurComplementMatrixType,
                                ABlockPreconditionerType, SchurComplementPreconditionerType>::
                                vmult (dealii::LinearAlgebra::distributed::BlockVector<double>       &dst,
                                       const dealii::LinearAlgebra::distributed::BlockVector<double>  &src) const
    {
      dst = 0.0;    // set all components of dst to zero

      // either solve with the Schur complement matrix (if do_solve_Schur_complement==true)
      // or just apply one preconditioner sweep (for the first few
      // iterations of our two-stage outer GMRES iteration)
      if (do_solve_Schur_complement)
        {
             ...
             dst.block(1) = S^{-1} src.block(1);
             ...
        }
      else
        {
          ...same as above, but only one preconditioner application instead of S^{-1}...
        }

      dst.block(1) *= -1.0;

      {
        // the u-block of dst only contains zeros
        stokes_matrix.vmult(utmp, dst); // B^T
        utmp.block(0) *= -1.0;
        utmp.block(0) += src.block(0);
      }

The key piece is that at the beginning of this last block, dst.block(0) is zero and so what ends up in utmp is really just the product of the top right and bottom right block of the Stokes matrix with dst.block(1). The top left and bottom left blocks of the matrix are multiplied by zeros.

The inefficiency now is that the vast majority of the work done in the stokes matrix (or stokes operator) is associated with the top left block, which is unnecessary here. It would be nice to either restrict operation to individual blocks, or to create a separate matrix-free object that only represents the top right and bottom right blocks.

geodynamics / aspect

Improve matrix-free performance marginally. #5570