@fverdugo @santiagobadia ... pre-meeting and in-meeting material for our next meeting (I will prepare a pre-class quiz, ;-) )
The main driver program is available here.
Performance concerns
I am a bit concerned about the current JIT compilation times (I have not systematically measured execution times). I have introduced some @time macro calls in the main driver source code for the hotspot lines of code. You can see the result here. The code implementing the conforming RT method for the Darcy problem takes 105 seconds to compile, while the hybridized RT method takes 770 seconds. I would like to discuss and understand the causes of, and possible approaches to reduce, the current compilation times (if any).
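As a minimal baseline for separating compilation time from run time, one can time the same call twice in a fresh session: the first call pays the JIT cost, the second one does not. The snippet below is a generic sketch; `solve_driver` is a hypothetical placeholder for the actual driver entry point, not a function of this repository.

```julia
using Printf

# Rough separation of JIT compilation time from run time: in a fresh Julia
# session, the first call to `f` (for given argument types) includes compilation,
# while the second call measures pure run time.
function time_first_vs_second(f, args...)
  t1 = @elapsed f(args...)   # compile + run
  t2 = @elapsed f(args...)   # run only (already compiled)
  @printf("first call: %.2f s (compile + run), second call: %.2f s (run only)\n", t1, t2)
  return t1 - t2             # crude estimate of the compilation overhead
end

# Hypothetical usage: time_first_vs_second(solve_driver, model, order)
```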
Implementation approach
The weak form of the hybridized RT FE method for the Darcy problem has the following two integrals over the cell boundaries ∂K (with measure d∂K):
∫(mh*(uh⋅n))*d∂K
∫((vh⋅n)*lh)*d∂K
where mh (resp. lh) are the test (resp. trial) functions of the space of Lagrange multipliers, defined on the mesh facets, and uh⋅n (resp. vh⋅n) are the normal components of the trial (resp. test) functions of the non-conforming RT space, defined on the mesh cells. Note that, since the RT space is non-conforming, the Piola map is not required in order to build the pull-back of the physical-space RT functions (this has been confirmed in an experiment with the current code). Indeed, the pull-back of the non-conforming RT shape functions is just the composition of the reference-space function with the inverse of the geometrical map, as with Lagrangian FEs.
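For reference, this is the contrast I am referring to, written out explicitly (a sketch assuming the standard definition of the contravariant Piola map; G_K denotes the geometrical map of cell K, J_K its Jacobian, and û a reference-space shape function):

```latex
% Conforming (H(div)) RT space: contravariant Piola pull-back
u(x) = \frac{1}{\det J_K(\hat x)}\, J_K(\hat x)\, \hat u(\hat x), \qquad x = G_K(\hat x).

% Non-conforming (broken) RT space: plain composition, as for Lagrangian FEs
u(x) = \hat u\bigl(G_K^{-1}(x)\bigr) = (\hat u \circ G_K^{-1})(x).
```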
The current implementation of the above two integral terms makes intensive use of the Gridap BlockMap and ArrayBlock{T,N} data types. The latter is the type of the elements in the range of the former, and represents an array of blocks, all of type T, some of which might be empty.
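To fix ideas, here is a minimal self-contained mock of the kind of blocked container I am referring to; this is an illustrative stand-in, not Gridap's actual ArrayBlock implementation (the type and function names below are invented for the example):

```julia
# Illustrative stand-in for an ArrayBlock-like container: an N-dim array of
# blocks, all of the same type T, where only some blocks are "touched"
# (i.e., non-empty). This mock is NOT Gridap's actual implementation.
struct MockArrayBlock{T,N}
  array::Array{T,N}        # block storage (untouched entries may be empty)
  touched::Array{Bool,N}   # mask flagging which blocks are actually present
end

Base.getindex(b::MockArrayBlock, i...) = b.array[i...]
istouched(b::MockArrayBlock, i...) = b.touched[i...]

# Example: a 1x3 block row for a 3-field system (velocity, pressure, multiplier)
# where only the velocity block is non-empty, as for the vh test functions.
blocks  = Matrix{Float64}[rand(4, 2), zeros(0, 0), zeros(0, 0)]
touched = [true, false, false]
row = MockArrayBlock(reshape(blocks, 1, 3), reshape(touched, 1, 3))
@assert istouched(row, 1, 1) && !istouched(row, 1, 2)
```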
The code ultimately expresses the integrals above as a sum of integrals over the cell boundary facets. Achieving this outcome is a multi-step process. In the sequel, I assume that the integrand has a binary operation at the root of its operation tree (e.g., *, as above). I will refer to the left operand as the "test" operand, and to the right operand as the "trial" operand.
Assuming that the test operand is posed over a D-dim domain, we build it as a cell-wise array of nested arrays of blocks indexed as test[c][f][b] (see here), where c is the global cell identifier, f is the local facet identifier, and b is the block identifier within the blocked layout of the system of PDEs (e.g., velocity->1, pressure->2, Lagrange multiplier->3). Each entry test[c][f][b] is typically either an array of Fields (before evaluation) or a matrix of Numbers (after evaluation). For a given c, we always get a non-empty block for every possible value of f. For a given c and f, there is typically only a single non-empty block. For example, for vh above, test[c][f][b] contains the cell shape functions restricted to the facet with local identifier f.
Assuming that the trial operand is posed over a (D-1)-dim domain, we build it as a cell-wise array of nested arrays of blocks indexed as trial[c][f1][1,b][1,f2] (see here), where c is the global cell identifier, f1 and f2 are local facet identifiers, and b is the block identifier. Note that the operand has an extra level of nested blocks at the end. In this level, only one block is non-empty, namely the one for which the values of f1 and f2 match. This extra level is required in order to have the columns of the matrix correctly laid out (otherwise we would be incorrectly summing up the contributions from all facets into the same set of columns). For example, for lh above, we build trial[c][f][1,b][1,f] as an array of Fields with the transposed trial shape functions restricted to the facet on the boundary of cell c with local facet identifier f.
The rest of the arrays required to evaluate the integral, namely the unit outward normal n, the Jacobian J, and the numerical integration weights w and points x, are also built as cell-wise arrays, but with a single-level array of blocks, i.e., they are indexed as n/J/w/x[c][f] (see, e.g., here).
After evaluating the test and trial operands at x[c][f], and performing a broadcasted operation between them, we end up with a cell-wise array of nested arrays of blocks indexed as test_op_trial[c][f][b,b][1,f]. A schematic of these nested layouts is sketched right below.
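The following toy snippet uses plain nested Julia arrays standing in for the actual lazy blocked arrays; all sizes are arbitrary, and the [1,b][1,f2] nesting of the trial operand is flattened here into a single (b,f2) block matrix for simplicity.

```julia
# Schematic of the nested block layouts (plain arrays standing in for the lazy
# blocked Gridap arrays; all sizes are arbitrary toy values).
ncells, nlfacets, nfields = 2, 4, 3   # cells, local facets per cell, PDE field blocks
nq, ndofs_u, ndofs_l = 2, 8, 2        # quad pts per facet, RT dofs per cell, multiplier dofs per facet

# test[c][f][b]: for vh (field block b=1), the cell RT shape functions restricted
# to local facet f, here already evaluated at the facet quadrature points.
test = [[Any[rand(nq, ndofs_u), nothing, nothing] for f in 1:nlfacets] for c in 1:ncells]

# trial[c][f1][b,f2]: for lh (field block b=3), the transposed facet shape functions;
# the innermost level is non-empty only when f2 == f1. (The [1,b][1,f2] nesting of
# the actual implementation is flattened here into a single (b,f2) block matrix.)
trial = [[[(b == 3 && f2 == f1) ? rand(nq, 1, ndofs_l) : nothing
           for b in 1:nfields, f2 in 1:nlfacets]
          for f1 in 1:nlfacets]
         for c in 1:ncells]

# After broadcasting test (op) trial we get test_op_trial[c][f][b1,b2][1,f]:
# only the (velocity, multiplier) field block is non-empty, and within it only
# the column block corresponding to facet f.
c, f = 1, 2
@assert test[c][f][1] !== nothing       # vh block present on every facet
@assert trial[c][f][3, f] !== nothing   # lh block present only for f2 == f1
@assert all(trial[c][f][3, f2] === nothing for f2 in 1:nlfacets if f2 != f)
```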
The goal of the last step is to convert test_op_trial[c][f][b,b][1,f] into test_op_trial[c][b,b], while combining the integrand evaluated at the quadrature points (i.e., test_op_trial[c][f][b,b][1,f]), the Jacobian evaluated at the quadrature points (J[c][f]), and the quadrature weights (w[c][f]). (Note that integrals over the cells are implemented as cell-wise arrays of blocks indexed as [c][b,b].) In order to do so:
1. For each local facet f, we build a lazy_map with IntegrationMap as the map (see here). The arguments of this lazy_map call are the restrictions of test_op_trial, J and w to local facet f. These are arrays indexed as [c][b,b][1,f], [c], and [c], respectively. ~The resulting array is indexed as [c][b,b][1,f].~
2. We build the lazy sum of the arrays resulting from step 1 (see here). Note that, for a given c and a non-empty [b,b] block, [c][b,b][1,f] is non-empty for all f.
3. We build a lazy_map from the result of step 2 with DensifyInnerMostBlockLevelMap as the map (see here). This transforms an array indexed as [c][b,b][1,f] into an array indexed as [c][b,b].
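Schematically, and with plain dense arrays instead of the lazy blocked ones, the three steps amount to the following; `integrate_facet` mimics what IntegrationMap computes for a single facet contribution, and the final `hcat` mimics DensifyInnerMostBlockLevelMap collapsing the innermost [1,f] block level. Everything below is an illustrative stand-in, not the actual implementation.

```julia
# Illustrative stand-in for the three steps above (plain dense arrays; the
# actual code operates lazily on blocked arrays).

# Step 1: per-facet integration, mimicking IntegrationMap: sum over quadrature
# points of integrand * weight * measure(J).
function integrate_facet(vals::Array{Float64,3}, w::Vector{Float64}, dJ::Vector{Float64})
  nq, m, n = size(vals)
  A = zeros(m, n)
  for q in 1:nq
    A .+= vals[q, :, :] .* (w[q] * dJ[q])
  end
  return A
end

nq, m, n, nlfacets = 2, 8, 2, 4
facet_vals = [rand(nq, m, n) for f in 1:nlfacets]  # integrand of one [b,b] block per facet
w  = [fill(0.5, nq) for f in 1:nlfacets]           # quadrature weights per facet
dJ = [fill(1.0, nq) for f in 1:nlfacets]           # facet measure at quadrature points

# Step 1 result: for each local facet, the (m x n) contribution that sits in
# the (only) non-empty [1,f] inner block.
per_facet = [integrate_facet(facet_vals[f], w[f], dJ[f]) for f in 1:nlfacets]

# Step 2: the lazy sum over facets; since each summand has a single non-empty
# [1,f] inner block, the sum simply collects one block per facet.
# Step 3: DensifyInnerMostBlockLevelMap collapses the [1,f] block level into a
# single dense cell matrix, i.e., a horizontal concatenation of the facet blocks.
cell_block = hcat(per_facet...)
@assert size(cell_block) == (m, nlfacets * n)
```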
Apart from this, the implementation relies on the following types:
CellBoundary. It represents the domain of integration, namely ∪∂K over all cells K. Conceptually, it plays a very similar role to Triangulation. However, my impression is that some of the queries in the API of Triangulation do not properly fit CellBoundary, so, as it is now conceived, I am not sure it can implement that API in a conceptually clean way and thus appear in the myriad of places where a Triangulation can appear.
StaticCondensationMap. Statically condenses the cell matrix and cell vector before assembly of the global Schur complement system. It relies on DensifyInnerMostBlockLevelMap in order to build the block matrices that are required to perform the static condensation.
BackwardStaticCondensationMap. Recovers the interior DOFs locally from the global ones. It relies on ReblockInteriorDofsMap. Note that, after recovering the interior DOFs, these are not blocked; ReblockInteriorDofsMap re-blocks them. (A dense-algebra sketch of both maps is given right after this list.)
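The cell-level algebra behind these two maps is standard Schur-complement elimination of the interior DoFs. The snippet below is a minimal dense sketch of that algebra (plain dense blocks and invented helper names, not the actual blocked implementation):

```julia
using LinearAlgebra

# Cell system blocked into interior (i) and boundary/multiplier (b) DoFs:
#   [A_ii A_ib] [x_i]   [f_i]
#   [A_bi A_bb] [x_b] = [f_b]

# Forward static condensation (what StaticCondensationMap computes per cell):
# the cell Schur complement and the condensed RHS, to be assembled globally.
function condense(A_ii, A_ib, A_bi, A_bb, f_i, f_b)
  Aii_fact = lu(A_ii)
  S = A_bb - A_bi * (Aii_fact \ A_ib)   # cell Schur complement
  g = f_b  - A_bi * (Aii_fact \ f_i)    # condensed cell RHS
  return S, g
end

# Backward static condensation (what BackwardStaticCondensationMap computes per
# cell): recovers the interior DoFs once the global skeleton solution x_b is known.
recover_interior(A_ii, A_ib, f_i, x_b) = A_ii \ (f_i - A_ib * x_b)

# Toy check on a random, diagonally dominant cell matrix
ni, nb = 6, 4
A = rand(ni + nb, ni + nb) + (ni + nb) * I
f = rand(ni + nb)
A_ii, A_ib = A[1:ni, 1:ni], A[1:ni, ni+1:end]
A_bi, A_bb = A[ni+1:end, 1:ni], A[ni+1:end, ni+1:end]
f_i, f_b = f[1:ni], f[ni+1:end]

S, g = condense(A_ii, A_ib, A_bi, A_bb, f_i, f_b)
x_b  = S \ g
x_i  = recover_interior(A_ii, A_ib, f_i, x_b)
@assert isapprox([x_i; x_b], A \ f; rtol=1e-8)
```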
Todo/tothink/todiscuss
The fact that all entries in an ArrayBlock have to be of the same type prevents me from implementing this optimization.
Tree-based Broadcasted(+) versus single-level Broadcasted(+) operation. The latter fails. Why?
Review the current implementation of this function and this function from a performance POV.
Discuss the convenience of having two different functions, get_cell_normal_vector and get_cell_owner_normal_vector. I think the second one is needed to implement the hybrid primal method and HDG methods. Is that right?
Discuss the way I am currently anticipating in StaticCondensationMap the more general case of multiple Lagrange multipliers (i.e., a multi-field statically condensed block). See the type parameters here.
Discuss whether we detect any showstopper with the current implementation approach with regard to fulfilling the requirements in https://github.com/amartinhuertas/ExploringGridapHybridization.jl/issues/2#issuecomment-846736375
Is it reasonable to assume this? (1) here, (2) here, (3) here.
Discuss the need for two types of interfaces for many of the methods of DensifyInnerMostBlockLevelMap, i.e., pre-computed (see, e.g., here) versus inferred block sizes (see, e.g., here).
Definition of the DoFs of the space of Lagrange multipliers: (1) interpolation versus (2) moment-based FE. Currently, I follow (1). Thus, the values of the Dirichlet DoFs are computed via an L2 projection (see here; a short sketch of what I mean by this projection is given right after this list). If we implement (2), we may use the typical workflow based on interpolating the analytical function on the Dirichlet DoFs.
To review with @fverdugo: the construction of the multi-field FE function from cell-wise contributions. This spans from this line till the end. I rushed quite a bit in this part; I am sure it can be done better with the appropriate Gridap tools.
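Regarding the L2 projection mentioned two items above, this is the facet-local problem I have in mind (a sketch; g denotes the Dirichlet datum, F a Dirichlet facet, and M_h(F) the restriction of the multiplier space to F):

```latex
% Facet-local L2 projection defining the Dirichlet DoF values of the multiplier space:
% find \lambda_h \in M_h(F) such that
\int_F \lambda_h\, \mu_h \,\mathrm{d}\Gamma = \int_F g\, \mu_h \,\mathrm{d}\Gamma
\qquad \forall\, \mu_h \in M_h(F),
% i.e., solve the facet mass-matrix system M d = b with
% M_{ij} = \int_F \varphi_j \varphi_i \,\mathrm{d}\Gamma, \qquad
% b_i = \int_F g\, \varphi_i \,\mathrm{d}\Gamma.
```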
High-level API
Is it now the moment to start thinking about how to hide all this complexity behind a high-level API? Or is it better to wait until we have implemented more methods (i.e., HDG, HHO, multiple Lagrange multipliers, etc.)?
I foresee some complications in accommodating all of the above given the way the current high-level API is designed. In particular, AFAIK, the change_domain function is currently called at two different points: (1) when building a new CellField out of two existing CellFields posed on different domains; (2) when a CellField is posed over a domain different from the one the Measure represents. In the integrals above, we need to restrict the CellField operands to CellBoundary before step (2).
Another concern I have is how to fit/leverage MemoArray and the dictionary of LazyArray caches into this workflow. I have largely ignored this so far.
Gridap kernel extensions
I had to extend some functions in Gridap with additional methods. The good news is that, so far, I did not need to modify the behaviour of any of the existing methods in Gridap.
These additional methods are available in this source file. Among them, I think the last two functions can go straight away into Gridap's kernel in a PR. Agreed? (@fverdugo?)
Some comments from the meeting with @amartinhuertas
Check that the lazy array resulting from the evaluation of the test and trial blocked functions (steps 1 and 2) involves only pre-computed numerical values (i.e., polynomials are not evaluated at each physical cell).
Try to keep the type used to represent the result as simple as possible.
Try to extend IntegrationMap where J and w have this block structure J[c][f] (remove restriction to block since it will not be needed) [Done in 40a6f4f4694d98f77fcaf6bf2297037aecab7b6a]
Implement a custom kernel to sum over local faces [Done in 40a6f4f4694d98f77fcaf6bf2297037aecab7b6a]