@fverdugo @santiagobadia ... pre-meeting and in-meeting material for our next meeting (I will prepare a pre-class quiz, ;-) )
The main driver program is available here.
Performance concerns
I am a bit concerned about the current JIT compilation times (I have not systematically measured execution times). I have introduced some @time macro calls in the main driver source code for the hotspot lines of code. You can see the result here. The code implementing the conforming RT method for the Darcy problem takes 105 seconds to compile, while the hybridized RT method takes 770 seconds. I would like to discuss and understand the causes of, and possible approaches to reduce, the current compilation times (if any).
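As a minimal baseline for separating compilation time from run time, one can time the same call twice in a fresh session: the first call pays the JIT cost, the second one does not. The snippet below is a generic sketch; `solve_driver` is a hypothetical placeholder for the actual driver entry point, not a function of this repository.

```julia
using Printf

# Rough separation of JIT compilation time from run time: in a fresh Julia
# session, the first call to `f` (for given argument types) includes compilation,
# while the second call measures pure run time.
function time_first_vs_second(f, args...)
  t1 = @elapsed f(args...)   # compile + run
  t2 = @elapsed f(args...)   # run only (already compiled)
  @printf("first call: %.2f s (compile + run), second call: %.2f s (run only)\n", t1, t2)
  return t1 - t2             # crude estimate of the compilation overhead
end

# Hypothetical usage: time_first_vs_second(solve_driver, model, order)
```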
Implementation approach
The weak form of the hybridized RT FE method for the Darcy problem has the following two integrals over the cell boundaries ∂K (with measure d∂K):
∫(mh*(uh⋅n))*d∂K
∫((vh⋅n)*lh)*d∂K
where mh (resp. lh) are the test (resp. trial) functions of the space of Lagrange multipliers, defined on the mesh facets, and uh⋅n (resp. vh⋅n) are the normal components of the trial (resp. test) functions of the non-conforming RT space, defined on the mesh cells. Note that, since the RT space is non-conforming, the Piola map is not required in order to build the pull-back of the physical-space RT functions (this has been confirmed in an experiment with the current code). Indeed, the pull-back of the non-conforming RT shape functions is just the composition of the reference-space function with the inverse of the geometrical map, as with Lagrangian FEs.
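For reference, this is the contrast I am referring to, written out explicitly (a sketch assuming the standard definition of the contravariant Piola map; G_K denotes the geometrical map of cell K, J_K its Jacobian, and û a reference-space shape function):

```latex
% Conforming (H(div)) RT space: contravariant Piola pull-back
u(x) = \frac{1}{\det J_K(\hat x)}\, J_K(\hat x)\, \hat u(\hat x), \qquad x = G_K(\hat x).

% Non-conforming (broken) RT space: plain composition, as for Lagrangian FEs
u(x) = \hat u\bigl(G_K^{-1}(x)\bigr) = (\hat u \circ G_K^{-1})(x).
```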
The current implementation of the above two integral terms makes intensive use of the Gridap BlockMap and ArrayBlock{T,N} data types. The latter is the type of the elements in the range of the former, and represents an array of blocks, all of type T, some of which might be empty.
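To fix ideas, here is a minimal self-contained mock of the kind of blocked container I am referring to; this is an illustrative stand-in, not Gridap's actual ArrayBlock implementation (the type and function names below are invented for the example):

```julia
# Illustrative stand-in for an ArrayBlock-like container: an N-dim array of
# blocks, all of the same type T, where only some blocks are "touched"
# (i.e., non-empty). This mock is NOT Gridap's actual implementation.
struct MockArrayBlock{T,N}
  array::Array{T,N}        # block storage (untouched entries may be empty)
  touched::Array{Bool,N}   # mask flagging which blocks are actually present
end

Base.getindex(b::MockArrayBlock, i...) = b.array[i...]
istouched(b::MockArrayBlock, i...) = b.touched[i...]

# Example: a 1x3 block row for a 3-field system (velocity, pressure, multiplier)
# where only the velocity block is non-empty, as for the vh test functions.
blocks  = Matrix{Float64}[rand(4, 2), zeros(0, 0), zeros(0, 0)]
touched = [true, false, false]
row = MockArrayBlock(reshape(blocks, 1, 3), reshape(touched, 1, 3))
@assert istouched(row, 1, 1) && !istouched(row, 1, 2)
```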
The code ultimately expresses the integrals above as a sum of integrals over the cell boundary facets. Achieving this outcome is a multi-step process. In the sequel, I assume that the integrand has a binary operation at the root of its operation tree (e.g., *, as above). I will refer to the left operand as the "test" operand, and to the right operand as the "trial" operand.
Assuming that the test operand is posed over a D-dim domain, we build it as a cell-wise array of nested arrays of blocks indexed as test[c][f][b] (see here), where c is the global cell identifier, f is the local facet identifier, and b is the block identifier within the blocked layout of the system of PDEs (e.g., velocity->1, pressure->2, Lagrange multiplier->3). Each entry test[c][f][b] is typically either an array of Fields (before evaluation) or a matrix of Numbers (after evaluation). For a given c, we always get a non-empty block for every possible value of f. For a given c and f, there is typically only a single non-empty block. For example, for vh above, test[c][f][b] contains the cell shape functions restricted to the facet with local identifier f.
Assuming that the trial operand is posed over a (D-1)-dim domain, we build it as a cell-wise array of nested arrays of blocks indexed as trial[c][f1][1,b][1,f2] (see here), where c is the global cell identifier, f1 and f2 are local facet identifiers, and b is the block identifier. Note that the operand has an extra level of nested blocks at the end. In this level, only one block is non-empty, namely the one for which the values of f1 and f2 match. This extra level is required in order to have the columns of the matrix correctly laid out (otherwise we would be incorrectly summing up the contributions from all facets into the same set of columns). For example, for lh above, we build trial[c][f][1,b][1,f] as an array of Fields with the transposed trial shape functions restricted to the facet on the boundary of cell c with local facet identifier f.
The rest of the arrays required to evaluate the integral, namely the unit outward normal n, the Jacobian J, and the numerical integration weights w and points x, are also built as cell-wise arrays, but with a single-level array of blocks, i.e., they are indexed as n/J/w/x[c][f] (see, e.g., here).
After evaluating the test and trial operands at x[c][f], and performing a broadcasted operation between them, we end up with a cell-wise array of nested arrays of blocks indexed as test_op_trial[c][f][b,b][1,f]. A schematic of these nested layouts is sketched right below.
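The following toy snippet uses plain nested Julia arrays standing in for the actual lazy blocked arrays; all sizes are arbitrary, and the [1,b][1,f2] nesting of the trial operand is flattened here into a single (b,f2) block matrix for simplicity.

```julia
# Schematic of the nested block layouts (plain arrays standing in for the lazy
# blocked Gridap arrays; all sizes are arbitrary toy values).
ncells, nlfacets, nfields = 2, 4, 3   # cells, local facets per cell, PDE field blocks
nq, ndofs_u, ndofs_l = 2, 8, 2        # quad pts per facet, RT dofs per cell, multiplier dofs per facet

# test[c][f][b]: for vh (field block b=1), the cell RT shape functions restricted
# to local facet f, here already evaluated at the facet quadrature points.
test = [[Any[rand(nq, ndofs_u), nothing, nothing] for f in 1:nlfacets] for c in 1:ncells]

# trial[c][f1][b,f2]: for lh (field block b=3), the transposed facet shape functions;
# the innermost level is non-empty only when f2 == f1. (The [1,b][1,f2] nesting of
# the actual implementation is flattened here into a single (b,f2) block matrix.)
trial = [[[(b == 3 && f2 == f1) ? rand(nq, 1, ndofs_l) : nothing
           for b in 1:nfields, f2 in 1:nlfacets]
          for f1 in 1:nlfacets]
         for c in 1:ncells]

# After broadcasting test (op) trial we get test_op_trial[c][f][b1,b2][1,f]:
# only the (velocity, multiplier) field block is non-empty, and within it only
# the column block corresponding to facet f.
c, f = 1, 2
@assert test[c][f][1] !== nothing       # vh block present on every facet
@assert trial[c][f][3, f] !== nothing   # lh block present only for f2 == f1
@assert all(trial[c][f][3, f2] === nothing for f2 in 1:nlfacets if f2 != f)
```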
The goal of the last step is to convert test_op_trial[c][f][b,b][1,f] into test_op_trial[c][b,b], while combining the integrand evaluated at the quadrature points (i.e., test_op_trial[c][f][b,b][1,f]), the Jacobian evaluated at the quadrature points (J[c][f]), and the quadrature weights (w[c][f]). (Note that integrals over the cells are implemented as cell-wise arrays of blocks indexed as [c][b,b].) In order to do so:
1. For each local facet f, we build a lazy_map with IntegrationMap as the map (see here). The arguments of this lazy_map call are the restrictions of test_op_trial, J and w to local facet f. These are arrays indexed as [c][b,b][1,f], [c], and [c], respectively. ~The resulting array is indexed as [c][b,b][1,f].~
2. We build the lazy sum of the arrays resulting from step 1 (see here). Note that, for a given c and a non-empty [b,b] block, [c][b,b][1,f] is non-empty for all f.
3. We build a lazy_map from the result of step 2 with DensifyInnerMostBlockLevelMap as the map (see here). This transforms an array indexed as [c][b,b][1,f] into an array indexed as [c][b,b].
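Schematically, and with plain dense arrays instead of the lazy blocked ones, the three steps amount to the following; `integrate_facet` mimics what IntegrationMap computes for a single facet contribution, and the final `hcat` mimics DensifyInnerMostBlockLevelMap collapsing the innermost [1,f] block level. Everything below is an illustrative stand-in, not the actual implementation.

```julia
# Illustrative stand-in for the three steps above (plain dense arrays; the
# actual code operates lazily on blocked arrays).

# Step 1: per-facet integration, mimicking IntegrationMap: sum over quadrature
# points of integrand * weight * measure(J).
function integrate_facet(vals::Array{Float64,3}, w::Vector{Float64}, dJ::Vector{Float64})
  nq, m, n = size(vals)
  A = zeros(m, n)
  for q in 1:nq
    A .+= vals[q, :, :] .* (w[q] * dJ[q])
  end
  return A
end

nq, m, n, nlfacets = 2, 8, 2, 4
facet_vals = [rand(nq, m, n) for f in 1:nlfacets]  # integrand of one [b,b] block per facet
w  = [fill(0.5, nq) for f in 1:nlfacets]           # quadrature weights per facet
dJ = [fill(1.0, nq) for f in 1:nlfacets]           # facet measure at quadrature points

# Step 1 result: for each local facet, the (m x n) contribution that sits in
# the (only) non-empty [1,f] inner block.
per_facet = [integrate_facet(facet_vals[f], w[f], dJ[f]) for f in 1:nlfacets]

# Step 2: the lazy sum over facets; since each summand has a single non-empty
# [1,f] inner block, the sum simply collects one block per facet.
# Step 3: DensifyInnerMostBlockLevelMap collapses the [1,f] block level into a
# single dense cell matrix, i.e., a horizontal concatenation of the facet blocks.
cell_block = hcat(per_facet...)
@assert size(cell_block) == (m, nlfacets * n)
```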
Apart from this, the implementation relies on the following types:
CellBoundary. It represents the domain of integration, namely ∪∂K over all cells K. Conceptually, it plays a very similar role to Triangulation. However, my impression is that some of the queries in the API of Triangulation do not properly fit CellBoundary, so, as it is now conceived, I am not sure it can implement that API in a conceptually clean way and thus appear in the myriad of places where a Triangulation can appear.
StaticCondensationMap. Statically condenses the cell matrix and cell vector before assembly of the global Schur complement system. It relies on DensifyInnerMostBlockLevelMap in order to build the block matrices that are required to perform the static condensation.
BackwardStaticCondensationMap. Recovers the interior DOFs locally from the global ones. It relies on ReblockInteriorDofsMap. Note that, after recovering the interior DOFs, these are not blocked; ReblockInteriorDofsMap re-blocks them. (A dense-algebra sketch of both maps is given right after this list.)
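The cell-level algebra behind these two maps is standard Schur-complement elimination of the interior DoFs. The snippet below is a minimal dense sketch of that algebra (plain dense blocks and invented helper names, not the actual blocked implementation):

```julia
using LinearAlgebra

# Cell system blocked into interior (i) and boundary/multiplier (b) DoFs:
#   [A_ii A_ib] [x_i]   [f_i]
#   [A_bi A_bb] [x_b] = [f_b]

# Forward static condensation (what StaticCondensationMap computes per cell):
# the cell Schur complement and the condensed RHS, to be assembled globally.
function condense(A_ii, A_ib, A_bi, A_bb, f_i, f_b)
  Aii_fact = lu(A_ii)
  S = A_bb - A_bi * (Aii_fact \ A_ib)   # cell Schur complement
  g = f_b  - A_bi * (Aii_fact \ f_i)    # condensed cell RHS
  return S, g
end

# Backward static condensation (what BackwardStaticCondensationMap computes per
# cell): recovers the interior DoFs once the global skeleton solution x_b is known.
recover_interior(A_ii, A_ib, f_i, x_b) = A_ii \ (f_i - A_ib * x_b)

# Toy check on a random, diagonally dominant cell matrix
ni, nb = 6, 4
A = rand(ni + nb, ni + nb) + (ni + nb) * I
f = rand(ni + nb)
A_ii, A_ib = A[1:ni, 1:ni], A[1:ni, ni+1:end]
A_bi, A_bb = A[ni+1:end, 1:ni], A[ni+1:end, ni+1:end]
f_i, f_b = f[1:ni], f[ni+1:end]

S, g = condense(A_ii, A_ib, A_bi, A_bb, f_i, f_b)
x_b  = S \ g
x_i  = recover_interior(A_ii, A_ib, f_i, x_b)
@assert isapprox([x_i; x_b], A \ f; rtol=1e-8)
```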
Todo/tothink/todiscuss
The fact that all entries in an ArrayBlock have to be of the same type prevents me from implementing this optimization.
Tree-based Broadcasted(+) versus single-level Broadcasted(+) operation. The latter fails. Why?
Review the current implementation of this function and this function from a performance POV.
Discuss the convenience of having two different functions, get_cell_normal_vector and get_cell_owner_normal_vector. I think the second one is needed to implement the hybrid primal method and HDG methods. Is that right?
Discuss the way I am currently anticipating in StaticCondensationMap the more general case of multiple Lagrange multipliers (i.e., a multi-field statically condensed block). See the type parameters here.
Discuss whether we detect any showstopper with the current implementation approach with regard to fulfilling the requirements in https://github.com/amartinhuertas/ExploringGridapHybridization.jl/issues/2#issuecomment-846736375
Is it reasonable to assume this? (1) here, (2) here, (3) here.
Discuss the need for two types of interfaces for many of the methods of DensifyInnerMostBlockLevelMap, i.e., pre-computed (see, e.g., here) versus inferred block sizes (see, e.g., here).
Definition of the DoFs of the space of Lagrange multipliers: (1) interpolation versus (2) moment-based FE. Currently, I follow (1). Thus, the values of the Dirichlet DoFs are computed via an L2 projection (see here; a short sketch of what I mean by this projection is given right after this list). If we implement (2), we may use the typical workflow based on interpolating the analytical function on the Dirichlet DoFs.
To review with @fverdugo: the construction of the multi-field FE function from cell-wise contributions. This spans from this line till the end. I rushed quite a bit in this part; I am sure it can be done better with the appropriate Gridap tools.
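Regarding the L2 projection mentioned two items above, this is the facet-local problem I have in mind (a sketch; g denotes the Dirichlet datum, F a Dirichlet facet, and M_h(F) the restriction of the multiplier space to F):

```latex
% Facet-local L2 projection defining the Dirichlet DoF values of the multiplier space:
% find \lambda_h \in M_h(F) such that
\int_F \lambda_h\, \mu_h \,\mathrm{d}\Gamma = \int_F g\, \mu_h \,\mathrm{d}\Gamma
\qquad \forall\, \mu_h \in M_h(F),
% i.e., solve the facet mass-matrix system M d = b with
% M_{ij} = \int_F \varphi_j \varphi_i \,\mathrm{d}\Gamma, \qquad
% b_i = \int_F g\, \varphi_i \,\mathrm{d}\Gamma.
```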
High-level API
Is it now the moment to start thinking about how to hide all this complexity behind a high-level API? Or is it better to wait until we have implemented more methods (i.e., HDG, HHO, multiple Lagrange multipliers, etc.)?
I foresee some complications in accommodating all of the above given the way the current high-level API is designed. In particular, AFAIK, the change_domain function is currently called at two different points: (1) when building a new CellField out of two existing CellFields posed on different domains; (2) when a CellField is posed over a domain different from the one the Measure represents. In the integrals above, we need to restrict the CellField operands to CellBoundary before step (2).
Another concern I have is how to fit/leverage MemoArray and the dictionary of LazyArray caches into this workflow. I have largely ignored this so far.
Gridap kernel extensions
I had to extend some functions in Gridap with additional methods. The good news is that, so far, I did not need to modify the behaviour of any of the existing methods in Gridap.
These additional methods are available in this source file. Among them, I think the last two functions can go straight away into Gridap's kernel in a PR. Agreed? (@fverdugo?)
Some comments from the meeting with @amartinhuertas
Check that the lazy array resulting from the evaluation of the test and trial blocked functions (steps 1 and 2) involves only pre-computed numerical values (i.e., polynomials are not evaluated at each physical cell).
Try to keep the type used to represent the result as simple as possible.
Try to extend IntegrationMap where J and w have this block structure J[c][f] (remove restriction to block since it will not be needed) [Done in 40a6f4f4694d98f77fcaf6bf2297037aecab7b6a]
Implement a custom kernel to sum over local faces [Done in 40a6f4f4694d98f77fcaf6bf2297037aecab7b6a]