Hi @amartinhuertas, regarding the cache for VectorBlock: up to now, working with VectorBlocks of CachedArrays has been enough. See, e.g., how the cache for the multiplication of a MatrixBlock times a VectorBlock is implemented.
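For illustration, a minimal sketch of this caching pattern, assuming the Gridap v0.16-era ArrayBlock/CachedArray APIs (the block count and sizes below are made up):

using Gridap.Arrays: CachedArray, setsize!
using Gridap.Fields: ArrayBlock

# A VectorBlock (1-d ArrayBlock) whose entries are CachedArrays, so each
# block can be resized cell by cell without fresh heap allocations.
blocks  = [CachedArray(zeros(4)), CachedArray(zeros(4))]
touched = [true, true]
cache   = ArrayBlock(blocks, touched)

# For a cell whose two blocks have lengths 3 and 5, resize in place and
# work with the underlying storage.
setsize!(cache.array[1], (3,))
setsize!(cache.array[2], (5,))
a1 = cache.array[1].array
a2 = cache.array[2].array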
Congrats on the reduction in JIT compile times!
I still have to systematically evaluate execution times. I will do it soon.
I have written the code required to systematically obtain RT versus RT-H computation times. Results for the 2D Darcy problem are below, obtained on my laptop (I will generate 3D results soon). The computation time of RT was split into 3 stages, and that of RT-H into 4 stages. To learn more about what these stages include, see here for RT and here for RT-H.
Row │ Assembly RT [s] n k d Total RT [s] num_cells l2_err_norm Preassembly RT [s] experiment Solve RT [s]
│ Float64 Int64 Int64 Int64 Float64 Int64 Float64 Float64 String Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.203471 10 0 2 0.207307 100 0.0 0.00183717 solve_darcy_rt 0.00167782
2 │ 0.211317 20 0 2 0.221038 400 0.0 0.00226532 solve_darcy_rt 0.00740818
3 │ 0.221983 30 0 2 0.247982 900 0.0 0.00270947 solve_darcy_rt 0.0232355
4 │ 0.235304 40 0 2 0.281353 1600 0.0 0.00325786 solve_darcy_rt 0.0417356
5 │ 0.252771 50 0 2 0.322815 2500 0.0 0.00392719 solve_darcy_rt 0.0649583
6 │ 0.271242 60 0 2 0.36666 3600 0.0 0.00478003 solve_darcy_rt 0.0906382
7 │ 0.292719 70 0 2 0.422007 4900 0.0 0.0056587 solve_darcy_rt 0.123629
8 │ 0.321242 80 0 2 0.506611 6400 0.0 0.00671869 solve_darcy_rt 0.177498
9 │ 0.361558 90 0 2 0.603206 8100 0.0 0.00806464 solve_darcy_rt 0.23242
10 │ 0.387417 100 0 2 0.695559 10000 0.0 0.00938302 solve_darcy_rt 0.298648
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000567439 10 0.0123137 0.00950768 0.0339472 0.0113444 solve_darcy_rth 2 0 3.39898e-14 100
2 │ 0.00171538 20 0.0185897 0.0230049 0.0558599 0.0122579 solve_darcy_rth 2 0 7.87394e-14 400
3 │ 0.00328242 30 0.0282329 0.0438811 0.089745 0.0134034 solve_darcy_rth 2 0 1.23727e-13 900
4 │ 0.00650433 40 0.121442 0.0675853 0.21266 0.0148262 solve_darcy_rth 2 0 2.32726e-13 1600
5 │ 0.0161684 50 0.14766 0.0992964 0.280115 0.0167884 solve_darcy_rth 2 0 4.20234e-13 2500
6 │ 0.0277386 60 0.177405 0.154259 0.378295 0.0186285 solve_darcy_rth 2 0 1.24761e-12 3600
7 │ 0.0354794 70 0.200983 0.202605 0.460354 0.0211949 solve_darcy_rth 2 0 1.25006e-12 4900
8 │ 0.0523109 80 0.229517 0.258426 0.564463 0.0231257 solve_darcy_rth 2 0 1.37409e-12 6400
9 │ 0.0647161 90 0.273576 0.343874 0.707177 0.0250116 solve_darcy_rth 2 0 3.50109e-12 8100
10 │ 0.106367 100 0.318528 0.413628 0.866839 0.0280575 solve_darcy_rth 2 0 1.27853e-12 10000
Another important check I usually do is using ProfileView.jl to look for type instabilities and spurious allocations.
Yeah! Sure! I just wanted to record the starting point. There are for sure many optimizations that can still be applied; some of them are gathered in the issue description above.
For completeness, 3D results in the tables below. BTW, I have noticed, using htop, that the solve step seems to use 4-way multithreaded parallelism on my laptop (4 CPU cores). Thus, the time of the solve stage is reduced by roughly this factor.
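If one wanted to rule this effect out when comparing stages, a possible sanity check (assuming the threads observed in htop come from OpenBLAS, which is just a guess) is to pin BLAS to a single thread before running the experiments:

using LinearAlgebra

# Force single-threaded BLAS kernels so that the Solve timings are not
# affected by hidden 4-way parallelism.
BLAS.set_num_threads(1)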
Row │ Assembly RT [s] n k d Total RT [s] num_cells l2_err_norm Preassembly RT [s] experiment Solve RT [s]
│ Float64 Int64 Int64 Int64 Float64 Int64 Float64 Float64 String Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.238571 5 0 3 0.252969 125 6.73631e-15 0.00535965 solve_darcy_rt 0.0046532
2 │ 0.289064 10 0 3 0.388017 1000 1.64125e-14 0.0101451 solve_darcy_rt 0.0852543
3 │ 0.411484 15 0 3 0.945631 3375 1.34866e-14 0.0123229 solve_darcy_rt 0.521278
4 │ 0.600865 20 0 3 3.5437 8000 5.29761e-14 0.0170745 solve_darcy_rt 2.92559
5 │ 1.0045 25 0 3 13.7597 15625 5.21513e-14 0.0240333 solve_darcy_rt 10.7651
6 │ 1.52938 30 0 3 42.0436 27000 2.83606e-14 0.0394863 solve_darcy_rt 40.4732
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000952376 5 0.0137373 0.0119081 0.0522225 0.0249174 solve_darcy_rth 3 0 1.70403e-14 125
2 │ 0.0189655 10 0.129788 0.0588131 0.242794 0.0329939 solve_darcy_rth 3 0 3.84128e-14 1000
3 │ 0.140136 15 0.205553 0.214513 0.604726 0.0444707 solve_darcy_rth 3 0 6.41082e-14 3375
4 │ 0.522282 20 0.340954 0.531713 1.46237 0.0674227 solve_darcy_rth 3 0 9.58977e-14 8000
5 │ 1.42999 25 0.670754 1.06884 3.4215 0.112977 solve_darcy_rth 3 0 1.67437e-13 15625
6 │ 5.0095 30 1.03514 2.06824 8.38236 0.187149 solve_darcy_rth 3 0 1.42677e-13 27000
I am a bit puzzled by these results: RT-H is much faster than RT! For 3D problems, the difference is in the Solve stage. But I think that the fact that the pull-back of all physical-space functions in RT-H is just the function in reference space also plays a role. We need to do more profiling to understand the difference (perhaps RT is just slow), but the results are promising!
Another important check I usually do is using ProfileView.jl to look for type instabilities and spurious allocations.
I just started using ProfileView.jl (it seems quite interesting). I am still familiarizing myself with it. Snapshot below and profile data attached (just in case you want to take a look at it when you come back from your holidays, no hurry). Do you know how I can find type instabilities using this tool?
The rule of thumb is: red-ish colors == type instability, orange-ish colors == heap allocations.
Hi @amartinhuertas! By looking at the flame graph, it seems that there is quite some room for cutting down CPU times. Most of the "red" and "orange" functions at the top of the flames are taking a significant portion of the time. The bad news is that fixing this is not always trivial 😢
For reference, this is how the flame graphs of the assembly loop for the Poisson equation look before (v0.15) and after (v0.16) fixing some annoying type instabilities. The goal is to get rid of the "red" color in the inner calls. Some "red" functions in the outer calls are OK and inevitable in some cases.
Hi @amartinhuertas! By looking at the flame graph, it seems that there is quite some room for cutting down CPU times.
Ok, great.
Most of the "red" and "orange" functions at the top of the flames are taking a significant portion of the time.
Are there any references where I can read about the meaning of the colors? The README of https://github.com/timholy/ProfileView.jl mentions red but does not say anything about, e.g., orange.
For reference, this is how the flame graphs of the assembly loop for the Poisson equation look before (v0.15) and after (v0.16) fixing some annoying type instabilities. The goal is to get rid of the "red" color in the inner calls. Some "red" functions in the outer calls are OK and inevitable in some cases.
Ok, I have many red inner bars. Is there a tool to investigate the issue further? (I know @code_warntype, but detecting type instabilities with it is quite cumbersome/time consuming.)
Are there any references where I can read about the meaning of the colors? The README of https://github.com/timholy/ProfileView.jl mentions red but does not say anything about, e.g., orange.
https://timholy.github.io/FlameGraphs.jl/stable/reference/#FlameGraphs.FlameColors
Ok, I have many red inner bars. Is there a tool to investigate the issue further? (I know @code_warntype, but detecting type instabilities with it is quite cumbersome/time consuming.)
This is an art... I try to find a MWE that reproduces one of the (possibly many) inner "red"/"orange" bars, try to fix it, and then move to the next one. To build the MWE, @profview and print_op_tree are your friends. Instead of profiling a loop implemented in Gridap, I usually extract the lazy array, e.g., arr = a(du,dv)[Ω], and profile a custom loop (this further simplifies the flame graph). E.g., @profview myloop!(cache,arr), where
cache = array_cache(arr)

@noinline function myloop!(cache,arr)
  # Iterate the lazy array reusing the cache; accumulating the entry
  # lengths prevents the compiler from optimizing the loop away.
  l = 0
  for i in 1:length(arr)
    ai = getindex!(cache,arr,i)
    l += length(ai)
  end
  l
end
Once we have a really small MWE, you can start using @code_warntype.
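For instance, a minimal sketch of this step, reusing the arr and cache variables from the loop above:

# Inspect the innermost call for type instabilities; `Any`/`Union`-typed
# variables (shown in red in the REPL output) indicate a problem.
@code_warntype getindex!(cache, arr, 1)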
I forgot to mention that I build the MWE by isolating branches of the tree, i.e., arr1 = arr.args[1] to see if the problem is in the first branch of the tree. print_op_tree helps to navigate the tree.
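A small sketch of this branch-isolation workflow, assuming print_op_tree is the utility exported by Gridap.Arrays and that arr is a lazy array with an args field, as above:

print_op_tree(arr)                # inspect the lazy operation tree
arr1 = arr.args[1]                # isolate its first branch
cache1 = array_cache(arr1)
@profview myloop!(cache1, arr1)   # profile just this branch of the tree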
Ok, great. Thanks for the advice. I will look into it whenever I have some time and let you know!
Revamped flame graph below after several optimizations! Note there are still four small red intervals. These are caused by the following member variable having an abstract (non-concrete) type. Do you know if and how this can be avoided?
Use a new type param, e.g., facet_vector::A
Ok, done. It was a bit more involved though, as an additional member variable storing the eltype of A is also needed to specify the parameter type in the inheritance relation. See https://github.com/amartinhuertas/ExploringGridapHybridization.jl/commit/6c883841fb985a39a3260754b40337ad17ca24a3 for more details.
Hi @amartinhuertas, you don't need the extra variable (I would remove it). It is possible to have type parameters not linked to any member variable. There are several options to initialize the "dangling" type parameter, but the easiest is to use an inner constructor:
struct CellBoundaryVectorFromFacetVector{T,G,C,V} <: AbstractVector{Gridap.Fields.VectorBlock{T}}
  glue::G
  cell_wise_facets_ids::C
  facet_vector::V
  # Inner constructor: the "dangling" parameter T is not linked to any
  # member variable; it is deduced from V when the object is built.
  function CellBoundaryVectorFromFacetVector(glue,cell_wise_facets_ids,facet_vector)
    G = typeof(glue)
    C = typeof(cell_wise_facets_ids)
    V = typeof(facet_vector)
    T = eltype(V)
    new{T,G,C,V}(glue,cell_wise_facets_ids,facet_vector)
  end
end
This is the other option, where T is deduced from the constraint V<:AbstractVector{T}:
struct CellBoundaryVectorFromFacetVector{T,G,C,V<:AbstractVector{T}} <: AbstractVector{Gridap.Fields.VectorBlock{T}}
  glue::G
  cell_wise_facets_ids::C
  facet_vector::V
end
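As a self-contained toy illustration of the same trick (not the actual Gridap types): the diagonal constraint V<:AbstractVector{T} lets Julia deduce the free parameter at construction time, with no extra member variable:

struct Wrapper{T,V<:AbstractVector{T}} <: AbstractVector{T}
  data::V
end
Base.size(w::Wrapper) = size(w.data)
Base.getindex(w::Wrapper, i::Int) = w.data[i]

w = Wrapper([1.0, 2.0, 3.0])  # infers T == Float64 and V == Vector{Float64}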
Hi @amartinhuertas, you don't need the extra variable (I would remove it).
Ok, done. Thanks! I did not know that this was legal.
For completeness, updated results after optimization. Only mild performance improvements were achieved.
10×11 DataFrame
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000425657 10 0.00584436 0.00739025 0.0229828 0.00932253 solve_darcy_rth 2 0 3.39898e-14 100
2 │ 0.00158013 20 0.0127182 0.0250612 0.0542051 0.0131142 solve_darcy_rth 2 0 7.87394e-14 400
3 │ 0.00320832 30 0.0231626 0.0454528 0.0867651 0.0140094 solve_darcy_rth 2 0 1.23727e-13 900
4 │ 0.00641192 40 0.0799717 0.066508 0.174519 0.0138469 solve_darcy_rth 2 0 2.32726e-13 1600
5 │ 0.0147358 50 0.138062 0.102409 0.274801 0.0175314 solve_darcy_rth 2 0 4.20234e-13 2500
6 │ 0.0395999 60 0.144887 0.140463 0.348583 0.0204034 solve_darcy_rth 2 0 1.24761e-12 3600
7 │ 0.0313974 70 0.192398 0.201534 0.448329 0.0220114 solve_darcy_rth 2 0 1.25006e-12 4900
8 │ 0.0485266 80 0.214092 0.252164 0.543468 0.0250989 solve_darcy_rth 2 0 1.37409e-12 6400
9 │ 0.0843805 90 0.24775 0.309685 0.669939 0.0279402 solve_darcy_rth 2 0 3.50109e-12 8100
10 │ 0.0922157 100 0.294551 0.397105 0.817754 0.0305282 solve_darcy_rth 2 0 1.27853e-12 10000
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000936804 5 0.00872965 0.0119614 0.0542423 0.0293648 solve_darcy_rth 3 0 1.70403e-14 125
2 │ 0.0184638 10 0.113316 0.0615835 0.229494 0.0354623 solve_darcy_rth 3 0 3.84128e-14 1000
3 │ 0.141921 15 0.199493 0.213998 0.602686 0.044557 solve_darcy_rth 3 0 6.41082e-14 3375
4 │ 0.607383 20 0.33137 0.583792 1.61177 0.0701601 solve_darcy_rth 3 0 9.58977e-14 8000
5 │ 1.70309 25 0.657884 1.28078 3.81652 0.117159 solve_darcy_rth 3 0 1.67437e-13 15625
6 │ 5.25552 30 1.02274 2.17287 8.70974 0.187296 solve_darcy_rth 3 0 1.42677e-13 27000
Hi @fverdugo, @santiagobadia,
I have come up with an optimized implementation of RTH and LDGH following our last meeting (https://github.com/amartinhuertas/ExploringGridapHybridization.jl/issues/3#issuecomment-871246763). During the meeting, we realized that the previous implementation (issues #3 and #4), while functional, was nothing but a reference implementation of these methods built from existing building blocks. It turns out that, in order to optimize these methods, new infrastructure (new data types) has to be added and additional methods in Gridap's kernel have to be overloaded. The good news is that, so far, I did not need to modify the behaviour of any of the methods/types in Gridap; only extensions were required.
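To make the "extension, not modification" point concrete, a hypothetical sketch (ScaleMap is made up, not part of the actual implementation) of how new behaviour is added by defining methods of existing Gridap generic functions for a new type:

import Gridap.Arrays: Map, return_cache, evaluate!

# A new kernel type: Gridap's existing methods/types are left untouched;
# we only add methods of its generic functions for the new type.
struct ScaleMap <: Map
  factor::Float64
end
return_cache(k::ScaleMap, x::AbstractVector) = similar(x)
function evaluate!(cache, k::ScaleMap, x::AbstractVector)
  @. cache = k.factor * x
  cache
end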
Current (JIT compilation) performance
I still have to systematically evaluate execution times. I will do it soon. In the meantime, I have measured Julia 1.6.1 JIT compilation times in GitHub Actions.
This represents a significant improvement with respect to the non-optimized version (770 and 2771 seconds, respectively)!
Brief implementation notes (not complete)
For the record, I implemented the following new structs. BoundaryTriangulation is required.
Still pending issues
After this process, it turns out that some of the issues that were raised in issues #3 and #4 no longer apply to the current implementation. Below I gather those which still apply. We still have to discuss these in a meeting.
Issue #3
Todo/tothink/todiscuss
[NEW] We don't have an array cache of VectorBlock elements in Gridap, right? In such a case, we cannot currently bypass the limitation of the same number of faces per cell. See here. @fverdugo comment: Hi @amartinhuertas, regarding the cache for VectorBlock: up to now, working with VectorBlocks of CachedArrays has been enough. See, e.g., how the cache for the multiplication of a MatrixBlock times a VectorBlock is implemented.
[NEW] Discuss the need of this line and whether it can be bypassed, and how.
The fact that all entries in an ArrayBlock have to be of the same type prevents me from implementing this optimization.
Discuss the convenience of having two different functions, get_cell_normal_vector and get_cell_owner_normal_vector, or whether there exists a more clever approach to achieve the same outcome, as required by the LDG-H method.
Discuss the way I am currently anticipating in StaticCondensationMap the more general case of multiple Lagrange multipliers (i.e., a multifield statically condensed block). See type parameters here.
Discuss whether we detect any showstopper with the current implementation approach in regards to fulfilling the requirements in https://github.com/amartinhuertas/ExploringGridapHybridization.jl/issues/2#issuecomment-846736375
Is it reasonable to assume this? 1. click here. 2. click here. 3. click here.
Discuss the need of two types of interfaces for many of the methods of DensifyInnerMostBlockLevelMap, i.e., pre-computed (see, e.g., here) versus inferred block sizes (see, e.g., here).
Definition of DoFs of the space of Lagrange multipliers: (1) interpolation versus (2) moment-based FE. Currently, following (1); thus, the values of Dirichlet DoFs are computed via L2 projection. See here. If we implement (2), we may use the typical workflow based on interpolating an analytical function on Dirichlet DoFs.
To review with @fverdugo the construction of the multifield FE function from cell-wise contributions. This spans from this line till the end. I rushed quite a bit in this part; I am sure it can be done better with the appropriate Gridap tools.
click here
click here
click here
click here
Issue #4
To be discussed in full, except "Performance concerns" section.