Hi @amartinhuertas, regarding the cache for VectorBlock: up to now, working with VectorBlocks of CachedArrays has been enough. See, e.g., how the cache for the multiplication of a MatrixBlock times a VectorBlock is implemented.
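For illustration, a minimal sketch of this caching pattern, assuming the Gridap v0.16-era ArrayBlock/CachedArray APIs (the block count and sizes below are made up):

using Gridap.Arrays: CachedArray, setsize!
using Gridap.Fields: ArrayBlock

# A VectorBlock (1-d ArrayBlock) whose entries are CachedArrays, so each
# block can be resized cell by cell without fresh heap allocations.
blocks  = [CachedArray(zeros(4)), CachedArray(zeros(4))]
touched = [true, true]
cache   = ArrayBlock(blocks, touched)

# For a cell whose two blocks have lengths 3 and 5, resize in place and
# work with the underlying storage.
setsize!(cache.array[1], (3,))
setsize!(cache.array[2], (5,))
a1 = cache.array[1].array
a2 = cache.array[2].array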
Congrats on the reduction in JIT compile times!
I still have to systematically evaluate execution times. I will do it soon.
I have written the code required to systematically obtain RT versus RT-H computation times. Results for the 2D Darcy problem are below, obtained on my laptop (I will generate 3D results soon). The computation time of RT was split into 3 stages, and that of RT-H into 4 stages. To learn more about what these stages include, see here for RT and here for RT-H.
Row │ Assembly RT [s] n k d Total RT [s] num_cells l2_err_norm Preassembly RT [s] experiment Solve RT [s]
│ Float64 Int64 Int64 Int64 Float64 Int64 Float64 Float64 String Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.203471 10 0 2 0.207307 100 0.0 0.00183717 solve_darcy_rt 0.00167782
2 │ 0.211317 20 0 2 0.221038 400 0.0 0.00226532 solve_darcy_rt 0.00740818
3 │ 0.221983 30 0 2 0.247982 900 0.0 0.00270947 solve_darcy_rt 0.0232355
4 │ 0.235304 40 0 2 0.281353 1600 0.0 0.00325786 solve_darcy_rt 0.0417356
5 │ 0.252771 50 0 2 0.322815 2500 0.0 0.00392719 solve_darcy_rt 0.0649583
6 │ 0.271242 60 0 2 0.36666 3600 0.0 0.00478003 solve_darcy_rt 0.0906382
7 │ 0.292719 70 0 2 0.422007 4900 0.0 0.0056587 solve_darcy_rt 0.123629
8 │ 0.321242 80 0 2 0.506611 6400 0.0 0.00671869 solve_darcy_rt 0.177498
9 │ 0.361558 90 0 2 0.603206 8100 0.0 0.00806464 solve_darcy_rt 0.23242
10 │ 0.387417 100 0 2 0.695559 10000 0.0 0.00938302 solve_darcy_rt 0.298648
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000567439 10 0.0123137 0.00950768 0.0339472 0.0113444 solve_darcy_rth 2 0 3.39898e-14 100
2 │ 0.00171538 20 0.0185897 0.0230049 0.0558599 0.0122579 solve_darcy_rth 2 0 7.87394e-14 400
3 │ 0.00328242 30 0.0282329 0.0438811 0.089745 0.0134034 solve_darcy_rth 2 0 1.23727e-13 900
4 │ 0.00650433 40 0.121442 0.0675853 0.21266 0.0148262 solve_darcy_rth 2 0 2.32726e-13 1600
5 │ 0.0161684 50 0.14766 0.0992964 0.280115 0.0167884 solve_darcy_rth 2 0 4.20234e-13 2500
6 │ 0.0277386 60 0.177405 0.154259 0.378295 0.0186285 solve_darcy_rth 2 0 1.24761e-12 3600
7 │ 0.0354794 70 0.200983 0.202605 0.460354 0.0211949 solve_darcy_rth 2 0 1.25006e-12 4900
8 │ 0.0523109 80 0.229517 0.258426 0.564463 0.0231257 solve_darcy_rth 2 0 1.37409e-12 6400
9 │ 0.0647161 90 0.273576 0.343874 0.707177 0.0250116 solve_darcy_rth 2 0 3.50109e-12 8100
10 │ 0.106367 100 0.318528 0.413628 0.866839 0.0280575 solve_darcy_rth 2 0 1.27853e-12 10000
Another important check I usually do is using ProfileView.jl to look for type instabilities and spurious allocations.
Yeah! Sure! I just wanted to record the starting point. There are for sure many optimizations that can still be applied; some of them are gathered in the issue description above.
For completeness, 3D results in the tables below. BTW, I have noticed, using htop, that the solve step seems to use 4-way multithreaded parallelism on my laptop (4 CPU cores). Thus, the time of the solve stage is reduced by roughly this factor.
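If one wanted to rule this effect out when comparing stages, a possible sanity check (assuming the threads observed in htop come from OpenBLAS, which is just a guess) is to pin BLAS to a single thread before running the experiments:

using LinearAlgebra

# Force single-threaded BLAS kernels so that the Solve timings are not
# affected by hidden 4-way parallelism.
BLAS.set_num_threads(1)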
Row │ Assembly RT [s] n k d Total RT [s] num_cells l2_err_norm Preassembly RT [s] experiment Solve RT [s]
│ Float64 Int64 Int64 Int64 Float64 Int64 Float64 Float64 String Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.238571 5 0 3 0.252969 125 6.73631e-15 0.00535965 solve_darcy_rt 0.0046532
2 │ 0.289064 10 0 3 0.388017 1000 1.64125e-14 0.0101451 solve_darcy_rt 0.0852543
3 │ 0.411484 15 0 3 0.945631 3375 1.34866e-14 0.0123229 solve_darcy_rt 0.521278
4 │ 0.600865 20 0 3 3.5437 8000 5.29761e-14 0.0170745 solve_darcy_rt 2.92559
5 │ 1.0045 25 0 3 13.7597 15625 5.21513e-14 0.0240333 solve_darcy_rt 10.7651
6 │ 1.52938 30 0 3 42.0436 27000 2.83606e-14 0.0394863 solve_darcy_rt 40.4732
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000952376 5 0.0137373 0.0119081 0.0522225 0.0249174 solve_darcy_rth 3 0 1.70403e-14 125
2 │ 0.0189655 10 0.129788 0.0588131 0.242794 0.0329939 solve_darcy_rth 3 0 3.84128e-14 1000
3 │ 0.140136 15 0.205553 0.214513 0.604726 0.0444707 solve_darcy_rth 3 0 6.41082e-14 3375
4 │ 0.522282 20 0.340954 0.531713 1.46237 0.0674227 solve_darcy_rth 3 0 9.58977e-14 8000
5 │ 1.42999 25 0.670754 1.06884 3.4215 0.112977 solve_darcy_rth 3 0 1.67437e-13 15625
6 │ 5.0095 30 1.03514 2.06824 8.38236 0.187149 solve_darcy_rth 3 0 1.42677e-13 27000
I am a bit puzzled by these results: RT-H is much faster than RT! For 3D problems, the difference is in the Solve stage. But I think that the fact that the pull-back of all physical-space functions in RT-H is just the function in reference space also plays a role. We need to do more profiling to understand the difference (perhaps RT is just slow), but the results are promising!
Another important check I usually do is using ProfileView.jl to look for type instabilities and spurious allocations.
I just started using ProfileView.jl (it seems quite interesting). I am still familiarizing myself with it. Snapshot below and profile data attached (just in case you want to take a look at it when you come back from your holidays, no hurry). Do you know how I can find type instabilities using this tool?
The rule of thumb is: red-ish colors == type instability, orange-ish colors == heap allocations.
Hi @amartinhuertas! By looking at the flame graph, it seems that there is quite some room for cutting down CPU times. Most of the "red" and "orange" functions at the top of the flames are taking a significant portion of the time. The bad news is that fixing this is not always trivial 😢
For reference, this is how the flame graphs of the assembly loop for the Poisson equation look before (v0.15) and after (v0.16) fixing some annoying type instabilities. The goal is to get rid of the "red" color in the inner calls. Some "red" functions in the outer calls are OK and inevitable in some cases.
Hi @amartinhuertas! By looking at the flame graph, it seems that there is quite some room for cutting down CPU times.
Ok, great.
Most of the "red" and "orange" functions at the top of the flames are taking a significant portion of the time.
Are there any references where I can read about the meaning of the colors? The README of https://github.com/timholy/ProfileView.jl mentions red but does not say anything about, e.g., orange.
For reference, this is how the flame graphs of the assembly loop for the Poisson equation look before (v0.15) and after (v0.16) fixing some annoying type instabilities. The goal is to get rid of the "red" color in the inner calls. Some "red" functions in the outer calls are OK and inevitable in some cases.
Ok, I have many red inner bars. Is there a tool to investigate the issue further? (I know @code_warntype, but detecting type instabilities with it is quite cumbersome/time consuming.)
Are there any references where I can read about the meaning of the colors? The README of https://github.com/timholy/ProfileView.jl mentions red but does not say anything about, e.g., orange.
https://timholy.github.io/FlameGraphs.jl/stable/reference/#FlameGraphs.FlameColors
Ok, I have many red inner bars. Is there a tool to investigate the issue further? (I know @code_warntype, but detecting type instabilities with it is quite cumbersome/time consuming.)
This is an art... I try to find a MWE that reproduces one of the (possibly many) inner "red"/"orange" bars, try to fix it, and then move to the next one. To build the MWE, @profview and print_op_tree are your friends. Instead of profiling a loop implemented in Gridap, I usually extract the lazy array, e.g., arr = a(du,dv)[Ω], and profile a custom loop (this further simplifies the flame graph). E.g., @profview myloop!(cache,arr), where
cache = array_cache(arr)

@noinline function myloop!(cache,arr)
  # Iterate the lazy array reusing the cache; accumulating the entry
  # lengths prevents the compiler from optimizing the loop away.
  l = 0
  for i in 1:length(arr)
    ai = getindex!(cache,arr,i)
    l += length(ai)
  end
  l
end
Once we have a really small MWE, you can start using @code_warntype.
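For instance, a minimal sketch of this step, reusing the arr and cache variables from the loop above:

# Inspect the innermost call for type instabilities; `Any`/`Union`-typed
# variables (shown in red in the REPL output) indicate a problem.
@code_warntype getindex!(cache, arr, 1)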
I forgot to mention that I build the MWE by isolating branches of the tree, i.e., arr1 = arr.args[1] to see if the problem is in the first branch of the tree. print_op_tree helps to navigate the tree.
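A small sketch of this branch-isolation workflow, assuming print_op_tree is the utility exported by Gridap.Arrays and that arr is a lazy array with an args field, as above:

print_op_tree(arr)                # inspect the lazy operation tree
arr1 = arr.args[1]                # isolate its first branch
cache1 = array_cache(arr1)
@profview myloop!(cache1, arr1)   # profile just this branch of the tree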
Ok, great. Thanks for the advice. I will look into it whenever I have some time and let you know!
Revamped flame graph below after several optimizations! Note there are still four small red intervals. These are caused by the following member variable having an abstract (non-concrete) type. Do you know if and how this can be avoided?
Use a new type param, e.g., facet_vector::A
Ok, done. It was a bit more involved though, as an additional member variable storing the eltype of A is also needed to specify the parameter type in the inheritance relation. See https://github.com/amartinhuertas/ExploringGridapHybridization.jl/commit/6c883841fb985a39a3260754b40337ad17ca24a3 for more details.
Hi @amartinhuertas, you don't need the extra variable (I would remove it). It is possible to have type parameters not linked to any member variable. There are several options to initialize the "dangling" type parameter, but the easiest is to use an inner constructor:
struct CellBoundaryVectorFromFacetVector{T,G,C,V} <: AbstractVector{Gridap.Fields.VectorBlock{T}}
  glue::G
  cell_wise_facets_ids::C
  facet_vector::V
  # Inner constructor: the "dangling" parameter T is not linked to any
  # member variable; it is deduced from V when the object is built.
  function CellBoundaryVectorFromFacetVector(glue,cell_wise_facets_ids,facet_vector)
    G = typeof(glue)
    C = typeof(cell_wise_facets_ids)
    V = typeof(facet_vector)
    T = eltype(V)
    new{T,G,C,V}(glue,cell_wise_facets_ids,facet_vector)
  end
end
This is the other option, where T is deduced from the constraint V<:AbstractVector{T}:
struct CellBoundaryVectorFromFacetVector{T,G,C,V<:AbstractVector{T}} <: AbstractVector{Gridap.Fields.VectorBlock{T}}
  glue::G
  cell_wise_facets_ids::C
  facet_vector::V
end
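As a self-contained toy illustration of the same trick (not the actual Gridap types): the diagonal constraint V<:AbstractVector{T} lets Julia deduce the free parameter at construction time, with no extra member variable:

struct Wrapper{T,V<:AbstractVector{T}} <: AbstractVector{T}
  data::V
end
Base.size(w::Wrapper) = size(w.data)
Base.getindex(w::Wrapper, i::Int) = w.data[i]

w = Wrapper([1.0, 2.0, 3.0])  # infers T == Float64 and V == Vector{Float64}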
Hi @amartinhuertas, you don't need the extra variable (I would remove it).
Ok, done. Thanks! I did not know that this was legal.
For completeness, updated results after optimization. Only mild performance improvements were achieved.
10×11 DataFrame
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000425657 10 0.00584436 0.00739025 0.0229828 0.00932253 solve_darcy_rth 2 0 3.39898e-14 100
2 │ 0.00158013 20 0.0127182 0.0250612 0.0542051 0.0131142 solve_darcy_rth 2 0 7.87394e-14 400
3 │ 0.00320832 30 0.0231626 0.0454528 0.0867651 0.0140094 solve_darcy_rth 2 0 1.23727e-13 900
4 │ 0.00641192 40 0.0799717 0.066508 0.174519 0.0138469 solve_darcy_rth 2 0 2.32726e-13 1600
5 │ 0.0147358 50 0.138062 0.102409 0.274801 0.0175314 solve_darcy_rth 2 0 4.20234e-13 2500
6 │ 0.0395999 60 0.144887 0.140463 0.348583 0.0204034 solve_darcy_rth 2 0 1.24761e-12 3600
7 │ 0.0313974 70 0.192398 0.201534 0.448329 0.0220114 solve_darcy_rth 2 0 1.25006e-12 4900
8 │ 0.0485266 80 0.214092 0.252164 0.543468 0.0250989 solve_darcy_rth 2 0 1.37409e-12 6400
9 │ 0.0843805 90 0.24775 0.309685 0.669939 0.0279402 solve_darcy_rth 2 0 3.50109e-12 8100
10 │ 0.0922157 100 0.294551 0.397105 0.817754 0.0305282 solve_darcy_rth 2 0 1.27853e-12 10000
Row │ Trace solve RTH [s] n Back substitution RTH [s] Trace assembly RTH [s] Total RTH [s] Preassembly RTH [s] experiment d k l2_err_norm num_cells
│ Float64 Int64 Float64 Float64 Float64 Float64 String Int64 Int64 Float64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.000936804 5 0.00872965 0.0119614 0.0542423 0.0293648 solve_darcy_rth 3 0 1.70403e-14 125
2 │ 0.0184638 10 0.113316 0.0615835 0.229494 0.0354623 solve_darcy_rth 3 0 3.84128e-14 1000
3 │ 0.141921 15 0.199493 0.213998 0.602686 0.044557 solve_darcy_rth 3 0 6.41082e-14 3375
4 │ 0.607383 20 0.33137 0.583792 1.61177 0.0701601 solve_darcy_rth 3 0 9.58977e-14 8000
5 │ 1.70309 25 0.657884 1.28078 3.81652 0.117159 solve_darcy_rth 3 0 1.67437e-13 15625
6 │ 5.25552 30 1.02274 2.17287 8.70974 0.187296 solve_darcy_rth 3 0 1.42677e-13 27000
Hi @fverdugo, @santiagobadia,
I have come up with an optimized implementation of RTH and LDGH following our last meeting (https://github.com/amartinhuertas/ExploringGridapHybridization.jl/issues/3#issuecomment-871246763). During the meeting, we realized that the previous implementation (issues #3 and #4), while functional, was nothing but a reference implementation of these methods built from existing building blocks. It turns out that, in order to optimize these methods, new infrastructure (new data types) has to be added and additional methods in Gridap's kernel have to be overloaded. The good news is that, so far, I did not need to modify the behaviour of any of the methods/types in Gridap; only extensions were required.
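To make the "extension, not modification" point concrete, a hypothetical sketch (ScaleMap is made up, not part of the actual implementation) of how new behaviour is added by defining methods of existing Gridap generic functions for a new type:

import Gridap.Arrays: Map, return_cache, evaluate!

# A new kernel type: Gridap's existing methods/types are left untouched;
# we only add methods of its generic functions for the new type.
struct ScaleMap <: Map
  factor::Float64
end
return_cache(k::ScaleMap, x::AbstractVector) = similar(x)
function evaluate!(cache, k::ScaleMap, x::AbstractVector)
  @. cache = k.factor * x
  cache
end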
Current (JIT compilation) performance
I still have to systematically evaluate execution times. I will do it soon. In the meantime, I have measured Julia 1.6.1 JIT compilation times in GitHub Actions.
This represents a significant improvement with respect to the non-optimized version (770 and 2771 seconds, respectively)!
Brief implementation notes (not complete)
For the record, I implemented the following new structs. BoundaryTriangulation is required.
Still pending issues
After this process, it turns out that some of the issues that were raised in issues #3 and #4 no longer apply to the current implementation. Below I gather those which still apply. We still have to discuss these in a meeting.
Issue #3
Todo/tothink/todiscuss
[NEW] We don't have an array cache of VectorBlock elements in Gridap, right? In such a case, we cannot currently bypass the limitation of the same number of faces per cell. See here. @fverdugo comment: Hi @amartinhuertas, regarding the cache for VectorBlock: up to now, working with VectorBlocks of CachedArrays has been enough. See, e.g., how the cache for the multiplication of a MatrixBlock times a VectorBlock is implemented.
[NEW] Discuss the need of this line and whether it can be bypassed, and how.
The fact that all entries in an ArrayBlock have to be of the same type prevents me from implementing this optimization.
Discuss the convenience of having two different functions, get_cell_normal_vector and get_cell_owner_normal_vector, or whether there exists a more clever approach to achieve the same outcome, as required by the LDG-H method.
Discuss the way I am currently anticipating in StaticCondensationMap the more general case of multiple Lagrange multipliers (i.e., a multifield statically condensed block). See type parameters here.
Discuss whether we detect any showstopper with the current implementation approach in regards to fulfilling the requirements in https://github.com/amartinhuertas/ExploringGridapHybridization.jl/issues/2#issuecomment-846736375
Is it reasonable to assume this? 1. click here. 2. click here. 3. click here.
Discuss the need of two types of interfaces for many of the methods of DensifyInnerMostBlockLevelMap, i.e., pre-computed (see, e.g., here) versus inferred block sizes (see, e.g., here).
Definition of DoFs of the space of Lagrange multipliers: (1) interpolation versus (2) moment-based FE. Currently, following (1); thus, the values of Dirichlet DoFs are computed via L2 projection. See here. If we implement (2), we may use the typical workflow based on interpolating an analytical function on Dirichlet DoFs.
To review with @fverdugo the construction of the multifield FE function from cell-wise contributions. This spans from this line till the end. I rushed quite a bit in this part; I am sure it can be done better with the appropriate Gridap tools.
click here
click here
click here
click here
Issue #4
To be discussed in full, except "Performance concerns" section.