Improve MGSmootherPrecondition and related multigrid operations

kronbichler commented 8 years ago

As indicated in #2328 I would like to re-design the multigrid algorithms in case we implement smoothing via preconditioners (rather than smooth operations that you can do for the relaxation case). One thing that I would like to get rid of is GrowingVectorMemory inside MGSmootherPrecondition (see #2533 for an example where this can be problematic). The design that we 'allocate' a new vector through the pool for every level smooth makes us end up with throwing away and re-allocating the communication structure in parallel::distributed::Vector in every smoothing (well, it gets shared but there is still considerable setup work necessary). A side effect was that I ran into a bug of the persistent MPI communicators in parallel::distributed::Vector in a certain MPI implementation of my local supercomputer through this design.

Secondly, MGSmootherPrecondition tries to skip some matrix-vector products in case the rhs vector is all zero (which is the start setting a V-cycle). While this optimization pays off usually, on very large computations the VectorType::all_zero() call that does an Allreduce can actually be more expensive than the mat-vec. It depends on the MPI network and the cost of the mat-vec; I found it in my profiles first on >100k MPI procs. Rather than do one or the other, we could have both if we transfer the knowledge of when a vector is zero to the place where we call the underlying preconditioner of MGSmootherPrecondition.

When thinking about the temporary vectors in the multigrid algorithm, one further realizes that one can re-use the temporary vectors in the V-cycle and the smoothing operations, which I find a nice side effect. Using only 4 temporary vectors rather than 6 is a welcome side effect in my opinion.

Before posting code (I have an initial implementation), I want to discuss the algorithm on a more general level and what kind of implementation we want to have. Here is the implementation of a v cycle as I would like to have it:

    if (level==minlevel)
      {
        (*coarse)(level, solution[level], defect[level]);
        return;
      }

    (*pre_smooth)[level].vmult(solution[level], defect[level]);
    (*matrix)[level].vmult_interface_down(t[level], solution[level]);
    t[level].sadd(-1.0, 1.0, defect[level]);

    // transfer to next level
    transfer->restrict_and_add(level, defect[level-1], t[level]);

    v_cycle(level-1);

    transfer->prolongate(level, t[level], solution[level-1]);
    solution[level] += t[level];
    // smooth on the negative part of the residual
    defect[level] *= -1.0;
    (*matrix)[level].vmult_add_interface_up(defect[level], solution[level]);
    (*post_smooth)[level].vmult(t[level], defect [level]);
    solution[level] -= t[level];

(With suitable extensions for multiple steps, W-cycles, and F-cycles.)

This algorithm differs in three ways from the current implementation in multigrid.templates.h (the level_v_step part):

I pass the smoothers in as 'preconditioner' objects with a vmult operation rather than smooth. This has the advantage that I can explicitly make use of the fact that I want solution[level] to be zero at this stage (I both avoid the implicit zero operation in reinit() and the check for zero later).
I assume operations vmult_interface_down that merges the usual residual operation necessary after the pre-smoothing with the part of the interface matrix that we now do in separate functions. I find this more intuitive because it really represents a mat-vec on an extended matrix. With matrix-free evaluation, it is also faster because with the current library code I need to go through two matrix-vector products rather than one. Even the matrix-based evaluation would profit the same way since we do not build separate (lean) sparsity structures for the interface matrix but rather full sparsity patterns that are later only filled with interface information. On the other hand, we would need a slightly different setup of those matrices. For backward compatibility, we could provide functions that wrap the old structure into the vmult_interface_down and vmult_add_interface_up calls.
I implement the multigrid as a full approximation storage scheme where I transfer the whole residual at once. As we discussed elsewhere, the old implementation did a transfer to all coarser level on every level, which means that a v-cylce does O(n_levels) transfers to the coarse grid rather than just one. Since the coarse grid transfer is global communication (you can think of it as MPI_Reduce conceptually, except that not the MPI library controls the way you recursively sum up the individual values but the grid hierarchy), you don't want to do that too often at this stage of the algorithm. With the old implementation, I could not go below approx 0.02-0.03 s per v-cycle on 150k procs. The new implementations allows me to reach 0.01s on similar problems. More generally, I can improve performance by 5-50%. The smaller the local problem size, the better the speedup. Put differently, the new algorithm scales weakly reasonably well with 40-50k dofs per core when going from 288 cores to 147k cores (more than 60% parallel efficiency). The parallel efficiency of the same test dropped to less than 40% on the old code and only reach 60% at 150k dofs per core.

The above algorithm also requires changes in the transfer algorithms because we need to be able to disable the initial transfer of the residuals to the coarser levels. I think of adding a method enable_fas_restriction that sets a flag in the copy_to_mg methods of the transfer. We could default it to false for the old multigrid solver class and to true in the new one.

@guidokanschat @tjhei @masterleinad you have experience with MG algorithms. Any thoughts (or wishes for further changes)?

tjhei commented 8 years ago

Let me make a few general comments:

We should not be afraid to change the MG classes. They have not seen much use over the last year and they might be used quite a lot as soon as people figure out how awesome the scalability and even single node performance is. Therefore, I do not worry much about backwards-compatibility here and prefer a more performant and nicer interface.
This is complicated stuff and I do not have the whole picture in my head. It would be great to get everybody involved into the same room (at least virtually).
You have by far the best knowledge about performance/scalability in this area so I trust you in your suggestions.

vmult_interface_down [...].

How is this done right now?

On the other hand, we would need a slightly different setup of those matrices.

The current way to construct the interface matrices is non-intuitive anyways, so :+1: if we change that.

kronbichler commented 8 years ago

vmult_interface_down [...].

How is this done right now?

You compute the local part of the residual through Smoother::smooth() and the remainder comes in via the interface matrix. In continuous FEM, it represents the matrix-vector product on a particular level that puts Dirichlet conditions on the rows of all degrees of freedom, including interior dofs, except the ones that are on the interface between the levels. Combined with a matrix that has Dirichlet conditions on the interface, the combination of the two is a matrix-vector product where the columns have Dirichlet conditions both on the other boundary and the interface between different levels, and the rows have only Dirichlet conditions on the other boundary. (All Dirichlet conditions being homogeneous since this is smoothing.)

This is what I expect the vmult_interface_down to do in one step. In matrix-free evaluation, I implement this by manipulating the entries in input and output vectors to the desired value.

bangerth commented 8 years ago

This is complicated stuff and I do not have the whole picture in my head. It

would be great to get everybody involved into the same room (at least virtually).

This may simply be the two of you these days, plus potentially @guidokanschat. My knowledge of the MG parts is essentially zero. But you're both excellent programmers and software designers, so come up with what you think is right!

davydden commented 8 years ago

I implement the multigrid as a full approximation storage scheme

don't you need an extra vector for FAS, namely the defect correction \tau = A^2h I^{2h}_h v^h - I^{2h}_h A^h v^h ? Maybe to avoid confusion in names, in the algorithm outlined above change defect to residual?

davydden commented 8 years ago

Speaking about rewriting, i think some classes can be removed/depricated, for example inside the current level_v_step we use mg::Matrix<vector_t>, whereas for assembly outside the level matrices are stored in MGLevelObject<matrix_t>. I suppose one could completely avoid using the former and stick with (*matrix)[level].vmult_interface_down(t[level], solution[level]);.

kronbichler commented 8 years ago

@davydden Your probably right that in the context I try to implement it's rather residuals than defects - it's been a while since I last time checked the appropriate naming for multigrid algorithms ;-) but I will when I get back to my office.

I guess a bit of the notational problem is that at entry, the vector defect holds the initial right hand side or source vector. After pre-smoothing, the variable t holds the residual between input (rhs) and the solution. After the coarse grid correction comes in, defect holds the residual and t corrects the solution. Since the semantics change a bit (but I don't want to spend more vectors than necessary), it's probably not wrong to have less descriptive names and go for generic tmp1, tmp2, tmp3 (I will need that if multiple cycles are requested). But let me do my homework first and check the common notation.

Regarding mg::Matrix: I think the intention is that this data structure tries to avoid templates. I agree that we would probably not need it since the algorithms themselves are quite compact and so having them in the header file is not too bad. (We need templates on the vectors anyway.)

kronbichler commented 8 years ago

OK, I checked the notational details in the multigrid book by Trottenberg et al. Firstly, the scheme outline above is not a full approximation scheme because we deal only with corrections and not the full solution on the coarser level. (In other words, the algorithm still assumes linearity in the operator; FAS needs indeed to transfer both the current solution and the residual to the next coarser level.)

The way my proposed algorithm differs from the multigrid algorithm implemented by Guido is just how the defect is transfered. I choose to transfer the defect at once (which is also the algorithm stated in Trottenberg for the basic multigrid cycle) whereas the current MG algorithm transfers the right hand sides or parts of the right hand sides as soon as they are available (I guess this goes back to the notation by Bramble as there are some indications in the in-code comments; I have not checked Bramble's work closely, though).

Now that I have my notation fixed (thanks @davydden for the critical comment), this algorithm is a usual correction algorithm. My group (or someone in @guidokanschat's group) will at some point also implement a FAS algorithm for nonlinear problems but that requires a slightly different setup. Anyway, I try to make the design such that it is intuitive to extend in that direction.

This is complicated stuff and I do not have the whole picture in my head. It would be great to get everybody involved into the same room (at least virtually).

@tjhei and @guidokanschat I guess we should have a hangout at some point. There is no hurry, so I would do it either next week or the week after that, depending on @guidokanschat's schedule.

davydden commented 8 years ago

@kronbichler when you will be discussing, may I suggest to keep in mind multigrid applied to multi-vectors (say a Trilinos multivector)? That's something i would like to look at in not too distant future after the MG classes are updated.

davydden commented 8 years ago

@kronbichler more things which can be improved: MGCoarseGridBase is probably not necessary. As long as there is an object with vmult, it should be ok. I would actually suggest to keep in mind LinearOperator as it depricated IterativeInverse. Same applies to matrices at each GMG level where in some situations it is good to consider something like Kx=Ax+cBx (with c being a constant). Of course smoothers would have to work with diagonal only in this case. But otherwise the rest would be exactly the same.

davydden commented 8 years ago

for example currently one can't use on the coarse level Jacobi preconditioner from TrilinosWrappers in MGCoarseGridLACIteration due to the reliance on pointer_matrix.h. No way to do AMG preconditioner on the coarse level either, i suppose.

davydden commented 8 years ago

another usage scenario is when the coarse grid solver actually does not solve anything but merely applies a couple of steps of a preconditioner, e.g. Jacobi. This is the case when GMG is applied as a preconditioner to improve steepest descent directions.

kronbichler commented 8 years ago

Given that @masterleinad has joined us in re-working MG algorithm, we should definitely strive for getting this done. I will try to reshape my algorithm outlined above to a form close to the current version (and check that DG with adaptive meshes can be done in my new infrastructure) when I'm back from my holiday trips in two weeks.

My algorithm outlined above has some specialized routines for preconditioner-based routines which save a matrix-vector product or a all_zero() as compared to MGSmootherPrecondition on each smoothing application, which is definitely good with respect to large computations. It also does the restriction differently.

The two steps that are currently missing in my re-coding is how to deal with relaxation-based smoothers which need a different design and the interface to matrices and preconditioners (or relaxation). I really like the idea of going through an abstract class like mg::Matrix because one can could use different preconditioner types on different levels, whereas my first approach has a template on the outer preconditioner class and is thus more limited in flexibility (virtual functions are indeed much better than templates at that stage).

bangerth commented 8 years ago

Did you think of maybe using the LinearOperator framework to encapsulate matrices (instead of, or maybe in combination to, mg::Matrix)?

davydden commented 8 years ago

Did you think of maybe using the LinearOperator framework to encapsulate matrices (instead of, or maybe in combination to, mg::Matrix)

i was thinking along the same lines above https://github.com/dealii/dealii/issues/2534#issuecomment-216890476

appiazzolla commented 8 years ago

@kronbichler Sorry to bother but your comment in #3092 got me curious:

I would definitely keep base classes with virtual inheritance at least on the smoother side

Isn't that preventing inlining?

kronbichler commented 8 years ago

@appiazzolla Yes, it is preventing inlining. But at the level of a smoother application I cannot imagine a situation where it would matter at the slightest. Even on the coarsest levels smoothing requires more than a few thousands of CPU cycles. When going to big parallel computations, inlining and virtual function calls maybe cost you 2-5 ns. But network latency (or synchronization cost in case you are in shared memory) is 1-50 us, a factor of 500-10000 more.

Thus, I prefer flexibility and readability at those levels.

appiazzolla commented 8 years ago

@kronbichler Thanks for your answer! Some comments:

But at the level of a smoother application I cannot imagine a situation where it would matter at the slightest

What if someone wants, for instance, to implement a very light Jacobi-like preconditioner smoothing each cell by calling a cell-smoother on all cells instead of smoothing all the cells in a single smoother call. If one uses a low degree finite element, the operations performed by the smoother would be minimal and without inlining and loop unrolling some overhead wouldn't be impossible, right?

One could, of course, impose an architecture to such a user that the smoother should not be called repeatedly but isn't that actually harming flexibility?

When going to big parallel computations, inlining and virtual function calls maybe cost you 2-5 ns. But network latency (or synchronization cost in case you are in shared memory) is 1-50 us, a factor of 500-10000 more.

Agreed. But big parallel computations are not the only application of multigrid, in many cases it improves performance in single core applications and can provide robustness with respect to parameters.

Thus, I prefer flexibility and readability at those levels.

I would be curious about why using templates for smoothers harms flexibility and readability so seriously. I tend to think that templates provide more flexibility in fact and by using typedefs appropriately readability should not be harmed so badly.

Thanks! This thread is quite interesting!

kronbichler commented 8 years ago

I would be curious about why using templates for smoothers harms flexibility and readability so seriously. I tend to think that templates provide more flexibility in fact and by using typedefs appropriately readability should not be harmed so badly.

I was thinking about the case that you use different smoother algorithms on different multigrid levels. If the central multigrid algorithm is templated on the smoother type, you are out of luck (at least as long as you do not hardcode one smoother for fine level and another one for intermediate levels). I must admit that I never used those with the usual h-multigrid, but I've seen people using different smoothers. On the other hand, when working with p-multigrid, I do combine very different smoothers regularly. But I write the p-coarsening part (where you go from high-order bases down to linear or constant functions) out explicitly in my codes.

kronbichler commented 8 years ago

What if someone wants, for instance, to implement a very light Jacobi-like preconditioner smoothing each cell by calling a cell-smoother on all cells instead of smoothing all the cells in a single smoother call. If one uses a low degree finite element, the operations performed by the smoother would be minimal and without inlining and loop unrolling some overhead wouldn't be impossible, right?

You have a point here, but you would use completely different multigrid algorithms for such approaches. The ones we expect to feed are indeed assuming that you smooth the whole mesh (of a respective level).

appiazzolla commented 8 years ago

I was thinking about the case that you use different smoother algorithms on different multigrid levels.

Couldn't you use a CRTP idiom in such a case?

You have a point here, but you would use completely different multigrid algorithms for such approaches.

I have to believe you on this because I don't see why.

Thanks!

kronbichler commented 8 years ago

You have a point here, but you would use completely different multigrid algorithms for such approaches.

I have to believe you on this because I don't see why.

Check in include/deal.II/multigrid/multigrid.templates.h, method level_v_step(level), where we call pre_smooth->smooth() and post_smooth->smooth(). This assumes that you operate on the whole mesh, at least when you combine it with the transfer->restrict_and_add and transfer->prolongate that are based on the whole mesh.

Of course, you could implement what you say inside a loop within smoother->smooth(), but deal.II does not currently provide an abstraction for that. We do have cell-based (matrix-free) algorithms, but when going to multigrid we assume to get the collective smooth operation that represents one or several matrix-vector products (and maybe other stuff).

Couldn't you use a CRTP idiom in such a case?

That's a nice concept, but it's a bit more involved for a user to implement. If necessary, sure, one could do that, but it's not on my agenda at least. (Note that deal.II is already quite aggressive with inlining already when you compare it to other big C++ projects in numerical analysis - apart from the specialized things for one or a few building blocks like boost or eigen of course.)

appiazzolla commented 8 years ago

Check in include/deal.II/multigrid/multigrid.templates.h [...]

I'm sorry! I thought the idea was to improve Multigrid<VectorType> as well! You are absolutely right then.

That's a nice concept, but it's a bit more involved for a user to implement. [...]

Looks quite simple to me but OK. I suggested it because AFAIK, projects like DUNE and FLENS use it extensively.

Thanks!

kronbichler commented 8 years ago

I'm sorry! I thought the idea was to improve Multigrid<VectorType> as well!

We do try to improve that, but given the way we construct the transfer and the global approach assumed there, I think it is quite a far stretch to think about those operations. But I definitely welcome your comments once I've come to the point of providing new code and not just writing about my "ideas".

kronbichler commented 8 years ago

Given the different opinions expressed in #2948, I think we should discuss the design of the Multigrid class and get some agreement. What are the alternatives for constructing the Multigrid class:

The current design has a template only on the vector type to be used inside the multigrid cycle. This means we need to store base classes to the respective operators for the matrix-vector product, the coarse grid solver, the transfer, and pre- and postsmoother. This is done through the four classes MGMatrixBase, MGCoarseGridBase, MGTransferBase, and MGSmootherBase, which the user needs to construct explicitly.
In order to avoid base classes, we re-design the class completely and make all those arguments of the multigrid class templates. However, having to put five (or six) template parameters on the class, some of which might contain several template parameters on their own, will give compile errors that are easily ten lines long. This used to be one of my favorites when starting this issue but thinking about it this seems a bit awkward.
We could add a templated constructor that takes templates on all those arguments. This way, we could both accommodate for getting an MGLevelObject or the pointers directly, depending on how the user structures the code. Still, in case some arguments are the wrong format, the 'experience' with this variant will be as bad as with variant 2.
Assume C++11 and go with MGLevelObject<LinearOperator<VectorType> > for both the matrix and the smoother objects and LinearOperator<VectorType> as suggested by @davydden and later @bangerth . Without having looked to deep into it and lacking experience of e.g. how error messages look like in case something goes wrong, I guess this is still much better than 2 and 3.

The nice thing about 2 and 3 is that the average user will not have to bother with mg::Matrix and skip a few (seemingly useless) declarations. This saves a few lines of code and avoids going through another round of objects. But templates obscure the API and the requirements on the classes you hand in. In case something goes wrong, a user has to look deep inside the implementation and try to figure out what happens and which class it was that was missing a particular Tvmult_add method for example. (That could be addressed by concepts, but I'm not sure we want to wait for them ;) )

Given that we are not yet in the position of going only with variant 4 (C++11) for at least one more release, I prefer to (also) improve on variant 1 as of now. But it would definitely make sense to start constructors for 4 as well as it will be the more flexible solution. Any opinions?

davydden commented 8 years ago

I think (2) and (3) will not be flexible enough to allow for different smoothers on different levels. Something along the same lines is the combination of h-multigrid and p-multigrid where the smoother on the finest level is done on, say, quadratic elements and everything else is using linear elements on h-refined meshes (not that it's is currently possible, but just in concept).

I like the idea of looking at both (1) and (4).

In my opinion (4) is more versatile for complicated cases like when an operator is a linear combination of two matrices A+cB. It should also work with MatrixFree. One could of course argue that the same can be achieved by writing custom classes for (1), but why should we force users to do this if there is an easier way already within the library?

The only issues with (4) i could think of is performance in case someone has a fancy linear operator. But for a LInearOperator consisting of a single matrix (or MatrixFree), from my understanding it should not result in any performance penalties compared to MGMatrixBase. Maybe @luca-heltai or @tamiko can comment on LinearOperator's side as a possible implementation strategy for geometric multigrids.

As for improving (1), my understanding is that MGMatrixBase is used to derive two classes -- one for direct application of a matrix (mg::Matrix< VectorType >), and one for a single block of a block matrix (MGMatrixSelect<MatrixType, number>). Maybe the two cases can be substituted by MGLevelObject<matrix_t> via (3) approach (templated constructor with MatrixType)?

guidokanschat commented 8 years ago

Seems I should have joined this long ago. It got lost in the general communication overload of the Dean of Studies. Sorry!

Let me comment first on the original issue: I strongly recommend to use mg::SmootherRelaxation. Its implementation can be made more efficient, compare for instance direct implementation of Gauss-Seidel iteration to the preconditioner.

Evenmore, the preconditioning interface does not work with overlapping Schwarz methods. Which brings us to the issue that a smoother is an affine operator, not a linear one. Therefore, option 4 above would not work.

That being said, I have no personal interest anymore in the preconditioner interface to smoothers and do not object to any changes there.

Comments on other points soon...

kronbichler commented 8 years ago

I strongly recommend to use mg::SmootherRelaxation. Its implementation can be made more efficient

That depends on the type of smoother. If I have control over the matrices, then I absolutely agree. However, I did not see this working for PreconditionChebyshev (that's how we call the Chebyshev acceleration of Jacobi). This is the most important smoother in my applications right now and good performance is critical because it gets used in large CFD simulations. I hope to replace it by something better but we have yet develop those fast overlapping Schwarz implementations and fast smoothers - it's on my agenda as you know ;-)

The same problem arises when using external smoothers, say Trilinos SSOR (Gauss-Seidel as they call it) which comes through a preconditioner interface rather than a smoother interface.

As I see it right now, we would need two slightly different multigrid algorithms for the preconditioner and relaxation based smoothers if we want to provide optimal paths for them both. Not optimal but given that a V-cycle is so short it would not be terrible. I have to explore those options a bit more to see how to best meet these requirements and get the preconditioner based variants without the current overhead (avoiding all_zero() yet avoiding the mat-vec on a zero vector, memory management).

guidokanschat commented 8 years ago

Now to the more general redesign suggested in the later discussion. I am not too proud of the multigrid interface and a redesign has been on my mind for a while.

Some items to keep in mind

FAS: I agree with others that we should consider FAS when we think about redesigning. While I do not believe that both linear and nonlinear multigrid will inevitably end up in the same object, they should be as close as possible. And the lessons learned from implementing FAS will make this redesign a much stronger improvement.
Heterogeneous multigrid: meaning different operators on different levels. Not necessarily hp, but we should keep it in mind.
MFTransfer is currently a schizophrenic class. It does two fairly unrelated things: copying a global vector into level vectors and the level transfer inside multigrid. Those two should be separated (adding one parameter to Multigrid).
global_to_level and level_to_global: allow for different discretizations for global and multigrid. That will enable 2-level methods where the levels are not distinguished by mesh levels. Examples: High order discretization and low order multigrid (I hate it, but popular); radiative transfer, where multigrid is only needed for one out of many coupled components.
Level transfer: I am convinced that as soon as we do multigrid for nonlinear problems, even for the Newton linearization, in a cochain setting, the solution should be projected by the commuting projectors, not by the currently built-in $\ell_2$-projection. That is, the grid transfer operator should have a plug-in for other restrictions or even prolongations. Grid transfer in this case must also take into account that we transfer functions and forms, namely condensed and distributed vectors.
What happens at the interface? The implementation of the grid transfer in conjunction with local smoothing is a mess. Whenever you touch the transfer operators, it will hit you. Multigrid is not very well covered by the test suite and that's something we should look into before making major changes.
There is still an issue with elements that are continuous as well as discontinuous. But I hope to be able to generate enough interest in Heidelberg to address this soon.

General overhaul of Multigrid class

Among Martin's four suggestions.

I like (1) because I wrote it ;-) Seriously, unless a complete overhaul induces a major improvement, we can keep it as it is.
I do not think that (4) will work with overlapping Schwarz smoothers (see last comment)

Leaves us with (2) and (3), which seem to be syntactic variants of the same option. After a heated discussion with @appiazzolla and @masterleinad, I see indeed a point in the unrolling/inlining issue (yes, I can fight and learn at the same time, but not with the same speed). When looking at multigrid in a parallel context, there is a lot of synchronization. The algorithm synchronizes after each smoothing step, when it computes a residual and at each level transfer. In the context of local smoothers (Schwarz), his is not necessary. We can project to the coarse grid as soon as all smoothing operations of all children of a cell and their neighbors are finished. Same going back up, where we even do not need the neighbors.

Avoiding this synchronization involves reshuffling different mesh loops. I am not the one who wants to implement the scheduling, but it would make a redesign absolutely worthwhile.

Rethinking multigrid

We are always considering Multrigrid as a preconditioner for a fine level problem. But, what about the other way around? Fine levels contribute as improvements to the coarser levels, such that only the differences to coarse level solutions are computed there. This is basically what is used in the 2-level Schwarz theory (at least in the older version with Ritz projections), where you show that the smoother is a solver on the quotient space. The Peano code developed at TUM in the 90s did something like that. They even went so far to use floating point numbers with less accuracy on the finer grids.

guidokanschat commented 8 years ago

@kronbichler Kill the all_zero() the following way:

the start vector on each level is zero
the result of every affine smoother with zero input is the right hand side
use a flag to reduce the number of pre-smoothing steps by one

bangerth commented 8 years ago

I don't know much about the MG code (any more), so am only going to weigh in from a software design viewpoint: I'm not all that much a fan any more of making things templates. We have function objects, LinearOperator, and (for things that can afford the virtual function call) base classes that (i) lead to better error messages, and (ii) are more flexible if you want to use different kinds of objects on various levels. If you go with templates, you have to use the same object on every level (the object's functions may of course switch internally depending on the level, but that's not exactly elegant).

So from a design viewpoint -- writing easy to understand and easy to learn code -- I'm quite in favor of options 1 and 4 of @kronbichler 's list.

bangerth commented 8 years ago

(In fact, I think option 4 with LinearOperator is the most elegant. We may not (yet) be able to use it, but we could see if maybe we can come up with a design that makes such a switch relatively simple later on.)

davydden commented 8 years ago

@guidokanschat

Which brings us to the issue that a smoother is an affine operator, not a linear one. Therefore, option 4 above would not work.

The fact that the operator is not a linear IMHO will not prevent using LinearOperator class (yes, sounds strange, i know :smile: ). The class is, of course, intended for linear operators [1], but from the programming perspective all we need is its std::function< void(Range &v, const Domain &u)> vmult.

I am still convinced that this is the most flexible and feature rich approach.

[1] things like const auto op = op_a + k * op_b

davydden commented 8 years ago

p.s. affine operators like f(x)=Ax+b would need to be wrapped in LinearOperator class to provide vmult, but it should be possible (@tamiko @luca-heltai am I correct here?). One, of course, should not use this object to define another LinearOperator object or try using Tvmult.

note: i edited the last two message to fix typos.

davydden commented 8 years ago

Maybe we should introduce a class AffineOperator with std::function<void(Range &, const Domain &)> vmult;, std::function<void(Range &, const Domain &)> vmult_add; and bare minimum needed here taken from LinearOperator. Then we derive LinearOperator from it. In this case everything will be consistent, flexible and without hacks. One will be able to use either with GMG.

tamiko commented 8 years ago

On Fri, Sep 16, 2016, at 13:42 CDT, Denis Davydov notifications@github.com wrote:

p.s. non-affine operators like f(x)=Ax+b would need to be wrapped in LinearOperator class to provide vmult, but it should be possible (@tamiko @luca-heltai am I correct here?). One, of course, should not use this object to define another LinearOperator object or try using Tvmult.

[... without having read any of the context ...]

We deliberately decided to not do that, i.e. LinearOperator has the underlying assumption that it is, well, a linear operator. Otherwise, semantics get very strange (at a few places the assumption that LinearOperator is a linear operator is used in the implementation - be warned).

If possible, use a "PackagedOperation" instead, and later on use it in a tight loop (with little overhead):

// declare temporary storage and define application Vector<...> x; const auto packaged_op = linear_operator(A) * x + b;

for( ... ) ... = package_op.apply();

If this is not what you want, or if you want to have the storage implications being explicit, create a lambda expression:

const auto affine_expression = [](const Vector... &x, const Vector... &b) -> const Vector{ return linear_operator(A) * x + b; };

(Do not store this as a std::function, use the lambda expression as auto type directly)

guidokanschat commented 8 years ago

May be I should clarify my statement in the way that smoothing is mathematically speaking an affine operator, but that it cannot be written as A*x+b without considerable loss of efficiency. Indeed, you can always write it as $P^{-1}(Ax-b)$, but that is of minimal value for the implementation. It also goes beyond the expression in the previous comment by @tamiko, since the action of $P^{-1}$ and $A$ are intertwined locally, not globally. Everybody who wants to go this way, please have a look at the error propagation operator of the multiplicative Schwarz method and rewrite it in terms of linear operators. For a simple Gauß-Seidel method, the penalty is already 50%, if you use MGSmootherPrecondition. For overlapping decompositions, it is much worse.

I agree with @bangerth, that base classes offer an interface for heterogeneous smoothers and a much more intuitive interface, which should only be given up, if it leads to performance loss or excessive coding overhead. Here, we talk about few classes. Not even every user of multigrid will write his own smoother class. Leaves us with performance loss. But the current multigrid structure is coarse grained, so this is not an issue either.

To my mind, the template/inlined version makes sense, if we completely rethink the multigrid method in terms of local operators. That is, performing the projection to the coarse grid right after smoothing has been finished on all children. Such a method may be our goal in a few years, but right now we do not know enough about dependency graphs to do this well.

My suggestion for right now: the group who favors templates could provide the derived classes for MGCoarseGridBase and mg::SmootherRelaxation that allow the preferred use of lambda expressions. Let's see where this goes.

kronbichler commented 8 years ago

@guidokanschat I see. I probably need to read a bit more about the relaxation-based smoothers as I seem to be overly skewed to preconditioner interfaces (I must have read too much Trilinos/AMG literature). At some point, I (and my student Niklas) need to learn more about the Schwarz smoothers and how far we can get with those, so consider me interested for a meeting in Heidelberg... ;-)

Given these considerations, I think we came to the conclusion that the base class interface, variant 1, is the appropriate interface for the medium term. I cannot realistically contribute much to the lambda expression interface in the next year as my two bigger projects are to get the matrix-free DG code and the DoFHandler reorganization (#2000) done.

I also agree with @guidokanschat that we should re-thinking the multigrid algorithms in terms of local interfaces at some point in the future. But the whole dependency graph issue is not exactly well-developed yet. I've tested several concepts and talked to a number of people and did not find performance evidence that requires an immediate departure from the coarse-grained approach that does one operator evaluation at a time (with MPI). As a matter of fact, things are not even really addressed in the structured finite difference case AFAIK where one would think that wavefront concepts (temporal blocking over several operator evaluations/smoothing steps) gives a clear performance benefit as you are memory bandwidth limited on the input/output vectors. Let alone our more unstructured/higher-order discretizations. My view is that we should first saturate some hardware performance metric with the classical 'one operator evaluation on the whole domain at a time'. We have quite a bit of work to do to get there on the evaluation side. Likewise, we need to figure out mathematical concepts of good smoothing in the high order case (and fast inversion of the Schwarz operators). I don't want to code a threaded (dynamic?) task management that resolves the performance issues I had with the tools I tried (be it bad cache usage or NUMA issues), not to mention task scheduling over the MPI network. (But I am happy to join anyone with interest on these topics.)

kronbichler commented 7 years ago

We have completed the most important steps of this issue, except for one unnecessary matrix-vector product at the start of the operation. I will postpone this for the next release 9.0.

dealii / dealii