Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Let redundancy groups not just fail #10190

Open nilmerg opened 1 month ago

nilmerg commented 1 month ago

This is not a bug, not yet, at least. Once #10014 is fixed, this should be considered.

tl/dr

Redundancy groups should not just fail, they should in addition be not reachable or not determinable, just like hosts and services are.

What's this about?

graph LR;
    ChildHost-->pa["ParentHostA (Group 1)"];
    ChildHost-->pb["ParentHostB (Group 1)"];
    pa-->ga["GrandParentA"];
    pb-->gb["GrandParentB"];

Right now, the child host will be unreachable in two cases:

  1. Both parents are down (expected)
  2. One of them is unreachable (The linked bug)

Once the second case is fixed, it will likely behave like this instead:

  1. Both parents are down (still, expected)
  2. Both parents are unreachable (new, expected)

But what happens in case only one of the parents is down and the other unreachable?

  1. One parent is down and the other unreachable (new, ??)

I suppose so, since already right now, the availability of the child host is influenced by its parent's reachable state. It will just be extended to consider redundancy groups in such a case.

But then there is Dependency::IsAvailable

Which is what is used to determine a redundancy group's state at the moment. Though, it doesn't consider the parent's reachable state at all. The parent might be UP, but unreachable, and it returns true.

So, in order to really ensure that the child host is unreachable in the third case, this might need to change. (The potential bug)

Of course, this has larger side-effects, as it changes how dependencies work in general.

Which is what this issue is really about! As I believe this is a good thing. Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway. This discrepancy will only pop up in case checks are disabled on the parent in question. (As if checks run, the parent will eventually go down/unknown)

Additional note

What led me to this, was the wish to understand how a redundancy group's state will be determined as part of https://github.com/Icinga/icingadb/issues/347.

As of today I thought it simply represents whether all related dependencies failed or not. But now a failed dependency might be failing, because a parent might just not be reachable. And so will a redundancy group.

This all sounds fine, unless you want to identify the actual root problem in a failing dependency chain. (A root problem is a dependency node, which is reachable and has a problem)

In a dependency chain, the current plan is, to represent redundancy groups as actual nodes with an actual state. While hosts and services also have state, they can in addition be not reachable. Redundancy groups can only fail.

But a redundancy group, with one host member that is UP but not reachable, is technically also not reachable. So it's not just a failing dependency, but one that is … not determinable?

yhabteab commented 2 days ago

Which is what this issue is really about!

To be honest, I don't quite understand what this issue is all about :). You have listed a number of open questions that don't quite fit the title of the issue.

Redundancy groups should not just fail, they should in addition be not reachable or not determinable, just like hosts and services are.

This particular case might be covered by #10227 as that issue is about syncing dependencies to Redis.

But what happens in case only one of the parents is down and the other unreachable?

Which is what is used to determine a redundancy group's state at the moment. Though, it doesn't consider the parent's reachable state at all. The parent might be UP, but unreachable, and it returns true.

So, in order to really ensure that the child host is unreachable in the third case, this might need to change. (The potential bug)

The Dependency::IsAvailable() method is concerned with verifying the availability of a given dependency, rather than determining the reachability of its direct or intermediate parents. With #10228, the child host will of course at least for me be unreachable. It is not entirely clear to me, however, why this particular case should be treated differently from the others.

I suppose so, since already right now, the availability of the child host is influenced by its parent's reachable state. It will just be extended to consider redundancy groups in such a case.

Isn't that what #10014 is about? That's a confirmed bug that will be fixed by #10228, but I don't see how that applies to this issue.

Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway.

This is already the case, all parents in a given dependency graph are evaluated recursively, but no redundancy groups have been taken into account so far.

But a redundancy group, with one host member that is UP but not reachable, is technically also not reachable. So it's not just a failing dependency, but one that is … not determinable?

Well, it's simply unreachable. Isn't the unreachable state intended for use cases like this? There's no need to add another meaning to it.

nilmerg commented 1 day ago

The main point of this issue is that redundancy groups don't have a reachability flag in the current design proposal. They need one however, as Web needs it to identify root problems. #10227 does not include this yet, as Icinga doesn't even consider it.

10228 does exactly miss the point I made here and introduces contradictory states: A host or service might now be unreachable, although every related dependency states it is available.

I don't know how #10227 determines a group's state at the moment, but let's assume it's a loop over all related dependencies which are asked if they're available. Which they all are.

And in this case, this contradiction is written to redis and subsequently to the database where Web will have a hard time following a hierarchy up to the actual root problem as when it reaches the group in question it stops, as it isn't failed. And of course, it didn't fail, it just doesn't have any member that is reachable.

And now let's assume every member of a redundancy group is itself dependent on another (grand) parent, just like in my example. The group only fails (and thus states the child isn't reachable) if all members are either unreachable themselves or in such a state that the dependency fails. But what if one of the members isn't in such a state that the dependency fails but it's (grand) parent is in such a state that their dependency fails? The answer: Nothing, unless the parent is checked. Which it won't if the parent <-> grandparent dependency disables checks.

--

I know this is a culmination of several issues. But I hope I made it clear now what I'm up to.

yhabteab commented 21 hours ago

The main point of this issue is that redundancy groups don't have a reachability flag in the current design proposal. They need one however, as Web needs it to identify root problems. #10227 does not include this yet, as Icinga doesn't even consider it.

Well, redundancy groups simply don't exist from Icinga 2's point of view, or at least they're not one of those things that can have states and I don't think it would make sense to change that. If in doubt, #10227, if I understood @oxzi correctly, will construct the redundancy groups virtually and sync them to Redis. So I definitely think this is something that should be fixed with or based on #10227's codebase.

10228 does exactly miss the point I made here and introduces contradictory states: A host or service might now be unreachable, although every related dependency states it is available.

With #10228, a host or service becomes unreachable when any of its parent dependencies become unreachable, so I don't know why this contradicts the point you made. After all...

Why shouldn't a dependency fail, in case another one higher up in the hierarchy fails? This is already the case, I think, anyway.

... isn't the PR doing exactly what you've asked for here?

I don't know how #10227 determines a group's state at the moment, but let's assume it's a loop over all related dependencies which are asked if they're available. Which they all are.

This can only be answered by @oxzi. Since https://github.com/Icinga/icingadb/issues/347#issuecomment-2306972320 have listed all the necessary database tables, I guess this needs to be done or maybe is already in progress with #10227.

And in this case, this contradiction is written to redis and subsequently to the database where Web will have a hard time following a hierarchy up to the actual root problem as when it reaches the group in question it stops, as it isn't failed. And of course, it didn't fail, it just doesn't have any member that is reachable.

That sounds to me like a problem that might occur in the future, not something that already exists. There isn't even a corresponding PR for #10227 yet, so there doesn't seem to be any point in discussing it here now. Originally I just wanted to understand this issue and see if I could fix it in parallel with #10227, but these two are mutually dependent.

But what if one of the members isn't in such a state that the dependency fails but it's (grand) parent is in such a state that their dependency fails? The answer: Nothing, unless the parent is checked. Which it won't if the parent <-> grandparent dependency disables checks.

If I understand your twisted question correctly :), this seems to be related to #10143.