Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Track effect of an object on dependent children #10158

Open nilmerg opened 2 weeks ago

nilmerg commented 2 weeks ago

Is your feature request related to a problem? Please describe.

In Icinga DB Web we'd like to show an indicator in lists showing the number of potentially affected children of a particular host/service. All affected children, i.e. also grandchildren.

Describe the solution you'd like

Icinga must calculate this for each parent during startup. Replace "startup" with whatever you like; my expectation is just that Icinga does not need to calculate this on every state change.

Though, what I'd like Icinga to calculate on every state change, is whether a parent is responsible for any now unreachable child, wherever in the hierarchy. (Somewhat similar to #10143)

The result should be that in the database the number is available in e.g. host.affected_children (uint) and host_state.affects_children (bool enum).
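
As a rough illustration of the startup-time part of this request, the total number of children, grandchildren included, can be derived by walking the dependency graph once per parent. This is only a sketch in Python, not how Icinga 2 implements or would implement it; the function name `count_affected_children`, the `(parent, child)` input format, and the object names are assumptions made for this example.

```python
from collections import defaultdict


def count_affected_children(dependencies):
    """Return {parent: number of distinct descendants}.

    `dependencies` is an iterable of (parent, child) pairs. This input
    format is an assumption for the sketch, not Icinga's data model.
    """
    children = defaultdict(set)
    for parent, child in dependencies:
        children[parent].add(child)

    def descendants(node):
        # Iterative depth-first walk over all children and grandchildren.
        seen = set()
        stack = list(children.get(node, ()))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(children.get(current, ()))
        seen.discard(node)  # do not count the node itself if a cycle exists
        return seen

    return {parent: len(descendants(parent)) for parent in list(children)}


# Hypothetical objects: "db" depends on both "router" and "switch".
counts = count_affected_children([
    ("router", "switch"),
    ("switch", "web"),
    ("switch", "db"),
    ("router", "db"),
])
```

Here `counts["router"]` is 3 and `counts["switch"]` is 2. Objects that are never a parent do not appear in the result, so `counts.get(name, 0)` yields their count. The numbers only change when the configuration changes, which matches the expectation that nothing needs recomputing on a state change.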

julianbrost commented 1 week ago

Though, what I'd like Icinga to calculate on every state change, is whether a parent is responsible for any now unreachable child, wherever in the hierarchy. (Somewhat similar to #10143)

The result should be that in the database the number is available in [...] host_state.affects_children (bool enum).

I'm not 100% sure what this is asking for. Is this supposed to say whether any of the potentially affected children is actually in a problem state?

raviks789 commented 1 week ago

This column says that there may be at least one child that would be in problem state if there is a problem with the parent.

julianbrost commented 1 week ago

I still don't get it. A configured dependency does not imply that the child must be in a problem state if the parent is in a problem state. It just says that if both are failed, there's a good chance that one caused the other. Can you provide an example of a dependency structure and how you'd expect that bool to be set?

yhabteab commented 1 week ago

The result should be that in the database the number is available in e.g. host.affected_children (uint) and host_state.affects_children (bool enum).

I actually do understand it this way: the host.affected_children column just shows the number of children dependent on that host, and host_state.affects_children is just the boolean representation of the expression host_state.affected_children != 0. A host in a problem state may not directly affect its children when it is part of a redundancy group; in that case host_state.affected_children would simply be 0 and host_state.affects_children would be set to false accordingly.

raviks789 commented 1 week ago

This is just a simple example.

Suppose we have a parent, say Service-A (I will assume the parent is a service here), with two children (Child-1, Child-2), and its dependency is configured to fail if Service-A is not in OK state. One of the children also belongs to another dependency: its parent, say Service-B, has one child, Child-2, and that dependency is configured to fail if Service-B is in neither OK nor Warning state.

Now, if Service-B is in OK or Warning state and Service-A is OK, then both Child-1 and Child-2 are reachable and service_state.affects_children is false for both parents. If Service-A is not in OK state (while Service-B is still OK or Warning), then both children are unreachable and service_state.affects_children is true for Service-A but false for Service-B. If Service-A is in OK state and Service-B is in neither OK nor Warning state, then only Child-2 is unreachable and service_state.affects_children is true for Service-B only. And if Service-A is not in OK state and Service-B is in neither OK nor Warning state, then service_state.affects_children is true for both and both children are unreachable again.
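
The four cases above can be replayed with a small Python sketch. It assumes the simplest possible model: each dependency lists the parent states in which it holds, a child is unreachable as soon as any one of its dependencies fails, and redundancy groups and time periods are ignored. The names `DEPENDENCIES` and `evaluate` are made up for this example and are not Icinga identifiers.

```python
OK, WARNING, CRITICAL = "OK", "Warning", "Critical"

# (parent, child, parent states in which the dependency holds).
# Mirrors the example: Service-A must be OK, Service-B must be OK or Warning.
DEPENDENCIES = [
    ("Service-A", "Child-1", {OK}),
    ("Service-A", "Child-2", {OK}),
    ("Service-B", "Child-2", {OK, WARNING}),
]


def evaluate(parent_states):
    """Return (affects_children per parent, reachability per child).

    Assumption: a child is unreachable if at least one of its
    dependencies has failed (no redundancy groups in this sketch).
    """
    failed = [
        (parent, child)
        for parent, child, ok_states in DEPENDENCIES
        if parent_states[parent] not in ok_states
    ]
    affects = {parent: False for parent, _, _ in DEPENDENCIES}
    for parent, _ in failed:
        affects[parent] = True
    unreachable = {child for _, child in failed}
    reachable = {child: child not in unreachable for _, child, _ in DEPENDENCIES}
    return affects, reachable


affects, reachable = evaluate({"Service-A": OK, "Service-B": CRITICAL})
```

With Service-A in OK and Service-B in Critical, `affects` is true for Service-B only, and only Child-2 is unreachable. The child states never enter the calculation.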

Example evaluation of affects_children

| Service-A | Service-B | Service-A.affects_children | Service-B.affects_children | Child-1 | Child-2 |
|---|---|---|---|---|---|
| OK | Warning | false | false | reachable | reachable |
| Warning | OK | true | false | unreachable | unreachable |

julianbrost commented 1 week ago

Now, if Service-B is in OK or Warning state and Service-A is OK then both Child-1 and Child-2 are unreachable

I guess that should say "reachable" instead?

You never mentioned any state of the children. Does this imply that affects_children does not have to take this into account?

Let me have another shot at rephrasing this so that we can see whether we're thinking of the same thing now: x.affects_children says whether there exists any child that has a path of failed dependencies to x. However, I have a hard time describing what that would tell the user in the end, like "fixing this checkable will make other checkables reachable again" (not necessarily, there could be a second failed dependency) or "fixing this checkable is required to make other checkables reachable" (not necessarily, if a redundancy group is involved, fixing another checkable could make the children reachable as well).
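
To make the "path of failed dependencies" wording concrete, here is a minimal Python sketch of that reading. It assumes the set of currently failed dependencies is already known as `(parent, child)` pairs and ignores redundancy groups. The function names and object names are hypothetical and not part of Icinga.

```python
from collections import defaultdict


def children_via_failed_paths(failed_dependencies, x):
    """Return all children connected to x by a path of failed dependencies.

    `failed_dependencies` is an iterable of (parent, child) pairs whose
    dependency is currently failed (an assumed input for this sketch).
    """
    below = defaultdict(set)
    for parent, child in failed_dependencies:
        below[parent].add(child)

    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        for child in below.get(node, ()):
            if child not in seen and child != x:
                seen.add(child)
                stack.append(child)
    return seen


def affects_children(failed_dependencies, x):
    # True if at least one child has a failed path leading up to x.
    return bool(children_via_failed_paths(failed_dependencies, x))


# Hypothetical chain: router -> switch -> web, both dependencies failed.
failed = [("router", "switch"), ("switch", "web")]
```

In this chain, `affects_children(failed, "router")` and `affects_children(failed, "switch")` are both true, while `affects_children(failed, "web")` is false. Under these assumptions, any such path starts with a failed dependency directly below x, so the bool is true exactly when x has at least one failed direct dependency.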

https://github.com/Icinga/icingadb-web/issues/1058 doesn't really help me understand either what information this bool should convey to the user in the end.

nilmerg commented 1 week ago

Forget child states. (Don't confuse this with reachability!) They're not relevant at all when talking about reachability.

affects_children has one single purpose: Show the total number (affected_children) only if the parent is actually responsible (i.e. one of its direct related dependencies results in the respective child being unreachable).

stateDiagram-v2
    state if_state <<choice>>
    [*] --> affects_children
    affects_children --> if_state
    if_state --> HideTotal: if n
    if_state --> ShowTotal: if y

I don't know what is difficult to understand here. If you need further explanations, we should discuss this in person tomorrow.

julianbrost commented 1 week ago

Thank you very much for this extraordinarily helpful flowchart. Unfortunately, it doesn't answer the question of why you want to hide the number sometimes. There must be something that makes the boolean different from just checking the number for zero. Is it supposed to be false if all children are still reachable due to another OK parent in a redundancy group? Is it supposed to be false if the parent is in a WARN state and all dependencies with it as a parent have states = [ OK, Warning ] set?

Forget child states. (Don't confuse this with reachability!) They're not relevant at all when talking about reachability.

So I guess that's a yes for the question I asked:

You never mentioned any state of the children. Does this imply that affects_children does not have to take this into account?

Does Yonas' comment describe what you're asking for? Otherwise, I have the feeling that the specification might be a bit unclear if multiple people fail to understand it.

Oh and your comment just added a new reason for confusion:

is responsible for any now unreachable child, wherever in the hierarchy

i.e. one of its direct related dependencies results in the respective child being unreachable

"wherever in the hierarchy" and "direct related dependencies" don't really fit together.

nilmerg commented 1 week ago

Thank you very much for this extraordinarily helpful flowchart.

Sorry, my impression was that I was communicating with Icinga professionals.

--

I don't think I need to outline the exact behavior of Icinga dependencies here. Ravi included one of them in his example, for clarification. Redundancy groups are of course another one, so are time periods. (surprise!)

The bool affects_children just indicates that a dependency, where the respective host/service is the parent, decides that its child is unreachable.

Can we stop nitpicking now, please?