Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Runtime attributes contain values of the current state in last_... variables #6302

Closed: ekeih closed this issue 4 years ago

ekeih commented 6 years ago

Hi,

I will try to explain our use-case to make the rest of the issue easier to understand. For different state-change sequences we need different notifications in our setup:

  1. Ok -> Warning -> Ok: No notifications at all
  2. Ok -> Critical -> Warning -> Ok: Recovery Notification

For 2. we tried to implement the following condition in our notification script:

notification.type = RECOVERY AND
service.last_state = WARNING AND
service.last_state_critical > service.last_state_ok
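
For illustration, here is a minimal sketch of that attempted filter as a Python notification script, assuming the runtime attributes are passed to the script as environment variables by the NotificationCommand (the variable names below are hypothetical, not Icinga2 defaults):

    import os

    # Hypothetical environment variables; we assume the NotificationCommand
    # exports notification.type, service.last_state and the last_state_*
    # timestamps under these names.
    notification_type = os.environ.get("NOTIFICATION_TYPE", "")
    last_state = os.environ.get("SERVICE_LAST_STATE", "")
    last_state_critical = float(os.environ.get("SERVICE_LAST_STATE_CRITICAL", "0"))
    last_state_ok = float(os.environ.get("SERVICE_LAST_STATE_OK", "0"))

    # Notify only if the service was WARNING right before recovering AND it
    # went through CRITICAL more recently than it was last OK.
    if (notification_type == "RECOVERY"
            and last_state == "WARNING"
            and last_state_critical > last_state_ok):
        print("send recovery notification")

As explained below, the third condition is exactly what breaks: by the time the recovery notification runs, last_state_ok already points at the current recovery, not at the last OK before the problem.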

Currently this does not seem to be possible (at least not in the way we expected). The runtime attributes in https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#service have confusing names: last_state holds the state before the current change, but the last_state_ok / last_state_warning / last_state_critical timestamps already refer to the current state once it is reached.

So during the Warning -> Ok change in 2. we have no way to access the timestamp of the last time that the service was Ok.

So in my opinion the last_... variables should contain the timestamp of the last state change BEFORE the current state change. Alternatively, the last_state variable would have to contain the current state to be consistent with the other variable names. Another option would be to improve the variable names and the documentation; this would not solve our use-case, but it would avoid confusion caused by the different meanings of "last" in the variable names.

I hope my description is comprehensible. If not, I will try to explain it again ;)

(I guess the same applies to the host runtime attributes.)

dnsmichi commented 6 years ago

I've only read the requirements, but they suggest a notification filter that excludes the Warning state, thus notifying only on Critical (and on OK when it recovers).

What's wrong with that attempt and why bother with your own state history logic later on?

ekeih commented 6 years ago

Thank you for your super fast answer :)

As far as I understand it, a notification with a filter that excludes the Warning state would not allow me to ONLY send a recovery when the service changed like this: Ok -> Critical -> Warning -> Ok, but not when it changed Ok -> Warning -> Ok. Right? Or do you mean something else?

In the second half of my text I also explain why I think that the current names of the variables are wrong, or at least confusing. I would like some feedback on what you think about that. (Of course I would prefer a solution where the variables contain the content I expected after reading the documentation, especially as I think that historic information is much more useful in these variables.)

ekeih commented 6 years ago

Any thoughts about my last message? :)

dnsmichi commented 6 years ago

Sorry, this is one of those issues where I need time, pen and paper. I'm waiting for others to share their input meanwhile.

ekeih commented 6 years ago

No need to apologize, I will wait. Thanks for the confirmation that this issue is not forgotten. :)

dnsmichi commented 6 years ago

Might be a candidate for the OSMC hackathon.

morremeyer commented 5 years ago

As a side effect, service.duration_sec is always very small (e.g. 0.000767) in case of a recovery, because it contains the number of seconds since the current state change. We would expect the time since the previous state change.

In that specific use case, service.duration_sec is not the metric we should use; I consider its naming "correct", as it tells you how long the service has been in the state it is currently in.

So it isn’t really useful for this issue with the last_* values not conforming to our expectations.

morremeyer commented 5 years ago

After talking to @lippserd and @Crunsher and discussing different ideas and solutions, I think the most elegant solution is to not change the current behaviour, as it is the one that is technically correct.

Instead of altering the behaviour, I propose adding two values: previous_worst_hard_state and previous_worst_hard_state_change.

As the "state worseness" is OK -> WARNING -> CRITICAL -> UNKNOWN, the value of previous_worst_hard_state will be updated only if you traverse this order to the right.

Once the state changes from OK to something else again, previous_worst_hard_state will be reset to OK.

See the attached graph of state changes for details.

(attached image: icinga2-state-graph)

I named them previous_* to make clear that they refer to a state recorded before the last_state_change happened, in contrast to the last_* values.

In pseudo code, it would need to look like this:

if last_state == OK
    # leaving OK: a new problem episode starts
    previous_worst_hard_state = OK
    previous_worst_hard_state_change = last_hard_state_change  # the old value, before it is updated
else if old_state >= state  # ">=" meaning "at least as bad"
    previous_worst_hard_state = old_state
    previous_worst_hard_state_change = last_hard_state_change  # the old value, before it is updated
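
To make the intent concrete, here is a small runnable sketch (Python, purely illustrative, not Icinga2 code). One detail is read differently than in the pseudo code above: the comparison is made against the worst state recorded so far instead of against the new state, so that the Ok -> Critical -> Warning -> Ok sequence from the original report ends with previous_worst_hard_state = CRITICAL:

    # State "worseness": OK < WARNING < CRITICAL < UNKNOWN.
    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    def on_hard_state_change(previous_worst, old_state):
        """Return the new previous_worst_hard_state on a hard state change;
        old_state is the state the checkable is leaving."""
        if old_state == OK:
            return OK  # leaving OK: a new problem episode starts
        if old_state >= previous_worst:  # remember the episode's worst state
            return old_state
        return previous_worst

    previous_worst, state = OK, OK
    for new_state in (CRITICAL, WARNING, OK):
        previous_worst = on_hard_state_change(previous_worst, state)
        state = new_state

    print(previous_worst == CRITICAL)  # True: the worst state survives the recovery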

There are two open questions I can't answer myself, so please give me feedback on them:

  1. Does this make sense for the software? I’m too far into this in my mind to answer that question anymore.
  2. Is this also needed for soft states?

@dnsmichi @Crunsher could you look at this and give me feedback, please?

Crunsher commented 5 years ago

> Is this also needed for soft states?

Nah, that's not really helpful information. Checks may randomly fail once in a blue moon and we don't care about that.

> Does this make sense for the software?

We certainly want this for icingadb ^_^

morremeyer commented 5 years ago

Thanks for the feedback!

I’ll start implementing this next week.

dnsmichi commented 5 years ago

I trust your expertise on this, since you've discussed it in person during the OSMC hackathon. It is also a good idea to create new attributes, which allows keeping the current known behaviour. If this turns out well for a good reporting feature, I am all for it.

mj84 commented 5 years ago

Alright, as suggested by @dnsmichi in #5533, I'm gonna explain my reporting situation a bit further.

In order to reliably determine SLA values (the percentage of a time frame in which a monitored object was available), I am only interested in hard state changes, which are populated by Icinga2 in the IDO table icinga_statehistory (soft states don't count towards non-OK time). Each entry in that table is associated with a monitored object and represents a state change. It has a timestamp, some info about the current state, and a flag indicating whether it is a hard state change, which is what needs to be accounted for. Basically, I wrote a loop over all entries in a given time frame and determine whether the duration of a specific state counts as OK time or NOK time (downtimes are honored as well). At the end you do OK seconds / total seconds * 100 = availability, and that's it.
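
A minimal sketch of that loop in Python, assuming the hard state changes have already been fetched for one object and time frame, ordered ascending by timestamp (the tuple layout is illustrative, and downtime handling is omitted for brevity):

    def availability(rows, frame_start, frame_end, initial_state=0):
        """rows: list of (unix_ts, hard_state) tuples, sorted ascending;
        hard_state 0 means OK, anything else counts as NOK."""
        ok_seconds = 0.0
        cursor, state = frame_start, initial_state
        for ts, hard_state in rows:
            if state == 0:  # the interval ending at this change was OK
                ok_seconds += ts - cursor
            cursor, state = ts, hard_state
        if state == 0:  # account for the tail of the time frame
            ok_seconds += frame_end - cursor
        return ok_seconds / (frame_end - frame_start) * 100

Passing initial_state also covers a time frame without any state change at all.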

Currently (because last_hard_state does not behave the way I expected it to :) ) I have to walk over all entries in a given time frame (using a for loop) and manually check if the previous state was a hard state, in order to calculate the amount of OK seconds or NOK seconds, respectively.

So in order to reduce the complexity of said availability function, it would be very helpful to know the value of the last (previous) hard state. Then I could just do a SELECT on all hard state changes and calculate the duration of OK time or NOK time based on the values of the previous hard states.
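
With such a previous hard state stored on every row, each row becomes self-contained: the interval ending at a state change can be attributed via the row's own previous_hard_state, without carrying state through the loop or fetching the event before the time frame. Roughly (again with illustrative field names):

    def availability_with_previous(rows, frame_start, frame_end):
        """rows: non-empty list of (unix_ts, hard_state, previous_hard_state)
        tuples, sorted ascending; 0 means OK."""
        ok_seconds = 0.0
        cursor = frame_start
        for ts, _hard_state, previous_hard_state in rows:
            if previous_hard_state == 0:  # interval before this change was OK
                ok_seconds += ts - cursor
            cursor = ts
        if rows[-1][1] == 0:  # tail: the state after the last change
            ok_seconds += frame_end - cursor
        return ok_seconds / (frame_end - frame_start) * 100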

Thomas-Gelf commented 5 years ago

It's good to see this discussion going on. However, it seems to me that different use cases are tackling slightly related issues from completely different perspectives.

@mj84: this alone will not help you. It is not enough. You'll lose events. You might have no event in the chosen period. You might be forced to mix in events from before or after the chosen period, or even the current object state, to get a meaningful result. Downtimes might come into play; they might be considered "legal" SLA violations. Downtimes could be nested. Someone asks you to calculate availability only for specific time periods (weekdays, 9 to 5). Another one wants to "fix" an SLA afterwards.

Believe me, for your use-case, time dedicated to this issue will be wasted. I've been there before. With PostgreSQL, pick a cursor, jump from row to row, do your math. Here is a working example for MySQL/MariaDB. It doesn't address every weird scenario, but it isn't that bad at all. It needs to be written in a different way for PostgreSQL, but you can eventually steal some ideas.

Cheers, Thomas

mj84 commented 5 years ago

@Thomas-Gelf: Actually, my approach is kind of finalized and has been used in production in my environment for about 10 months, mostly generating monthly and weekly reports.

I implemented my PL/pgSQL function after inspecting the icinga_availability MySQL function that exists in Icinga Reports 1.x. I have taken care of some cases you mentioned, such as when no state change occurs during a time frame. Downtimes are considered as well, though I am not 100% sure about nested downtimes; currently, nested downtimes might lead to an availability > 100%. Sure, there is always the possibility that someone wants to fix an SLA by creating a downtime in the past, though that could be mitigated in Icingaweb2 / Icinga2 by disallowing downtimes with a start date in the past. Granted, the complexity of SLA calculations knows no limits; I bet there is always some rare use case which invalidates any existing approach. But for our (relatively) simple and straightforward case with 24x7 SLA and check periods, we found a solution in our custom Postgres function. I will surely have a look at your MySQL example and the reporting module for Icingaweb2; maybe I can provide input for a Postgres variant!

Cheers

morremeyer commented 5 years ago

@mj84 I see your point, but I agree with @Thomas-Gelf here. While your current approach works for you, I consider it too specific to turn into a general-purpose solution.

I have thought about not only adding the previous_worst_hard_state metric, but also a full set of previous_* metrics corresponding to the currently existing last_* metrics.

The idea behind that would be that you can look at a state change and always determine the state after the change by looking at the last_* values and the state before the state change by looking at the previous_* values.

I have not considered the implications of this yet; once I have, I'll open an issue to discuss them. But as far as I understand your use case, this would then also solve your issue.

But for this issue, I’ll limit myself to the introduction of the previous_worst_hard_state metric.

N-o-X commented 5 years ago

@mauricemeyer do you have any update on the implementation state? I would take over that task if you don't have anything lying around yet.

morremeyer commented 5 years ago

I have a work-in-progress implementation with loads of debug output in my fork.

Work on it will continue on Monday as I’m currently on vacation.

morremeyer commented 5 years ago

I think I need some help with how to make it work for hosts, too - maybe I’m missing something or I can’t see the wood for the trees any more 🤔

The change in my feature branch is working for services, but there’s something wrong when I try to apply this to hosts.

The current state (timestamps are not updated yet) can be compared to the master easily.

As you see, I added the reset to OK on line 211 - but this does a hard-coded reset to ServiceOK.

Will this also work for hosts? I never really looked into host states, as they always "just worked" for me. My understanding is as follows:

The host derives its state from the check_command it uses. This checkable is therefore treated as a service that never appears to the user, because the host wraps this "pseudo-service". The host state is thus just an abstraction layer above a service.

Am I correct with this or did I get something fundamentally wrong?

Crunsher commented 5 years ago

Yes, your assumption is correct: the host states are just wrapped service states. That means resetting the state to ServiceOK should work in both cases. I'd have to take a closer look at why you are seeing incorrect states.
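
For illustration, the wrapping can be sketched like this (Python, not the actual Icinga2 implementation; Icinga2 maps the wrapped check's OK/WARNING states to an Up host and CRITICAL/UNKNOWN to Down):

    # Sketch of the "host wraps a pseudo-service" idea.
    SERVICE_OK, SERVICE_WARNING, SERVICE_CRITICAL, SERVICE_UNKNOWN = 0, 1, 2, 3
    HOST_UP, HOST_DOWN = 0, 1

    def host_state_from_service_state(service_state):
        """Derive the host state from the wrapped check's service-style
        state: OK/WARNING map to UP, everything else to DOWN."""
        if service_state in (SERVICE_OK, SERVICE_WARNING):
            return HOST_UP
        return HOST_DOWN

    # Which is why a hard-coded reset to ServiceOK also works for hosts:
    assert host_state_from_service_state(SERVICE_OK) == HOST_UP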

morremeyer commented 5 years ago

Okay, thanks for that info. I will debug this again next week to find out if I’m just misreading something or if it’s an actual error.

morremeyer commented 5 years ago

I have tested some more, and it looks like defining my test check as the check_command does exactly nothing.

Can somebody please have a look at / test my current implementation and tell me if I just failed to set up my test environment correctly or if there is really an issue?

dnsmichi commented 4 years ago

IMHO that's implemented with previous_hard_state etc., coming with 2.12 for IcingaDB.