Cacti / plugin_thold

Thold Plugin for Cacti
GNU General Public License v2.0

Time Based Thresholds broken in current develop #655

Closed: tddcacti closed this issue 3 months ago

tddcacti commented 5 months ago

Reposting from https://forums.cacti.net/viewtopic.php?t=62936&sid=88b9f75053e0cbc257a6d3f31330e47c

I'm trying to set up Time Based Thresholds and finding some unexpected behavior.

First, it appears that Cacti will send a "NORMAL" alert after a single breach, rather than waiting for the measured value to go above or below the High / Low values the configured number of times within the measurement window. To illustrate:

I've created a threshold template against the "Unix - Logged in Users" data source, as shown below:

[screenshot: loggedinusers_template threshold template]

I'm using a 1 minute polling cycle.

If I understand things correctly, this threshold should enter the Warning state only if there are three polling cycles within 10 minutes where the number of logged in users is greater than 1. And it should enter the Alert state only if there are three polling cycles within 10 minutes where the number of logged in users is greater than 3. In both cases the breaches don't need to be consecutive, they could be spread throughout the 10 minute window. And I should only ever get the "NORMAL" alert after either of those conditions has happened.
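To make the math concrete, here is a minimal sketch of how I would expect that evaluation to work (plain PHP, not the plugin's actual code; all variable names and values here are mine):

    // Illustration only: with a 1 minute polling cycle and a 10 minute breach
    // window, up to 10 samples fall inside the window.
    $samples       = array(0, 0, 2, 0, 0, 0, 0, 0, 0, 0); // last 10 polled values
    $warn_high     = 1; // Warning High Threshold
    $alert_high    = 3; // Alert High Threshold
    $warn_trigger  = 3; // breaches required inside the window
    $alert_trigger = 3;

    $warn_breaches  = count(array_filter($samples, function ($v) use ($warn_high)  { return $v > $warn_high; }));
    $alert_breaches = count(array_filter($samples, function ($v) use ($alert_high) { return $v > $alert_high; }));

    if ($alert_breaches >= $alert_trigger) {
        echo "ALERT\n";           // at least 3 alert breaches in the window
    } elseif ($warn_breaches >= $warn_trigger) {
        echo "WARNING\n";         // at least 3 warning breaches in the window
    } else {
        echo "no state change\n"; // a single breach should not produce a NORMAL alert
    }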

However I find that if I log into the server twice, so that the number of logged in users is 2, just long enough for 1 polling cycle to catch that number, and then I log out of both sessions, the next polling cycle sees 0 logged in users and a "NORMAL" alert is generated. No WARNING alert was generated, since the condition was not true for 3 out of the last 10 polling cycles.

I've tried this both with my production instance running Cacti 1.2.23 and Thold 1.5.2, as well as a test instance running 1.2.25 and the latest develop branch of Thold 1.8.

Looking at the code in thold_functions.php in the latest develop branch, I think I see the reason why this is happening. On lines 3311 and 3380 there are IF statements which check whether the variable "alertstat" is nonzero. That variable gets set on line 2165 to the value of thold_alert in the thold_data table. That value gets written during the previous polling cycle: if the previous polling cycle was a breach, thold_alert gets set to either STAT_HI or STAT_LO on line 2999 and saved to the database on line 3146. If the previous cycle was a warning breach, the same thing happens on lines 3151 and 3305.

I'm thinking that the issue here is that thold_alert is being set based on the current value, when it should instead be set based on whether the number of failures / warning_failures exceed the trigger / warning_trigger. Does this make sense or am I missing something?
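As a hypothetical sketch (not the actual plugin code; $failures and $trigger are stand-ins for whatever the real failure counter and trigger columns are called, and I've written the comparison as >= to match the behaviour I described above), the difference I am describing is roughly:

    // What the code appears to do now: thold_alert follows the *current* sample.
    if ($breach_up) {
        $thold_data['thold_alert'] = STAT_HI;
    } elseif ($breach_down) {
        $thold_data['thold_alert'] = STAT_LO;
    }

    // What I would expect for a time-based threshold: thold_alert should follow
    // whether the breach count inside the window has reached the trigger (the
    // HI/LO direction would come from the side the breaches were on).
    if ($failures >= $trigger) {
        $thold_data['thold_alert'] = STAT_HI; // or STAT_LO for a low breach window
    } else {
        $thold_data['thold_alert'] = 0;
    }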

The second thing I noticed, while reviewing the code for that first item, is that the code which sends ALERT / RE-ALERT messages is contained within the IF statement on line 2994 (if ($breach_up || $breach_down)). This means that alert messages will only be sent if the current value of the threshold is a breach. While this might work fine for the initial alert, since we would need the current polling cycle to get us over the trigger level, I believe it could lead to situations where the re-alert is not sent, for example:

If the current value of a graph is not a breach, but there are enough breaches in the measurement window to exceed the trigger, and we are at the re-alert cycle time, then a re-alert should be sent. I believe it would not be sent, however, since the current value is not a breach. If this is correct, then the logic around sending re-alerts might need to be moved outside of this IF statement. The same would be true for warning re-alerts.
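To illustrate the gap with made-up values (this is not a patch, just the shape of the problem):

    // All values here are made up for illustration.
    $alert_failures = 3;     // breaches recorded inside the breach window
    $alert_trigger  = 3;     // breaches required to alert / re-alert
    $breach_up      = false; // the current sample is back below the threshold
    $breach_down    = false;
    $realert_due    = true;  // the re-alert timer has expired

    // Current structure (as I read it): nothing is sent, because the
    // notification code only runs when the current sample is itself a breach.
    if ($breach_up || $breach_down) {
        if ($realert_due) {
            echo "re-alert sent\n";
        }
    }

    // Suggested structure: key the re-alert off the window count instead.
    if ($alert_failures >= $alert_trigger && $realert_due) {
        echo "re-alert sent\n";
    }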

The third item is more of a question about the intended functionality. Because I've been using the "Unix - Logged in Users" data source for testing, I've been dealing with very controllable whole numbers, which might be unusual compared with most graphs in Cacti (part of the reason I chose it for testing). I noticed that if the value of the graph is exactly the same as the threshold level--for example if my threshold is 1 logged in user, and I log in exactly once to the server--the threshold is not considered breached.

This appears to be based on the fact that in the code, for example lines 2948-2951, we are using "less than" or "greater than" and not "less than or equal to" or "greater than or equal to". I just wanted to confirm this is the intended functionality? If so then in my test case I could simply make my threshold 0.9 instead of 1.
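For reference, the comparison behaves like this with the values from my test (illustrative snippet, variable names are mine):

    $current  = 1; // exactly one user logged in
    $thold_hi = 1; // Warning High Threshold from my template

    // Strict comparison (what the code appears to use): not a breach.
    $breach_up = ($current > $thold_hi);            // false

    // Inclusive comparison: would be a breach.
    $breach_up_inclusive = ($current >= $thold_hi); // true

    // If strict comparison is intended, setting the threshold to 0.9 works around it.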

TheWitness commented 5 months ago

Long explanation. Glad you read through it. I'll do some reviews later this week and get back with you. I've been away from Cacti for a little over a month, and I need to get this 1.8 release ready to go. So, it'll happen this week. Thanks for your detailed analysis.

tddcacti commented 5 months ago

No problem @TheWitness, thanks for looking into it.

TheWitness commented 5 months ago

I spent a boat load of time developing the 'new' Baseline Deviation methods. It was intense work, but I think it's right.

tddcacti commented 5 months ago

Sounds good. Did you mean Baseline Deviation or Time Based?

tddcacti commented 4 months ago

@TheWitness any update on this?

TheWitness commented 4 months ago

Sorry @tddcacti, it's been a difficult beginning to the year. I've got too many distractions around family health issues. I even tried to do some things today. It's hard when people are in the hospital.

tddcacti commented 4 months ago

Sorry to hear that @TheWitness , that sounds difficult. The health of your family certainly takes priority! I hope things improve for you.

TheWitness commented 3 months ago

@tddcacti, Looking into this today. I can confirm in the latest develop, the time based tholds are broken for Warning level. Checking now.

tddcacti commented 2 months ago

@TheWitness thanks for your work on this. I took a look at the latest code and unfortunately it looks like the issues are not actually resolved:

I'm still seeing a "Returned to Normal" occur after just a single breach. I'm still testing a scenario where I'm looking for at least 3 breaches in the breach window--here is an example from my thold log:

[screenshot: thold log excerpt]

I believe this is happening because we are setting thold_alert (alertstat) to STAT_HI or STAT_LO based on the current value of the graph rather than basing it on the number of breaches exceeding the trigger within the breach window.

Overall, looking more at the code, I think the structure needs to change to accommodate the nature of time-based thresholds. The current structure is too dependent on the current value of the graph, and I think there are additional states to be accounted for.

As far as states are concerned, the code currently has these states:

    define('ST_RESTORAL', 0);  // Restoral

Not sure whether this is relevant for time-based thresholds, as I note that the current time-based threshold code never logs this state. It seems to only be used for baseline thresholds?

    define('ST_TRIGGERA', 1);  // Trigger Alert

Current value is above the high alert threshold or below the low alert threshold, and we are not sending the first alert notification nor are we sending a re-alert notification

    define('ST_NOTIFYRA', 2);  // Notify Alert Retrigger

Current value is above the high alert threshold or below the low alert threshold, and we are sending a re-alert notification

    define('ST_NOTIFYWA', 3);  // Notify Warning

Current value is above the high warning threshold or below the low warning threshold (but not above or below the alert thresholds), and we are sending the first warning notification

    define('ST_NOTIFYAL', 4);  // Notify Alert

Current value is above the high alert threshold or below the low alert threshold, and we are sending the first alert notification

    define('ST_NOTIFYRS', 5);  // Notify Restoral

Current value is normal and we have just come out of the warning or alert state

Note that the current time-based threshold code logs this state twice. Should one of those be ST_RESTORAL or ST_NOTIFYAW instead?

    define('ST_TRIGGERW', 6);  // Trigger Warning

Current value is above the high warning threshold or below the low warning threshold (but not above or below the alert thresholds), and we are not sending the first warning notification nor are we sending a re-warning notification

    define('ST_NOTIFYAW', 7);  // Notify Restoral to Warning

Not sure whether this is relevant for time-based thresholds, as I note that the current time-based threshold code never logs this state. It seems to only be used for hi/low thresholds?

    define('ST_NOTIFYRAW', 8); // Notify Warning Retrigger

Current value is above the high warning threshold or below the low warning threshold (but not above or below the alert thresholds), and we are sending a re-warning notification

The above state codes are all that currently exist, but I believe there are other states which need to exist. I've made up names for these just for the sake of this explanation; the names can be whatever makes the most sense for the naming convention (a sketch of possible defines follows the descriptions below):

ST_NORMAL_NOTIFYRA Current value is normal, but within the alert breach window there are still enough alert breaches to exceed the alert breach trigger, and it is time to send a re-alert notification. Similar to ST_NOTIFYRA but current value NORMAL

ST_NORMAL_NOTIFYRAW Current value is normal, but within the warning breach window there are still enough warning breaches to exceed the warning breach trigger, and it is time to send a re-warning notification. Similar to ST_NOTIFYRAW but current value NORMAL

ST_NORMAL_NOTIFYAW Current value is normal, but within the warning breach window there are enough warning breaches to exceed the warning breach trigger, and we have just come down from alert state. Need to send "return to warning" notification. Similar to ST_NOTIFYAW but current value NORMAL

ST_WARNING_NOTIFYRA Current value is warning, but within the alert breach window there are enough alert breaches to exceed the alert breach trigger, and it is time to send a re-alert notification. Similar to ST_NOTIFYRA but current value WARNING

ST_ALERT_NOTIFYRAW Current value is alert (but we have not exceeded the alert trigger within the alert breach window), and within the warning breach window there are still enough warning breaches to exceed the warning breach trigger, and it is time to send a re-warning notification. Similar to ST_NOTIFYRAW but current value ALERT
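If it helps, the new states could simply be additional constants alongside the existing ones; the numeric values below are placeholders, not a proposal for the actual values:

    define('ST_NORMAL_NOTIFYRA',  9);   // Re-alert while current value is normal
    define('ST_NORMAL_NOTIFYRAW', 10);  // Re-warning while current value is normal
    define('ST_NORMAL_NOTIFYAW',  11);  // Alert-to-warning restoral while current value is normal
    define('ST_WARNING_NOTIFYRA', 12);  // Re-alert while current value is warning
    define('ST_ALERT_NOTIFYRAW',  13);  // Re-warning while current value is alert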

With those new states in mind, as well as what I mentioned about the alertstat value, here is a summarized version of the current code:

  1. Retrieve rrd_step, alert breach trigger value, alert breach window, warning breach trigger value, warning breach window, alert failure count, and warning failure count
  2. If current value is an alert breach
    • If we need to send initial alert notification, send it and log ST_NOTIFYAL
    • Elseif we need to send re-alert notification, send it and log ST_NOTIFYRA
    • Else log ST_TRIGGERA
    • Set thold_alert (alertstat) to STAT_HI or STAT_LO
  3. Else If current value is a warning breach
    • If we need to send initial warning notification, send it and log ST_NOTIFYWA
    • Elseif we need to send re-warning notification, send it and log ST_NOTIFYRAW
    • Elseif we have just come down from alert to warning, send return to warning and log ST_NOTIFYRAW (should it actually log ST_NOTIFYAW?)
    • Else log ST_TRIGGERW
    • Set thold_alert (alertstat) to STAT_HI or STAT_LO
  4. Else (current value normal) (line 3333)
    • If we have just returned from warning, send return to normal and log ST_NOTIFYRS
      • Set thold_alert (alertstat) to zero
    • Elseif we have just returned from alert, send return to normal and log ST_NOTIFYRS
      • Set thold_alert (alertstat) to zero
    • Else no notification or log needed

And now I will share a summarized version of possible revisions which de-couple the logic from the current value and make use of the new states (a rough code sketch follows the list):

  1. Retrieve rrd_step, alert breach trigger value, alert breach window, warning breach trigger value, warning breach window, alert failure count, and warning failure count
  2. If current value is an alert breach, increment alert failure count
  3. Elseif current value is a warning breach, increment warning failure count
  4. If alert failure count greater than alert trigger value:
    • If we need to send initial alert notification, send it and log ST_NOTIFYAL
    • Elseif we need to send re-alert notification, send it and log
      • ST_NOTIFYRA if current value is alert
      • ST_NORMAL_NOTIFYRA if current value is normal
      • ST_WARNING_NOTIFYRA if current value is warning
    • Else log:
      • ST_TRIGGERA if current value is alert
      • ST_TRIGGERW if current value is warning
      • Nothing if current value is normal
    • Set thold_alert (alertstat) to STAT_HI or STAT_LO
  5. Elseif warning failure count greater than warning trigger value:
    • If we need to send initial warning notification, send it and log ST_NOTIFYWA
    • Elseif we have just returned to warning from alert, send return to warning and log:
      • ST_NOTIFYAW if current value is warning
      • ST_NORMAL_NOTIFYAW if current value is normal
    • Elseif we need to send re-warning notification, send it and log
      • ST_ALERT_NOTIFYRAW if current value is alert
      • ST_NORMAL_NOTIFYRAW if current value is normal
      • ST_NOTIFYRAW if current value is warning
    • Else log:
      • ST_TRIGGERW if current value is warning
      • Nothing if current value is normal
    • Set thold_alert (alertstat) to STAT_HI or STAT_LO
  6. Else
    • If we’ve returned to normal in this cycle, send return to normal and log ST_NOTIFYRS
      • Set thold_alert (alertstat) to zero
    • Else log
      • ST_TRIGGERA if current value is alert
      • ST_TRIGGERW if current value is warning
      • Nothing if current value is normal
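And here is the same revised flow compressed into a PHP sketch. It is illustrative only: every input is assumed to have already been derived from the thold log table and the threshold settings, the helper simply returns the state to log (null for nothing), sending notifications and updating thold_alert are omitted, and I have written the trigger comparison as >= to match the behaviour described at the top of the issue:

    // Hypothetical sketch only; not a patch against thold_functions.php.
    function time_based_state($is_alert, $is_warn,
            $alert_failures, $alert_trigger, $warn_failures, $warn_trigger,
            $first_alert_due, $realert_due, $first_warn_due, $rewarn_due,
            $returned_from_alert, $returned_to_normal) {

        // Steps 2-3: count the current sample against the breach windows.
        if ($is_alert) {
            $alert_failures++;
        } elseif ($is_warn) {
            $warn_failures++;
        }

        // Step 4: the alert trigger has been reached inside the alert window.
        if ($alert_failures >= $alert_trigger) {
            if ($first_alert_due) {
                return 'ST_NOTIFYAL';
            } elseif ($realert_due) {
                return $is_alert ? 'ST_NOTIFYRA' : ($is_warn ? 'ST_WARNING_NOTIFYRA' : 'ST_NORMAL_NOTIFYRA');
            }
            return $is_alert ? 'ST_TRIGGERA' : ($is_warn ? 'ST_TRIGGERW' : null);
        }

        // Step 5: the warning trigger has been reached inside the warning window.
        if ($warn_failures >= $warn_trigger) {
            if ($first_warn_due) {
                return 'ST_NOTIFYWA';
            } elseif ($returned_from_alert) {
                return $is_warn ? 'ST_NOTIFYAW' : 'ST_NORMAL_NOTIFYAW';
            } elseif ($rewarn_due) {
                return $is_alert ? 'ST_ALERT_NOTIFYRAW' : ($is_warn ? 'ST_NOTIFYRAW' : 'ST_NORMAL_NOTIFYRAW');
            }
            return $is_warn ? 'ST_TRIGGERW' : null;
        }

        // Step 6: neither trigger has been reached.
        if ($returned_to_normal) {
            return 'ST_NOTIFYRS';
        }
        return $is_alert ? 'ST_TRIGGERA' : ($is_warn ? 'ST_TRIGGERW' : null);
    }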

TheWitness commented 2 months ago

@tddcacti, it's much simpler and done in a way that does not have to check the rrdfile. Time based thresholds only count the entries in the log table that record a breach condition.

Pulling from the rrdfiles exclusively does not scale too well, but better than 20 years ago for sure.

I must have broken it in my last update. Let me check.

tddcacti commented 2 months ago

@TheWitness understood, yes I can see how the entries in the thold log table are being used. I wasn't trying to suggest pulling from the rrdfiles; instead, I'm suggesting that the thold log table needs some additional options for the status value, to account for all possible time-based situations. I'm also suggesting that the logic of time-based thresholds needs to be less dependent on the current value of the graph ($thold_data['lastread']) and more dependent on the entries in the thold log table.
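Roughly, what I have in mind is something along these lines (a placeholder sketch: the table and column names are made up, not the real thold schema, and only the query helper is Cacti's):

    // Placeholder sketch only: table/column names are invented, and the
    // surrounding values would come from the threshold's configuration.
    $thold_id            = 1;
    $alert_trigger       = 3;   // breaches required inside the window
    $alert_breach_window = 600; // seconds (10 minutes)
    $window_start        = time() - $alert_breach_window;

    // Count the breach entries already written to the (hypothetical) log table
    // inside the window, using Cacti's db_fetch_cell_prepared() helper.
    $alert_breaches = db_fetch_cell_prepared(
        'SELECT COUNT(*) FROM my_thold_breach_log
         WHERE thold_id = ? AND status = "alert" AND time >= ?',
        array($thold_id, $window_start));

    if ($alert_breaches >= $alert_trigger) {
        // enter or remain in the alert state, even if lastread is normal right now
    }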