Closed tddcacti closed 3 months ago
Long explanation. Glad you read through it. I do some reviews later this week and get back with you. I've been away from Cacti for a little over a month, and I need to get this 1.8 release ready to go. So, it'll happen this week. Thanks for your detailed analysis.
No problem @TheWitness, thanks for looking into it.
I spent a boat load of time developing the 'new' Baseline Deviation methods. It was intense work, but I think it's right.
Sounds good. Did you mean Baseline Deviation or Time Based?
@TheWitness any update on this?
Sorry @tddcacti , it's been a difficult beginning to the year. I've got too many distractions around family health issues. I tried even today to do some things. It hard when people are in hospital.
Sorry to hear that @TheWitness , that sounds difficult. The health of your family certainly takes priority! I hope things improve for you.
@tddcacti, Looking into this today. I can confirm in the latest develop, the time based tholds are broken for Warning level. Checking now.
@TheWitness thanks for your work on this. I took a look at the latest code and unfortunately it looks like the issues are not actually resolved:
I'm still seeing a "Returned to Normal" occur after just a single breach. I'm still testing a scenario where I'm looking for at least 3 breaches in the breach window--here is an example from my thold log:
I believe this is happening because we are setting thold_alert (alertstat) to STAT_HI or STAT_LO based on the current value of the graph rather than basing it on the number of breaches exceeding the trigger within the breach window.
Overall, looking more at the code, I think the structure needs to change to accommodate the nature of time-based thresholds. The current structure is too dependent on the current value of the graph, and I think there are additional states to be accounted for.
As far as states are concerned, the code currently has these states:
define('ST_RESTORAL', 0); // Restoral
Not sure how whether this is relevant for time based thresholds, as I note that the current time-based threshold code never logs this state. Seems to only be used for baseline thresholds?
define('ST_TRIGGERA', 1); // Trigger Alert
Current value is above the high alert threshold or below the low alert threshold, and we are not sending the first alert notification nor are we sending a re-alert notification
define('ST_NOTIFYRA', 2); // Notify Alert Retrigger
Current value is above the high alert threshold or below the low alert threshold, and we are sending a re-alert notification
define('ST_NOTIFYWA', 3); // Notify Warning
Current value is above the high warning threshold or below the low warning threshold (but not above or below the alert thresholds), and we are sending the first warning notification
define('ST_NOTIFYAL', 4); // Notify Alert
Current value is above the high alert threshold or below the low alert threshold, and we are sending the first alert notification
define('ST_NOTIFYRS', 5); // Notify Restoral
Current value is normal and we have just come out of the warning or alert state
Note that the current time-based threshold code logs this state twice. Should one of those be ST_RESTORAL or ST_NOTIFYAW instead?
define('ST_TRIGGERW', 6); // Trigger Warning
Current value is above the high warning threshold or below the low warning threshold (but not above or below the alert thresholds), and we are not sending the first warning notification nor are we sending a re-warning notification
define('ST_NOTIFYAW', 7); // Notify Restoral to Warning
Not sure how whether this is relevant for time based thresholds, as I note that the current time-based threshold code never logs this state. Seems to only be used for hi/low thresholds?
define('ST_NOTIFYRAW', 8); // Notify Warning Retrigger
Current value is above the high warning threshold or below the low warning threshold (but not above or below the alert thresholds), and we are sending a re-warning notification
The above state codes are all that currently exist, but I beleive there are other states which need to exist. I've made up names for these just for the sake of this explanation, the names can be whatever makes the most sense for the naming convention:
ST_NORMAL_NOTIFYRA Current value is normal, but within the alert breach window there are still enough alert breaches to exceed the alert breach trigger, and it is time to send a re-alert notification. Similar to ST_NOTIFYRA but current value NORMAL
ST_NORMAL_NOTIFYRAW Current value is normal, but within the warning breach window there are still enough warning breaches to exceed the warning breach trigger, and it is time to send a re-warning notification. Similar to ST_NOTIFYRAW but current value NORMAL
ST_NORMAL_NOTIFYAW Current value is normal, but within the warning breach window there are enough warning breaches to exceed the warning breach trigger, and we have just come down from alert state. Need to send "return to warning" notification. Similar to ST_NOTIFYAW but current value NORMAL
ST_WARNING_NOTIFYRA Current value is warning, but within the alert breach window there are enough alert breaches to exceed the alert breach trigger, and it is time to send a re-alert notification. Similar to ST_NOTIFYRA but current value WARNING
ST_ALERT_NOTIFYRAW Current value is alert (but we have not exceeded the alert trigger within the alert breach window), and within the warning breach window there are still enough warning breaches to exceed the warning breach trigger , and it is time to send a re-warning notification. Similar to ST_NOTIFYRAW but current value ALERT
With those new states in mind, as well as what I mentioned about the alertstat value, here is a summarized version of the current code:
And now I will share a summarized revision of possible revisions which de-couple the logic from the current value and make use of the new statuses:
@tddcacti, it's much simpler and done in a way that does not bother to have to check the rrdfile. Time based only counts the entries in the log table of a breach condition.
Pulling from the rrdfiles exclusively does not scale too well, but better than 20 years ago for sure.
I must have broke it in my last update. Let me check.
@TheWitness understood, yes I can see how the entries in the thold log table are being used. I wasn't trying to suggest pulling from the rrdfiles, instead I'm suggesting that the thold log table needs some additional options for the status value, to account for all possible time-based situations. Also suggesting that the logic of time-based needs to be less dependent on the current value of the graph ($thold_data['lastread']), and more dependent on the entries in the thold log table.
Reposting from https://forums.cacti.net/viewtopic.php?t=62936&sid=88b9f75053e0cbc257a6d3f31330e47c
I'm trying to set up Time Based Thresholds and finding some unexpected behavior.
First, it appears that Cacti will send a "NORMAL" alert after a single breach, rather than waiting for the measured value to go above or below the High / Low values so many times in the measurement window. To illustrate:
I've created a threshold template against the "Unix - Logged in Users" data source, as shown below:
I'm using a 1 minute polling cycle.
If I understand things correctly, this threshold should enter the Warning state only if there are three polling cycles within 10 minutes where the number of logged in users is greater than 1. And it should enter the Alert state only if there are three polling cycles within 10 minutes where the number of logged in users is greater than 3. In both cases the breaches don't need to be consecutive, they could be spread throughout the 10 minute window. And I should only ever get the "NORMAL" alert after either of those conditions has happened.
However I find that if I log into the server twice, so that the number of logged in users is 2, just long enough for 1 polling cycle to catch that number, and then I log out of both sessions, the next polling cycle sees 0 logged in users and a "NORMAL" alert is generated. No WARNING alert was generated, since the condition was not true for 3 out of the last 10 polling cycles.
I've tried this both with my production instance running Cacti 1.2.23 and Thold 1.5.2, as well as a test instance running 1.2.25 and the latest develop branch of Thold 1.8.
Looking at the code in thold_function.php, within the latest develop branch, I think I see the reason why this is happening. On lines 3311 and 3380 there are IF statements which look at whether the variable "alertstat" is nonzero. That variable gets set on line 2165 to the value of thold_alert within the thold_data database. That database value gets set in the previous polling cycle--if the previous polling cycle was breach, then thold_alert gets set to either STAT_HI or STAT_LO on line 2999, and saved into the database on line 3146. If the previous cycle value was warning breach, the same thing happens on lines 3151 and 3305.
I'm thinking that the issue here is that thold_alert is being set based on the current value, when it should instead be set based on whether the number of failures / warning_failures exceed the trigger / warning_trigger. Does this make sense or am I missing something?
The second thing I noticed, while reviewing the code for that first item, is that the code which sends ALERT / RE-ALERT messages is contained within the IF statement on line 2994 (if ($breach_up || $breach_down)). This means that alert messages will only be sent if the current value of the threshold is breach. While this might work fine for the initial alert, since we would need the current polling cycle to get us over the trigger level, I believe it could lead to situations where the re-alert is not sent, for example:
If the current value of a graph is not breach, but there are enough breaches in the measurement window to exceed the trigger, and we are at the re-alert cycle time: a re-alert should be sent, but I believe it would not be sent, since the current value is not breach. If this is correct, then the logic around sending re-alerts might need to be moved outside of this IF statement. Same would be true for warning re-alerts.
The third item is more of a question about the intended functionality. Because I've been using the "Unix - Logged in Users" data source for testing, I've been dealing with very controllable whole numbers, which might be unusual compared with most graphs in Cacti (part of the reason I chose it for testing). I noticed that if the value of the graph is exactly the same as the threshold level--for example if my threshold is 1 logged in user, and I log in exactly once to the server--the threshold is not considered breached.
This appears to be based on the fact that in the code, for example lines 2948-2951, we are using "less than" or "greater than" and not "less than or equal to" or "greater than or equal to". I just wanted to confirm this is the intended functionality? If so then in my test case I could simply make my threshold 0.9 instead of 1.