cvtienhoven / graylog-plugin-aggregates

Aggregates plugin for Graylog
https://marketplace.graylog.org
GNU General Public License v3.0

Alerts not resolving #28

Closed jothoma1 closed 6 years ago

jothoma1 commented 7 years ago

Hi, I have some alerts that don't resolve and I don't know why. I currently have 27 pages of alerts, which makes the alert list quite unusable... I use Graylog 2.3.2 with Aggregates 2.1.1. Here is an example: this alert condition isn't satisfied, but in the alert timeline we see "Condition is still satisfied, alert is unresolved" (screenshot attached). Do you have any idea how I can fix this?

cvtienhoven commented 7 years ago

Hi @jothoma1, that's strange indeed; if the alert condition is still coupled to the actual rule, it should resolve the alert. Do you have access to the logging of the Graylog master node? I'm interested in lines like

2017-11-06T12:20:54.059+01:00 INFO [AggregatesAlertCondition] X found for Packet-Type=Y

Where X and Y are variables, of course. Ideally, if you could run the server in debug mode for a short while, a bit of debug logging from the plugin would be very helpful as well.

jothoma1 commented 7 years ago

Hi @cvtienhoven, yes, I have access to the Graylog master node and I do see lines like that, but the most recent one is quite old...

2017-10-24T12:12:18.388+02:00 INFO [AggregatesAlertCondition] 300 found for Packet-Type=Access-Reject

jebucha commented 7 years ago

We are also seeing the same behavior. The first batch of unresolved alerts appeared right around when I started my first node with the updated plugin (we went from 2.0.0 to 2.1.1), and another batch appeared Friday. Possibly related: we are also getting duplicate alerts for these aggregate rules. One example rule alerts on systems with 95% or more disk usage. We had deleted that rule last week to quiet down all the duplicates, but not long after I re-added it this morning we got a batch of 10 alerts for the same event/server. The odd thing is that the email alert header matched one of the unresolved alerts from Friday morning.

jebucha commented 7 years ago

@jothoma1 Have you experienced duplicate email alerts in relation to the events that aren't resolving? We have "repeat notifications" turned off for our aggregate rules, but we are getting 10x duplicate emails for some alerts following the 2.0.0 to 2.1.1 upgrade last Thursday morning.

jebucha commented 7 years ago

@cvtienhoven Is there anything I can provide to help in understanding why we're getting duplicate notifications (possibly tied to the alerts not resolving)? I'm hesitant to turn on debug logging as our nodes are fairly busy; we average around 150 million messages per day. But if that's what you need to troubleshoot, let me know. Thank you, by the way: this plugin provides functionality that is key to proper alerting on our systems and that is lacking in what Graylog natively offers.

cvtienhoven commented 6 years ago

@jebucha more logging is always better, but I can imagine that you're hesitant to set the log level to debug in production. I'm currently looking into this. In 2.1.1 I built a feature so that if someone removes an alert condition without removing the rule (i.e. an inconsistency), the plugin automatically re-creates the alert condition for that rule. It would be helpful to know whether the re-creation condition is hit in your case; do you see log lines like the following?

2017-11-09T09:24:44.449+01:00 WARN [Aggregates] Alert Condition removed for rule [rule name], re-instantiating

These are logged at WARN level, so there's no need to adjust the log level for this.

jebucha commented 6 years ago

Yes, I am seeing that entry. I have a test Graylog setup on which I can readily flip on debug logging if needed, so I've updated its Graylog and plugin versions to match production.

2017-11-09T07:43:42.234-06:00 WARN [Aggregates] Alert Condition removed for rule [CampusUpdater test jeb], re-instantiating

OK, so I just flipped on debug logging and sent a triggering event in via curl. I'm including a few prior entries in case they're helpful. In this case everything functioned as expected: the alert triggered, the notification was sent, and the alert resolved. So this did not replicate the unresolved alerts or duplicate notifications we're experiencing in production.

2017-11-09T07:43:42.234-06:00 WARN [Aggregates] Alert Condition removed for rule [CampusUpdater test jeb], re-instantiating
2017-11-09T07:46:42.234-06:00 WARN [Aggregates] Alert Condition removed for rule [CampusUpdater test jeb], re-instantiating
2017-11-09T07:46:51.717-06:00 DEBUG [AlertScannerThread] Running alert checks.
2017-11-09T07:46:51.720-06:00 DEBUG [AlertScannerThread] There are 1 streams with configured alert conditions.
2017-11-09T07:46:51.720-06:00 DEBUG [AlertScannerThread] Stream [597229785e8c1c9cf4e863d7: "CampusUpdater Errors"] has [1] configured alert conditions.
2017-11-09T07:46:51.727-06:00 DEBUG [AlertScanner] Alert condition [491ecbbf-293e-42ed-adb8-a46642c1b3be:Aggregates Alert={The same value of field 'level' occurs 1 or more times in a 1 minute interval}, stream:={597229785e8c1c9cf4e863d7: "CampusUpdater Errors"}] is not triggered and is marked as resolved. Nothing to do.
2017-11-09T07:59:51.717-06:00 DEBUG [AlertScannerThread] Running alert checks.
2017-11-09T07:59:51.719-06:00 DEBUG [AlertScannerThread] There are 1 streams with configured alert conditions.
2017-11-09T07:59:51.719-06:00 DEBUG [AlertScannerThread] Stream [597229785e8c1c9cf4e863d7: "CampusUpdater Errors"] has [1] configured alert conditions.
2017-11-09T07:59:51.730-06:00 INFO [AggregatesAlertCondition] 1 found for level=3
2017-11-09T07:59:51.744-06:00 DEBUG [AlertScanner] Alert condition [491ecbbf-293e-42ed-adb8-a46642c1b3be:Aggregates Alert={The same value of field 'level' occurs 1 or more times in a 1 minute interval}, stream:={597229785e8c1c9cf4e863d7: "CampusUpdater Errors"}] is triggered. Sending alerts.
2017-11-09T07:59:51.780-06:00 DEBUG [FormattedEmailAlertSender] Sending mail to

Also, is it by design that disabling an Aggregates rule does not remove the corresponding alert condition? Or should I create a separate issue for that?

jebucha commented 6 years ago

Sorry, I meant to include the logs from when the rule was disabled.

2017-11-09T08:05:51.717-06:00 DEBUG [AlertScannerThread] Running alert checks.
2017-11-09T08:05:51.720-06:00 DEBUG [AlertScannerThread] There are 1 streams with configured alert conditions.
2017-11-09T08:05:51.720-06:00 DEBUG [AlertScannerThread] Stream [597229785e8c1c9cf4e863d7: "CampusUpdater Errors"] has [1] configured alert conditions.
2017-11-09T08:05:51.727-06:00 INFO [AggregatesAlertCondition] 1 found for level=3
2017-11-09T08:05:51.734-06:00 DEBUG [AlertScanner] Alert condition [491ecbbf-293e-42ed-adb8-a46642c1b3be:Aggregates Alert={The same value of field 'level' occurs 1 or more times in a 1 minute interval}, stream:={597229785e8c1c9cf4e863d7: "CampusUpdater Errors"}] is triggered. Sending alerts.
2017-11-09T08:05:51.738-06:00 DEBUG [FormattedEmailAlertSender] Sending mail to

cvtienhoven commented 6 years ago

@jebucha thanks for your input. Yes, it's by design that the alert condition isn't removed when the rule is disabled, but I think I'm going to alter that behavior. Currently the plugin checks every minute whether the AlertCondition for a rule still exists (for consistency), except when the rule is disabled; see the sketch below.
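
To make that concrete, here is a minimal sketch of such a per-minute consistency check, assuming it works roughly as described above; the types and helper names are simplified stand-ins, not the plugin's actual code:

import java.util.List;
import java.util.Map;
import java.util.Set;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Minimal sketch of the per-minute consistency check described above.
// Rule and the re-creation helper are simplified stand-ins, not the
// plugin's real classes.
class ConsistencyCheckSketch {
    private static final Logger LOG = LoggerFactory.getLogger("Aggregates");

    record Rule(String name, String streamId, String alertConditionId, boolean enabled) {}

    /** conditionIdsByStream maps a stream id to the ids of its alert conditions. */
    void checkRules(List<Rule> rules, Map<String, Set<String>> conditionIdsByStream) {
        for (Rule rule : rules) {
            if (!rule.enabled()) {
                continue; // disabled rules are not checked (current behavior)
            }
            Set<String> existing =
                    conditionIdsByStream.getOrDefault(rule.streamId(), Set.of());
            if (!existing.contains(rule.alertConditionId())) {
                // This produces the WARN line quoted earlier in the thread.
                LOG.warn("Alert Condition removed for rule [{}], re-instantiating", rule.name());
                recreateAlertCondition(rule);
            }
        }
    }

    void recreateAlertCondition(Rule rule) {
        // Re-create the alert condition on the rule's stream and store
        // its new id back on the rule.
    }
}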

I'm considering removing the "disable" button entirely or, as you mentioned, removing the alert condition when it's disabled.

For now, I'm going to dive a little deeper, to be continued :)

jebucha commented 6 years ago

@cvtienhoven Sounds good, and thank you again for creating and maintaining this plugin. This functionality has definitely become mission critical for alerting on system metrics; we have > 1,000 servers out in the field for which we're pulling in metrics.

cvtienhoven commented 6 years ago

@jebucha @jothoma1 I just created a SNAPSHOT release: https://github.com/cvtienhoven/graylog-plugin-aggregates/releases/tag/2.2.0-SNAPSHOT

It would be great if you guys could give this a test drive. What I added/modified is the following:

By default, the maintenance task for resolving orphaned alerts is disabled. You can enable it by heading over to the System tab and choosing Configurations. On that page you'll find the Aggregates Plugin section, where you can update the config and set Resolve Orphaned Alerts to yes (checked). A rough code sketch of how this toggle gates the maintenance task follows the screenshot below.

[screenshot: the Aggregates Plugin section on the Configurations page]
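
Roughly, a maintenance task can read such a setting from Graylog's cluster config and bail out when it is off. In this sketch, AggregatesConfig and its accessor are hypothetical names; only ClusterConfigService is Graylog's actual cluster configuration API:

import org.graylog2.plugin.cluster.ClusterConfigService;

// Rough sketch of gating the maintenance task on the new setting.
// AggregatesConfig is a hypothetical config payload; ClusterConfigService
// is Graylog's real cluster configuration API.
public class MaintenanceGateSketch {

    // Hypothetical payload stored via System -> Configurations.
    public record AggregatesConfig(boolean resolveOrphanedAlerts) {}

    private final ClusterConfigService clusterConfigService;

    public MaintenanceGateSketch(ClusterConfigService clusterConfigService) {
        this.clusterConfigService = clusterConfigService;
    }

    public void doRun() {
        AggregatesConfig config = clusterConfigService.get(AggregatesConfig.class);
        if (config == null || !config.resolveOrphanedAlerts()) {
            return; // off by default: skip resolving orphaned alerts
        }
        // ...resolve alerts whose rule or alert condition no longer exists...
    }
}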

jebucha commented 6 years ago

@cvtienhoven I applied 2.2.0-SNAPSHOT to my test install this morning. Of the handful of events I've sent into Graylog to trigger it, I am receiving only one of the expected alerts; however, I am still seeing some unresolved alerts (which are tied to a rule). I did flip on Resolve Orphaned Alerts and cranked the purge history down to 1 hour just for testing purposes, but I still have one unresolved alert from about 4 hours ago.

cvtienhoven commented 6 years ago

@jebucha Did you create the rule with the unresolved alert before or after the upgrade to 2.2.0-SNAPSHOT? And could you perhaps provide that rule? I don't know if there's any sensitive data in it, but it would help a lot with verification. I'm currently testing with a lot of messages using the random HTTP message generator, and alerts get resolved after the input is stopped, so I'm looking for a way to reproduce your situation.

jebucha commented 6 years ago

[screenshot: aggregates_test_rule]

@cvtienhoven The rule was pre-existing, but I'll delete and recreate it along with the condition and see if I still wind up with unresolved alerts.

OK, so I deleted the rule, flipped on debug logging, recreated the rule, sent in an event that would trigger it, and left debug on until I received the email notification. Initially the alert showed as unresolved, but it does appear to have shifted to resolved about 90 seconds later.

Attachment: aggregates_behavior.txt

cvtienhoven commented 6 years ago

@jebucha That sounds like correct behavior: the alert scanner thread runs once every minute, so a small delay in resolving alerts is to be expected.
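
For context, recurring Graylog tasks like this run as a Periodical. Here is an illustrative sketch of a once-per-minute periodical; the class and its body are made up for illustration, but the overridden hooks are Periodical's real methods:

import org.graylog2.plugin.periodical.Periodical;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative once-per-minute periodical; not the actual alert scanner
// source, but the overridden hooks are Graylog's real Periodical methods.
public class AlertScanSketch extends Periodical {
    private static final Logger LOG = LoggerFactory.getLogger(AlertScanSketch.class);

    @Override
    public void doRun() {
        LOG.debug("Running alert checks.");
        // ...evaluate each configured alert condition; trigger or resolve...
    }

    @Override public boolean startOnThisNode() { return true; }
    @Override public boolean runsForever() { return false; }
    @Override public boolean stopOnGracefulShutdown() { return true; }
    @Override public boolean masterOnly() { return true; }
    @Override public boolean isDaemon() { return false; }
    @Override public int getInitialDelaySeconds() { return 0; }
    // Once per minute: resolving can lag a trigger by up to ~60 seconds.
    @Override public int getPeriodSeconds() { return 60; }
    @Override protected Logger getLogger() { return LOG; }
}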

jebucha commented 6 years ago

@cvtienhoven

Please let me know if you'd rather I open a separate issue for this, but we continue to get duplicate alerts from a given condition match, and we are up to 193 pages of unresolved alerts going back to November. I have the purge history config set to P1D and resolve orphaned alerts enabled, which has been the case since you released that option. I am also seeing the following in the log on my master Graylog node.

2018-01-22T08:18:05.860-06:00 ERROR [AggregatesMaintenance] Uncaught exception in periodical
java.lang.NullPointerException: null
at org.graylog.plugins.aggregates.maintenance.AggregatesMaintenance.resolveOrphanedAlerts(AggregatesMaintenance.java:134) ~[?:?]
at org.graylog.plugins.aggregates.maintenance.AggregatesMaintenance.doRun(AggregatesMaintenance.java:79) ~[?:?]
at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

What are your thoughts?

jothoma1 commented 6 years ago

@jebucha @cvtienhoven Same for me: I still have duplicated alerts with your new version and Graylog updated to 2.4.1 (currently 108 pages).

cvtienhoven commented 6 years ago

I think I see where that NullPointerException comes from, and I'm going to release a fix for it, as it might cause all other alert conditions and alerts to stay orphaned. It looks like the rule being evaluated has a null alertConditionId, which means it might have been created with an older version of the Aggregates plugin. I added a warning log line for that scenario so you can identify which rules need re-creation; a sketch of the guard follows.
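
A hedged sketch of what such a guard could look like; Rule, the loop, and the log message are illustrative stand-ins, not the actual AggregatesMaintenance code:

import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged sketch of the null guard: one legacy rule without a condition id
// should be warned about and skipped, not crash the whole periodical run.
// Rule and the resolve step are illustrative, not the plugin's real code.
class OrphanResolveSketch {
    private static final Logger LOG = LoggerFactory.getLogger("AggregatesMaintenance");

    record Rule(String name, String alertConditionId) {}

    void resolveOrphanedAlerts(List<Rule> rules) {
        for (Rule rule : rules) {
            if (rule.alertConditionId() == null) {
                // Rules created by an older plugin version can lack an id;
                // warn so the user knows which rules to re-create.
                LOG.warn("Rule [{}] has no alert condition id, please re-create it", rule.name());
                continue; // skip instead of throwing a NullPointerException
            }
            // ...resolve unresolved alerts for this rule's condition...
        }
    }
}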

cvtienhoven commented 6 years ago

@jothoma1 @jebucha Could you guys give version 2.2.1 a test drive please?

https://github.com/cvtienhoven/graylog-plugin-aggregates/releases/tag/2.2.1

cvtienhoven commented 6 years ago

@jebucha @jothoma1 if you're using Graylog version 2.4.x, you can pick version 2.2.2 of the plugin, which was built against the 2.4 branch:

https://github.com/cvtienhoven/graylog-plugin-aggregates/releases/tag/2.2.2

jebucha commented 6 years ago

@cvtienhoven Sounds good, I'll see when I can work the patch into a maintenance window and report back after it's been online.

jothoma1 commented 6 years ago

@cvtienhoven thanks, will try it ASAP and report back too!

sirbod2005 commented 6 years ago

I have upgraded Graylog from 2.2.3 to 2.4.3, and Aggregates plugin 1.0.1 was replaced with 2.2.2. I deleted and recreated my alerts, but none of them were resolving. I've just replaced the Aggregates plugin with 2.2.3 and they are still not resolving. Is there anything else I/we need to do to address this?

cvtienhoven commented 6 years ago

Did you configure the plugin to resolve orphaned alerts via System -> Configurations, setting Resolve Orphaned Alerts to yes?

sirbod2005 commented 6 years ago

I have now. It doesn't seem to have changed anything five minutes later; there are still loads of unresolved alerts. Can they be cleared manually somehow? I'm hoping new alerts will clear!

cvtienhoven commented 6 years ago

Do you have access to the Graylog log files, and do you see anything related in them?

cvtienhoven commented 6 years ago

If you can turn on debug logging, that would be most helpful, I guess.

sirbod2005 commented 6 years ago

Tailing /var/log/graylog/server/current, I see lots of:

[AggregatesMaintenance] Remove 0 history items
[AggregatesMaintenance] Removing Aggregate Alert Conditions that don't have associated rule
[AggregatesMaintenance] Resolving unresolved Aggregate Alerts that don't have associated rule

I'll look further back in the logs to see if there are any errors.

sirbod2005 commented 6 years ago

I've checked through the log and can find no errors. The unresolved alerts were generated last night on 2.2.2; today I updated to 2.2.3 and restarted the OS, if that's of any help. How do I enable debugging? Are there any other logs I can look at?

sirbod2005 commented 6 years ago

I can confirm that new alerts clear now that I've moved to 2.2.3; it's just the backlog from 2.2.2 I need to get rid of. Is there a sensible way of doing this other than orphaning them by recreating the aggregate rules?

jrvn commented 6 years ago

@sirbod2005 If you need to mark an unresolved alert as resolved, open the detail view of that alert -> Condition details -> Edit & Save. No modification is needed, only a save. After editing the condition details, all unresolved (multiple) alerts triggered by that condition will be marked as resolved. It has worked for me over the last few weeks as a temporary workaround for this bug.

Since version 2.2.3, this bug (#28) no longer occurs in our environment (@cvtienhoven thanks!!!). My setup is:

sirbod2005 commented 6 years ago

@jrvn Thanks for the tip. I was initially unable to save the condition, as the threshold was set to "Select Threshold Type" on all my aggregate alerts. I had to change it to "more than" to be able to save.

jrvn commented 6 years ago

@sirbod2005 You are right, I forgot about this. That's because the Aggregates plugin uses "more or equal" and "less" as its threshold types, while native alerts use "more" and "less" (with no "equal" variant). IMHO, in this small detail the Aggregates plugin's integration is not compatible with native alerts; see the sketch below.
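
To illustrate the mismatch, something like the following one-way mapping is what an integration would need; the enum names are made up for illustration:

// Illustrative sketch of the threshold mismatch: the plugin thinks in
// ">= n" and "< n", native alerts in "> n" and "< n", so ">= n" has to
// be approximated (e.g. as "> n-1" for integer counts). Enum names are
// made up for illustration.
enum AggregatesThreshold { MORE_OR_EQUAL, LESS }
enum NativeThreshold { MORE, LESS }

class ThresholdMapper {
    static NativeThreshold toNative(AggregatesThreshold t) {
        return switch (t) {
            case LESS -> NativeThreshold.LESS;
            // No ">=" in native alerts; the closest integer equivalent of
            // ">= n" is "> n-1" (the caller must also adjust the value).
            case MORE_OR_EQUAL -> NativeThreshold.MORE;
        };
    }
}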

cvtienhoven commented 6 years ago

Hey everybody, thanks for all the input on this. Today I investigated a remaining issue I had myself regarding unresolved alerts. It seems that sometimes (or every time, I don't know yet) two alerts for the same alert condition get generated with exactly the same timestamp. On the next evaluation, only one gets resolved and the other remains active. I implemented a workaround for this, so that an unresolved alert gets resolved when there's an already resolved duplicate (same stream, same condition, same timestamp); sketched below.
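
Sketched in code, the workaround amounts to resolving any unresolved alert that has a resolved twin; the Alert type here is an illustrative stand-in for the plugin's actual data, not the real implementation:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hedged sketch of the duplicate workaround: an unresolved alert is marked
// resolved when a resolved duplicate exists for the same stream, condition,
// and timestamp. Alert is an illustrative stand-in, not the real type.
class DuplicateAlertResolverSketch {

    record Alert(String streamId, String conditionId, long timestampMillis, boolean resolved) {}

    List<Alert> resolveDuplicates(List<Alert> alerts) {
        // Collect the identity keys of all already-resolved alerts.
        Set<String> resolvedKeys = new HashSet<>();
        for (Alert a : alerts) {
            if (a.resolved()) {
                resolvedKeys.add(key(a));
            }
        }
        // Mark any unresolved alert resolved if a resolved twin exists.
        return alerts.stream()
                .map(a -> !a.resolved() && resolvedKeys.contains(key(a))
                        ? new Alert(a.streamId(), a.conditionId(), a.timestampMillis(), true)
                        : a)
                .toList();
    }

    private String key(Alert a) {
        return a.streamId() + "|" + a.conditionId() + "|" + a.timestampMillis();
    }
}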

When I've got some spare time, I'm going to investigate why those duplicates exist in the first place, but for now you can try version 2.2.4.

I'm going to close this issue now. If you feel there's still a problem that needs attention, feel free to file a fresh issue.