Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Strange work of alerts in Graylog 3.1 #6528

Open R4scal opened 4 years ago

R4scal commented 4 years ago

Hi

I created an alert with the following parameters:

And I ran into some problems.

Expected Behavior

It was expected that, if there are errors in the logs, I would receive a notification every 30 seconds containing 3 messages.

Current Behavior

In fact, for every single message in the source stream I get 5 events (in the events stream), and even 2 hours after the problem I was still receiving notifications about log messages from a long time ago.

source stream:

Screenshot 2019-10-02 at 18 08 25

All events stream:

Screenshot 2019-10-02 at 18 09 38

Possible Solution

In my opinion, the number of events should never exceed the number of messages in the logs; in general it should be lower, thanks to the grace period and message backlog options.

Your Environment

R4scal commented 4 years ago

Hi, after 7 days there is still no answer, but the problem is quite critical for us: if we get a lot of errors at the same time (several thousand), we keep receiving alerts for them for another 2 hours afterwards. I do not see any errors in the logs of the Graylog master node.

Most frustratingly, a large number of errors in a single stream with a single rule affects event processing in general: all notifications slow down.

bernd commented 4 years ago

@R4scal The events/alerts system currently does not support event de-duplication. So if your "execute search every" is smaller than your "search within", you will always get new events created for the same log messages. Using different values for these settings can make sense for aggregations, but for a pure filter you usually want both values to be the same.
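
To put rough numbers on it (a sketch, not Graylog code; it only illustrates the overlap, using the 3-minute window and 30-second interval discussed later in this thread, and the variable names are just for illustration):

// With overlapping windows, every message is matched by several consecutive searches.
// search_within = 180s, execute_every = 30s  =>  up to 180 / 30 = 6 events per message.
var searchWithinSec = 180;
var executeEverySec = 30;
print("events per message (worst case): " + (searchWithinSec / executeEverySec));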

Does that help?

R4scal commented 4 years ago

@bernd thanks for the clarification, this explains why the number of events is greater than the number of messages. But why does a huge number of events on one stream and rule block notifications for other rules and streams for several hours?

R4scal commented 4 years ago

I can see that while Graylog is trying to send the "outdated" events, it does not generate new ones. It looks like an overflow of some queue, but I don't understand how to track its size or clear it, because I don't see anything about it in the documentation.

R4scal commented 4 years ago

The expected behavior with message backlog: 3 is that a notification is sent with 3 messages and the rest are dropped.

R4scal commented 4 years ago

Yesterday at 5 a.m. we had an incident on one of the services that led to the creation of 550 thousand events. We kept receiving notifications for these events until 2 p.m. Since 5 a.m., alerts for all other services have not worked: despite the presence of errors in the logs, no events are created.

Mongo uses 1 CPU and mongotop shows:

                        ns     total    read     write    2019-10-10T07:29:16Z
graylog.scheduler_triggers    1164ms     0ms    1164ms
      graylog.index_ranges       4ms     4ms       0ms
            local.oplog.rs       3ms     3ms       0ms
             graylog.nodes       1ms     0ms       1ms
 graylog.processing_status       1ms     0ms       1ms
         admin.system.keys       0ms     0ms       0ms
        admin.system.roles       0ms     0ms       0ms
      admin.system.version       0ms     0ms       0ms
    config.system.sessions       0ms     0ms       0ms
       config.transactions       0ms     0ms       0ms
db.scheduler_triggers.count()
532793

So, in my opinion, event processing is slowed down by the inserts into MongoDB.

In addition, the number of events for 5 a.m. yesterday keeps growing in the All Events stream. How can I stop the processing of these no longer relevant events? By just removing the alert?

bernd commented 4 years ago

@R4scal I think what happens in your case is that you receive a spike of messages and your Graylog setup needs some time to process all the messages from the journal. The alerting system can detect that you have lots of messages buffered in the journal and delays checking for alerts until the messages are processed. That would explain the 2 hour delay in your setup.

Please check the journal sizes of your setup and let us know if the journal size is high during these incidents. Thanks!
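
For reference, the per-node journal backlog is also visible in MongoDB (a sketch; the collection and field names match the processing_status output shown further down in this thread):

// Show the number of uncommitted journal entries per Graylog node
db.processing_status.find({}, {"node_id": 1, "input_journal.uncommitted_entries": 1}).pretty();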

R4scal commented 4 years ago

@bernd I've attached a screenshot of the journal utilization on all 5 Graylog servers over the last 2 days.

Screenshot 2019-10-10 at 12 14 39

And the current alerts view

Screenshot 2019-10-10 at 12 25 53

The screenshot shows that no new alerts have been created since 5 a.m. yesterday.

But if you check the rules, the picture looks like this:

Screenshot 2019-10-10 at 12 40 57

So, how can I make alerting work again?

R4scal commented 4 years ago

@bernd
over the past day, alert processing has advanced by one minute: the time in the screenshot is now 2019-10-09 05:20:30 instead of 2019-10-09 05:19:34

Screenshot 2019-10-11 at 09 54 22

Mongo still constantly utilizes one of the four CPUs at 100%.

db.scheduler_triggers.count()
485667

I still have the same question: what can be done to stop receiving notifications about things that happened 2 days ago and to get alerting working again?

Just in case, I'll clarify that the problem is only with alerts; new events are still visible in the streams via search.

olsonnn commented 4 years ago

Same issue here. I noticed it after the 2.4 > 3.1.2 upgrade; my few alerts were recreated from scratch (no legacy ones).

The alerts are generated, but not in time. At random times I do see alerts, although they should have been created 1 minute after an event came in.

R4scal commented 4 years ago

@bernd ping

bernd commented 4 years ago

@R4scal You seem to have a lot of scheduler triggers in the database. The triggers usually get cleaned up after one day. Can you please show us the output of the following MongoDB commands:

db.scheduler_triggers.aggregate([{"$group":{"_id": "$data.type", "count": {"$sum": 1}, "status": {"$addToSet": "$status"}}}]);
db.processing_status.find().pretty();
db.scheduler_triggers.find({"data.type": "event-processor-execution-v1"}, {"data": 1}).pretty();

Thanks!

R4scal commented 4 years ago
rs0:PRIMARY> db.scheduler_triggers.aggregate([{"$group":{"_id": "$data.type", "count": {"$sum": 1}, "status": {"$addToSet": "$status"}}}]);
{ "_id" : "notification-execution-v1", "count" : 281037, "status" : [ "complete", "runnable" ] }
{ "_id" : "event-processor-execution-v1", "count" : 6, "status" : [ "runnable" ] }
rs0:PRIMARY> db.processing_status.find().pretty();
{
    "_id" : ObjectId("5d8dafde15b695068aeed769"),
    "node_id" : "f75d14b4-1fc6-406a-903d-bd18686c3c9e",
    "node_lifecycle_status" : "RUNNING",
    "updated_at" : ISODate("2019-10-26T09:08:22.039Z"),
    "receive_times" : {
        "ingest" : ISODate("2019-10-26T09:08:22.038Z"),
        "post_processing" : ISODate("2019-10-26T09:08:22.038Z"),
        "post_indexing" : ISODate("2019-10-26T09:08:21.640Z")
    },
    "input_journal" : {
        "uncommitted_entries" : NumberLong(452),
        "read_messages_1m_rate" : 1374.416729101512,
        "written_messages_1m_rate" : 1374.4153110404084
    }
}
{
    "_id" : ObjectId("5d8db12115b695068af06a21"),
    "node_id" : "301c4583-aada-47e0-8aec-10c450b65f15",
    "node_lifecycle_status" : "RUNNING",
    "updated_at" : ISODate("2019-10-26T09:08:21.724Z"),
    "receive_times" : {
        "ingest" : ISODate("2019-10-26T09:08:21.722Z"),
        "post_processing" : ISODate("2019-10-26T09:08:21.722Z"),
        "post_indexing" : ISODate("2019-10-26T09:08:21.327Z")
    },
    "input_journal" : {
        "uncommitted_entries" : NumberLong(378),
        "read_messages_1m_rate" : 2983.908978097082,
        "written_messages_1m_rate" : 2983.906193823483
    }
}
{
    "_id" : ObjectId("5d8db15115b695068af093e8"),
    "node_id" : "ad8c8c07-ae70-4e64-87e7-9cf88d7806cc",
    "node_lifecycle_status" : "RUNNING",
    "updated_at" : ISODate("2019-10-26T09:08:22.327Z"),
    "receive_times" : {
        "ingest" : ISODate("2019-10-26T09:08:22.323Z"),
        "post_processing" : ISODate("2019-10-26T09:08:22.323Z"),
        "post_indexing" : ISODate("2019-10-26T09:08:22.202Z")
    },
    "input_journal" : {
        "uncommitted_entries" : NumberLong(257),
        "read_messages_1m_rate" : 3554.0791279972864,
        "written_messages_1m_rate" : 3554.0741012256744
    }
}
{
    "_id" : ObjectId("5d8db17a15b695068af0c07b"),
    "node_id" : "87f961f5-6c11-47ff-a498-53e3870a03e6",
    "node_lifecycle_status" : "RUNNING",
    "updated_at" : ISODate("2019-10-26T09:08:22.374Z"),
    "receive_times" : {
        "ingest" : ISODate("2019-10-26T09:08:22.373Z"),
        "post_processing" : ISODate("2019-10-26T09:08:22.372Z"),
        "post_indexing" : ISODate("2019-10-26T09:08:22.158Z")
    },
    "input_journal" : {
        "uncommitted_entries" : NumberLong(318),
        "read_messages_1m_rate" : 2148.4099204236168,
        "written_messages_1m_rate" : 2148.4076110791098
    }
}
{
    "_id" : ObjectId("5d8db1ab15b695068af0ef10"),
    "node_id" : "b5ae3ee9-c62a-405a-9534-f4b4667282a7",
    "node_lifecycle_status" : "RUNNING",
    "updated_at" : ISODate("2019-10-26T09:08:22.327Z"),
    "receive_times" : {
        "ingest" : ISODate("2019-10-26T09:08:22.325Z"),
        "post_processing" : ISODate("2019-10-26T09:08:22.325Z"),
        "post_indexing" : ISODate("2019-10-26T09:08:22.124Z")
    },
    "input_journal" : {
        "uncommitted_entries" : NumberLong(325),
        "read_messages_1m_rate" : 1499.1910916288716,
        "written_messages_1m_rate" : 1499.206341139004
    }
}
rs0:PRIMARY> db.scheduler_triggers.find({"data.type": "event-processor-execution-v1"}, {"data": 1}).pretty();
{
    "_id" : ObjectId("5d90ff5d89737174d380623f"),
    "data" : {
        "type" : "event-processor-execution-v1",
        "timerange_from" : ISODate("2019-10-20T00:00:15.968Z"),
        "timerange_to" : ISODate("2019-10-20T00:03:15.967Z")
    }
}
{
    "_id" : ObjectId("5d910ddf0632f15e37f292a6"),
    "data" : {
        "type" : "event-processor-execution-v1",
        "timerange_from" : ISODate("2019-10-20T00:00:39.690Z"),
        "timerange_to" : ISODate("2019-10-20T00:03:39.689Z")
    }
}
{
    "_id" : ObjectId("5d910f124bfbad651e02aa07"),
    "data" : {
        "type" : "event-processor-execution-v1",
        "timerange_from" : ISODate("2019-10-20T00:00:46.218Z"),
        "timerange_to" : ISODate("2019-10-20T00:03:46.217Z")
    }
}
{
    "_id" : ObjectId("5d91daa4b17cd514700331e1"),
    "data" : {
        "type" : "event-processor-execution-v1",
        "timerange_from" : ISODate("2019-10-20T00:01:13.749Z"),
        "timerange_to" : ISODate("2019-10-20T00:04:13.748Z")
    }
}
{
    "_id" : ObjectId("5d91db49abe3d19cc521d067"),
    "data" : {
        "type" : "event-processor-execution-v1",
        "timerange_from" : ISODate("2019-10-20T00:01:02.809Z"),
        "timerange_to" : ISODate("2019-10-20T00:04:02.808Z")
    }
}
{
    "_id" : ObjectId("5d9200644bfbad4c9fde7900"),
    "data" : {
        "type" : "event-processor-execution-v1",
        "timerange_from" : ISODate("2019-10-19T23:59:15.261Z"),
        "timerange_to" : ISODate("2019-10-20T00:02:15.260Z")
    }
}
bernd commented 4 years ago

@R4scal Thanks for the info!

Your Graylog setup is basically catching up with processing alerts. The last output shows that it's currently processing data from around 2019-10-20. We will implement some improvements in the future to make this catch-up process faster. Until then, you could temporarily remove the notifications until the system has caught up, to avoid sending a lot of notifications.

R4scal commented 4 years ago

@bernd Thanks

As a workaround, can I delete all notifications older than 30 minutes via a cron job?

db.scheduler_triggers.deleteMany({  "data.type" : "notification-execution-v1",   "triggered_at": { $lt: new Date(ISODate().getTime() - 1000 * 60 * 30) } })

As far as I can see, the main problem is that the Message Backlog option does not work. Will you fix that in 3.2 or in 3.1?

bernd commented 4 years ago

@R4scal To make that workaround work you have to add another condition on status:

db.scheduler_triggers.deleteMany({  "data.type" : "notification-execution-v1", "status": "runnable",   "triggered_at": { $lt: new Date(ISODate().getTime() - 1000 * 60 * 30) } });

Only runnable job triggers haven't sent a notification yet. Other job triggers for notification-execution-v1 that have status complete have already sent their notifications, so it doesn't make sense to delete them manually. (They are cleaned up automatically on a regular basis anyway.)
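
For monitoring how far the backlog has drained, something along these lines should do (a sketch; the field names are the same as in the aggregation output above):

// Count notification triggers that are still waiting to be processed
db.scheduler_triggers.count({"data.type": "notification-execution-v1", "status": "runnable"});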

As far as I can see, the main problem is that the Message Backlog option does not work. Will you fix that in 3.2 or in 3.1?

What do you mean by "it does not work"?

R4scal commented 4 years ago

The message backlog was supposed to limit the number of notifications, but it didn't, which is why we ended up in the MongoDB situation described above.

bernd commented 4 years ago

@R4scal The backlog configuration option defines the number of related messages that are included in the alert notification. If it's set to 5, each notification will include a list of 5 messages related to the alert.

The grace period can be used to block new notifications for a certain amount of time. Say you set the grace period to 5 minutes. That means there will be no new notification for this event definition (and a specific key, if configured) if one has been sent in the last 5 minutes.

I hope that helps! :smiley:

R4scal commented 4 years ago

My grace period is already set to 30s :) How does the grace period interact with key fields? If a new alert with a unique key is created during the grace period, will a notification be sent?

bernd commented 4 years ago

@R4scal Yes, if there are two notifications with the same key within 30 seconds, only the first one will be sent. If the key is different, a new grace period is taken into account. If there is no key configured, the grace period is taken into account for all notifications for the event definition.

R4scal commented 4 years ago

Yesterday we again had a surge of 600k errors per minute, and the script above did not help; triggered_at needs to be replaced with start_time.
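
For anyone following along, the adjusted cleanup would then look roughly like this (a sketch combining the command above with the start_time field mentioned here; check the field names against your own scheduler_triggers documents before running it):

// Delete runnable notification triggers scheduled more than 30 minutes ago
db.scheduler_triggers.deleteMany({ "data.type": "notification-execution-v1", "status": "runnable", "start_time": { $lt: new Date(ISODate().getTime() - 1000 * 60 * 30) } });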

bernd commented 4 years ago

@R4scal What happened during that surge? Did you receive 600k notifications? If the configuration you pasted above is still the same, you are searching for errors over the last 3 minutes every 30 seconds, which creates duplicate events. If you don't configure an aggregation, it doesn't make much sense to use different values for "search within" and "execute every".

R4scal commented 4 years ago

@bernd In the current config I have reduced the search window to 2 minutes, executed every 1 minute. In cases of log delays (for various reasons, like async logging, a buffering proxy, and so on) we need the alert intervals to overlap. If logs are inserted with a delay and the intervals don't overlap, we miss alerts. In the old alert system this worked perfectly.

In the new alert system, when we have a surge of errors in one stream, processing breaks for all streams (the charts show the results of the MongoDB queries given above).

Screenshot 2019-11-08 at 10 32 03
baonq-me commented 1 year ago

Is there any progress on this issue? I'm having the same problem and it happens again and again. https://community.graylog.org/t/wrong-calculation-of-next-execution-in-graylog-alerts/26850

Hi @R4scal, can you share your dashboard with me?

baonq-me commented 1 year ago

@bernd Is there any progress on this issue?