R4scal opened this issue 4 years ago
Hi, after 7 days there is still no answer, but the problem is quite critical for us: if we get a lot of errors at the same time (several thousand), we keep receiving alerts for them for another 2 hours afterwards. I do not see any errors in the logs of the Graylog master node.
And worst of all, a large number of errors in a single stream and rule affects event processing in general; all notifications slow down.
@R4scal The events/alerts system does not currently support event de-duplication. So if your "execute search every" is smaller than your "search within", you will always get new events created for the same log messages. Using different values for these settings can make sense for aggregation, but for a pure filter you usually want both values to be the same.
Does that help?
@bernd thanks for the clarification, that explains why the number of events is greater than the number of messages. But why does a huge number of events on one stream and rule block notifications for other rules and streams for several hours?
I see that while Graylog is trying to send the "outdated" events, it does not pick up new ones. It looks like an overflow of some queue, but I don't understand how to track its size or clear it, because I don't see anything about it in the documentation.
The expected behavior with the setting message backlog: 3 is to send a notification with 3 messages and drop the rest.
Yesterday at 5 a.m. we had an incident on one of the services that led to the creation of 550 thousand events. Until 2 p.m. we kept receiving notifications for these events. Since 5 a.m., alerts for all other services have not worked: despite errors being present in the logs, no events are created.
Mongo uses 1 CPU and mongotop shows:
ns total read write 2019-10-10T07:29:16Z
graylog.scheduler_triggers 1164ms 0ms 1164ms
graylog.index_ranges 4ms 4ms 0ms
local.oplog.rs 3ms 3ms 0ms
graylog.nodes 1ms 0ms 1ms
graylog.processing_status 1ms 0ms 1ms
admin.system.keys 0ms 0ms 0ms
admin.system.roles 0ms 0ms 0ms
admin.system.version 0ms 0ms 0ms
config.system.sessions 0ms 0ms 0ms
config.transactions 0ms 0ms 0ms
db.scheduler_triggers.count()
532793
So, in my opinion, event processing is being slowed down by the inserts into Mongo.
In addition, the number of events for 5 a.m. yesterday keeps growing in the All Events stream. How can I stop processing these no longer relevant events? Just by removing the alert?
@R4scal I think what happens in your case is that you receive a spike of messages and your Graylog setup needs some time to process all the messages from the journal. The alerting system can detect that you have lots of messages buffered in the journal and waits to check for alerts until the messages are processed. That would explain the 2 hour delay in your setup.
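If it helps, the per-node journal backlog that event processing waits on can also be checked directly from the mongo shell; just a quick sketch, assuming your state is kept in the processing_status collection as in a standard 3.1 setup:
// show the buffered (uncommitted) journal entries per node
db.processing_status.find({}, {"node_id": 1, "input_journal.uncommitted_entries": 1}).pretty();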
Please check the journal sizes of your setup and let us know if the journal size is high during these incidents. Thanks!
@bernd I attached a screenshot of journal utilization on all 5 Graylog servers for the last 2 days
And the current alerts view
The screenshot shows that no new alerts have been created since 5 a.m. yesterday.
But if you check the rules, the picture will be like this:
So, how can I make the alerts work again?
@bernd
Over the past day, alert processing has advanced by only one minute: the time in the screenshot is now 2019-10-09 05:20:30 instead of 2019-10-09 05:19:34
Mongo still constantly utilizes one of the four CPUs at 100%
db.scheduler_triggers.count()
485667
I still have the same question: what can be done to stop receiving notifications about things that happened 2 days ago and get the alerts working again?
Just in case, I'll clarify that the problem is only with alerts; new events are visible in the streams via search
Same issue here. I noticed it after the 2.4 > 3.1.2 upgrade, and my few alerts were recreated from scratch (no legacy ones).
The alerts are generated, but not on time. At random times I do see alerts, although they should have been created 1 minute after an event came in.
@bernd ping
@R4scal You seem to have a lot of scheduler triggers in the database. The triggers usually get cleaned up after one day. Can you please show us the output of the following MongoDB commands:
db.scheduler_triggers.aggregate([{"$group":{"_id": "$data.type", "count": {"$sum": 1}, "status": {"$addToSet": "$status"}}}]);
db.processing_status.find().pretty();
db.scheduler_triggers.find({"data.type": "event-processor-execution-v1"}, {"data": 1}).pretty();
Thanks!
rs0:PRIMARY> db.scheduler_triggers.aggregate([{"$group":{"_id": "$data.type", "count": {"$sum": 1}, "status": {"$addToSet": "$status"}}}]);
{ "_id" : "notification-execution-v1", "count" : 281037, "status" : [ "complete", "runnable" ] }
{ "_id" : "event-processor-execution-v1", "count" : 6, "status" : [ "runnable" ] }
rs0:PRIMARY> db.processing_status.find().pretty();
{
"_id" : ObjectId("5d8dafde15b695068aeed769"),
"node_id" : "f75d14b4-1fc6-406a-903d-bd18686c3c9e",
"node_lifecycle_status" : "RUNNING",
"updated_at" : ISODate("2019-10-26T09:08:22.039Z"),
"receive_times" : {
"ingest" : ISODate("2019-10-26T09:08:22.038Z"),
"post_processing" : ISODate("2019-10-26T09:08:22.038Z"),
"post_indexing" : ISODate("2019-10-26T09:08:21.640Z")
},
"input_journal" : {
"uncommitted_entries" : NumberLong(452),
"read_messages_1m_rate" : 1374.416729101512,
"written_messages_1m_rate" : 1374.4153110404084
}
}
{
"_id" : ObjectId("5d8db12115b695068af06a21"),
"node_id" : "301c4583-aada-47e0-8aec-10c450b65f15",
"node_lifecycle_status" : "RUNNING",
"updated_at" : ISODate("2019-10-26T09:08:21.724Z"),
"receive_times" : {
"ingest" : ISODate("2019-10-26T09:08:21.722Z"),
"post_processing" : ISODate("2019-10-26T09:08:21.722Z"),
"post_indexing" : ISODate("2019-10-26T09:08:21.327Z")
},
"input_journal" : {
"uncommitted_entries" : NumberLong(378),
"read_messages_1m_rate" : 2983.908978097082,
"written_messages_1m_rate" : 2983.906193823483
}
}
{
"_id" : ObjectId("5d8db15115b695068af093e8"),
"node_id" : "ad8c8c07-ae70-4e64-87e7-9cf88d7806cc",
"node_lifecycle_status" : "RUNNING",
"updated_at" : ISODate("2019-10-26T09:08:22.327Z"),
"receive_times" : {
"ingest" : ISODate("2019-10-26T09:08:22.323Z"),
"post_processing" : ISODate("2019-10-26T09:08:22.323Z"),
"post_indexing" : ISODate("2019-10-26T09:08:22.202Z")
},
"input_journal" : {
"uncommitted_entries" : NumberLong(257),
"read_messages_1m_rate" : 3554.0791279972864,
"written_messages_1m_rate" : 3554.0741012256744
}
}
{
"_id" : ObjectId("5d8db17a15b695068af0c07b"),
"node_id" : "87f961f5-6c11-47ff-a498-53e3870a03e6",
"node_lifecycle_status" : "RUNNING",
"updated_at" : ISODate("2019-10-26T09:08:22.374Z"),
"receive_times" : {
"ingest" : ISODate("2019-10-26T09:08:22.373Z"),
"post_processing" : ISODate("2019-10-26T09:08:22.372Z"),
"post_indexing" : ISODate("2019-10-26T09:08:22.158Z")
},
"input_journal" : {
"uncommitted_entries" : NumberLong(318),
"read_messages_1m_rate" : 2148.4099204236168,
"written_messages_1m_rate" : 2148.4076110791098
}
}
{
"_id" : ObjectId("5d8db1ab15b695068af0ef10"),
"node_id" : "b5ae3ee9-c62a-405a-9534-f4b4667282a7",
"node_lifecycle_status" : "RUNNING",
"updated_at" : ISODate("2019-10-26T09:08:22.327Z"),
"receive_times" : {
"ingest" : ISODate("2019-10-26T09:08:22.325Z"),
"post_processing" : ISODate("2019-10-26T09:08:22.325Z"),
"post_indexing" : ISODate("2019-10-26T09:08:22.124Z")
},
"input_journal" : {
"uncommitted_entries" : NumberLong(325),
"read_messages_1m_rate" : 1499.1910916288716,
"written_messages_1m_rate" : 1499.206341139004
}
}
rs0:PRIMARY> db.scheduler_triggers.find({"data.type": "event-processor-execution-v1"}, {"data": 1}).pretty();
{
"_id" : ObjectId("5d90ff5d89737174d380623f"),
"data" : {
"type" : "event-processor-execution-v1",
"timerange_from" : ISODate("2019-10-20T00:00:15.968Z"),
"timerange_to" : ISODate("2019-10-20T00:03:15.967Z")
}
}
{
"_id" : ObjectId("5d910ddf0632f15e37f292a6"),
"data" : {
"type" : "event-processor-execution-v1",
"timerange_from" : ISODate("2019-10-20T00:00:39.690Z"),
"timerange_to" : ISODate("2019-10-20T00:03:39.689Z")
}
}
{
"_id" : ObjectId("5d910f124bfbad651e02aa07"),
"data" : {
"type" : "event-processor-execution-v1",
"timerange_from" : ISODate("2019-10-20T00:00:46.218Z"),
"timerange_to" : ISODate("2019-10-20T00:03:46.217Z")
}
}
{
"_id" : ObjectId("5d91daa4b17cd514700331e1"),
"data" : {
"type" : "event-processor-execution-v1",
"timerange_from" : ISODate("2019-10-20T00:01:13.749Z"),
"timerange_to" : ISODate("2019-10-20T00:04:13.748Z")
}
}
{
"_id" : ObjectId("5d91db49abe3d19cc521d067"),
"data" : {
"type" : "event-processor-execution-v1",
"timerange_from" : ISODate("2019-10-20T00:01:02.809Z"),
"timerange_to" : ISODate("2019-10-20T00:04:02.808Z")
}
}
{
"_id" : ObjectId("5d9200644bfbad4c9fde7900"),
"data" : {
"type" : "event-processor-execution-v1",
"timerange_from" : ISODate("2019-10-19T23:59:15.261Z"),
"timerange_to" : ISODate("2019-10-20T00:02:15.260Z")
}
}
@R4scal Thanks for the info!
Your Graylog setup is basically still catching up with processing alerts. The last output shows that it's currently processing the data around the date "2019-10-20". We will implement some improvements in the future to make this catch-up process faster. Until then, you could temporarily remove the notifications until the system has caught up, to avoid sending a lot of notifications.
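If you want to watch the catch-up progress, you can keep an eye on the timeranges of the pending event processor triggers, e.g. with something like this (a rough sketch based on the fields shown in your output above):
// oldest pending event-processor timerange; this date should move forward as the backlog drains
db.scheduler_triggers.find({"data.type": "event-processor-execution-v1"}, {"data.timerange_to": 1}).sort({"data.timerange_to": 1}).limit(1).pretty();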
@bernd Thanks
As a workaround, can I delete all notification triggers older than 30 minutes with a cron job?
db.scheduler_triggers.deleteMany({ "data.type" : "notification-execution-v1", "triggered_at": { $lt: new Date(ISODate().getTime() - 1000 * 60 * 30) } })
As far as I can see, the main problem is that the Message Backlog option does not work. Will you fix that in 3.2 or in 3.1?
@R4scal To make that workaround work you have to add another condition on status:
db.scheduler_triggers.deleteMany({ "data.type" : "notification-execution-v1", "status": "runnable", "triggered_at": { $lt: new Date(ISODate().getTime() - 1000 * 60 * 30) } });
Only runnable job triggers haven't sent a notification yet. Other job triggers for notification-execution-v1 that have status complete have already sent their notifications, and it doesn't make sense to delete them manually (they are cleaned up automatically on a regular basis anyway).
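If you run that cleanup, you can verify what is still pending with something like this (same fields as above):
// count notification triggers that are still waiting to run
db.scheduler_triggers.count({"data.type": "notification-execution-v1", "status": "runnable"});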
As far as I can see, the main problem is that the Message Backlog option does not work. Will you fix that in 3.2 or in 3.1?
What do you mean by it doesn't work?
The message backlog was supposed to limit the number of notifications, but it didn't, which is how we ended up in the Mongo situation described above.
@R4scal The backlog configuration option defines the number of related messages that are included in the alert notification. If it's set to 5, each notification will include a list of 5 messages that are related to the alert.
The grace period can be used to block new notifications for a certain amount of time. Say you set the grace period to 5 minutes. That means there will be no new notification for this event definition (and a specific key, if configured) if one has been sent in the last 5 minutes.
I hope that helps! :smiley:
I have a grace period of Grace Period: 30s :) How does the grace period interact with key fields? If a new alert with a unique key is created during the grace period, will a notification be sent?
@R4scal Yes, if there are two notifications with the same key within 30 seconds, only the first one will be sent. If the key is different, a new grace period is taken into account. If there is no key configured, the grace period is taken into account for all notifications for the event definition.
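In pseudo-code, the rule works roughly like this (just an illustration of the behavior described above, not the actual implementation):
// illustration only: suppress a notification if one with the same key was sent within the grace period
var gracePeriodMs = 30 * 1000;
var lastSent = {};  // key -> timestamp of the last notification that was sent

function shouldNotify(key, now) {
  var k = key || "__no_key__";  // no key configured: one grace period for the whole event definition
  if (lastSent[k] !== undefined && now - lastSent[k] < gracePeriodMs) {
    return false;  // still within the grace period for this key
  }
  lastSent[k] = now;
  return true;
}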
Yesterday we again had a surge of 600k errors per minute, and the script above did not help; you need to replace triggered_at with start_time.
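i.e. something like this (untested sketch, same filter as before but using start_time):
db.scheduler_triggers.deleteMany({ "data.type" : "notification-execution-v1", "status": "runnable", "start_time": { $lt: new Date(ISODate().getTime() - 1000 * 60 * 30) } });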
@R4scal What happened during that surge? Have you received 600k notifications? If the configuration you pasted above is still the same, you are searching for errors over the last 3 minutes every 30 seconds. This will create duplicate events. If you don't configure an aggregation, it doesn't make much sense to use different values for "search within" and "execute every". (because it will create duplicate events)
@bernd In the current config I have reduced the search window to 2 minutes, executed every 1 minute. In cases where logs are delayed (for different reasons, like async logging, a buffering proxy, and so on) we need the alert intervals to overlap. If logs are inserted with a delay and the intervals don't overlap, we miss alerts. In the old alerting system this worked perfectly.
In the new alerting system, when we have a surge of errors in one stream, processing is broken for all streams (the charts show the results of the Mongo queries given above).
Is there any progress on this issue? I'm hitting it again and again. https://community.graylog.org/t/wrong-calculation-of-next-execution-in-graylog-alerts/26850
Hi @R4scal, can you share your dashboard with me?
@bernd Is there any progress on this issue?
Hi
I created an alert with the following parameters:
(message:"error" OR message:"panic")
And I have some problems.
Expected Behavior
I expected that if there are errors in the logs, I would receive a notification with 3 messages every 30 seconds
Current Behavior
In fact, for every 1 message in the source stream I get 5 events (in the event stream), and 2 hours after the problem I was still receiving notifications about log messages from a long time ago
source stream:
All events stream:
Possible Solution
In my opinion, the number of events should not exceed the original number of messages in the logs, and in general should be less than that thanks to the grace period and message backlog options
Your Environment