Icinga / icingabeat

Elastic Beat fetching events & status from Icinga 2
https://icinga.com/docs/icingabeat/latest
Apache License 2.0

Possible Stale Connection Icingabeat and Icinga2 #9

Closed DMMCA closed 7 years ago

DMMCA commented 7 years ago

Hello, I want to report a possible issue with the icingabeat application. My testing environment has the following setup: an Icinga master (running Icingabeat) and clients, no zones defined, just the standard setup, plus another VM with ELK. The problem: when I left the ELK VM down for the weekend and turned it back on on Monday, I received events from Saturday with a huge delay before they were displayed in Kibana. I had left it down to test the scenario of a client losing its link (this client being in a different geo-location from the ELK server). Several hours passed and still no "LIVE EVENT STREAM" was displayed; only when I restarted icingabeat did it become "live" again.

Best Regards,

DMMCA

bobapple commented 7 years ago

@DMMCA Just to make sure I get it correct: Did the connection between icingabeat <-> Elasticsearch break or between Icinga2 <-> icingabeat?

dnsmichi commented 7 years ago

Original discussion: https://monitoring-portal.org/index.php?thread/40420-icingabeat/

bobapple commented 7 years ago

There are two things that can happen regarding connection losses:

  1. Icingabeat loses the connection to Icinga2. In this case, Icingabeat will periodically try to reconnect to Icinga2. The reconnect interval can be configured via retry_interval. Once the connection is re-established, Icinga2 continues to send events from that point on. There is no buffering; events from the past are lost. There is, however, an open feature request to add buffering to the event stream API: https://github.com/Icinga/icinga2/issues/4604
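For reference, the reconnect interval mentioned above is set in icingabeat.yml. A minimal sketch, assuming retry_interval lives under the icingabeat section of the config (the value of 10 seconds is only an example, not a recommendation):

```yaml
icingabeat:
  # Seconds to wait before retrying the connection to the Icinga2 API
  # after the event stream breaks
  retry_interval: 10
```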

  2. Icingabeat loses the connection to Elasticsearch (or any other output). As I understand it, this is what happened to you. If Icingabeat loses the connection to Elasticsearch but keeps receiving events from the Icinga2 API, these events are stored in queues. The size of the queues is configurable through queue_size and bulk_queue_size. From my understanding, your queues were big enough to store the old events, so when Elasticsearch came back up, all events in the queues were sent to it. Note that these queues are in-memory, and configuring values that are too high can fill up your memory. There is a discussion about adding a feature to libbeat to buffer events to disk: https://github.com/elastic/beats/issues/575

Hope that helps!

DMMCA commented 7 years ago

I was using the short version of the icingabeat.yml config file, not the full one. I thought the values you're talking about were being loaded by default:

-----EXTRACTED FROM FULL VERSION-----
# Internal queue size for single events in processing pipeline
#queue_size: 1000

# The internal queue size for bulk events in the processing pipeline.
# Do not modify this value.
#bulk_queue_size: 0
--------------------------------------------------
------From my most recent log after restarting icingabeat------
INFO Max Retries set to: 3
INFO Flush Interval set to: 1s
INFO Max Bulk Size set to: 50
INFO Flush Interval set to: 1s
--------------------------------------------------

So what minimum would you advise setting those values to, and should the full version be used instead?

bobapple commented 7 years ago

You don't need to use the full version; you can just copy the settings you want to override into your icingabeat.yml. If you don't want buffering, I suggest you set queue_size to a minimum, let's say queue_size: 5.
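The suggested override could look like this (a sketch of a icingabeat.yml fragment; the rest of your config stays as-is):

```yaml
# Keep the in-memory event queue small so long-stale events are not
# replayed after an extended Elasticsearch outage
queue_size: 5
```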

bobapple commented 7 years ago

One more thing: With the max_retries setting you can configure how often icingabeat will try to send an event to Elasticsearch before it gives up and drops the event. The default is max_retries: 3. Decreasing this will result in fewer events in the queue because they get dropped faster.
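In config terms, this would be an output-level setting; a sketch assuming the Elasticsearch output (lowering it to 1 is only an illustration of the trade-off, not a recommended value):

```yaml
output.elasticsearch:
  # Drop an event after one failed send attempt instead of the
  # default three, so fewer events accumulate in the queue
  max_retries: 1
```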