[dev.icinga.com #8141] Optimized Freshness Checking

icinga-migration commented 9 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/8141

Created by jrhunt on 2014-12-24 21:54:06 +00:00

Assignee: (none)
Status: New
Target Version: Backlog
Last Update: 2015-05-18 12:18:15 +00:00 (in Redmine)

I have attached a patch that optimizes host and service freshness checking.

My team and I run Icinga 1.x in a very large environment (~60k service checks on our largest node, 95% of which are passive submission, every 5 minutes). Using a custom event broker to keep up with the load of inbound check results, we have noticed very few scaling problems. However, when we have large swaths of the infrastructure fail, like when a hypervisor dies, we see a large influx of freshness checks, which causes the Icinga process to thrash as it tries (in vain) to schedule tens of thousands of check_dummy active checks to report that the services are stale.

To remedy this, we patched Icinga to recognize two new host and service attributes:

freshness_status - Numeric status code to use when the check is determined to be stale, with the usual meanings (0 = OK, etc.)
freshness_message - A description explaining the nature of the freshness failure.

For example, for passive services that are usually fed by our monitoring agent software, we have defined these two attributes as follows:

    check_freshness 1
    freshness_threshold 900
    freshness_status 1
    freshness_message "No result from monitoring agent in over 15 minutes"

We then modified the Icinga check_host_result_freshness() and check_service_result_freshness() functions to bypass the normal schedule-an-active-check-run behavior if these attributes are present, and instead synthesize a check result and inject it into the check results list.
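The decision described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the `sketch_service` and `sketch_check_result` structs, the `FRESHNESS_STATUS_UNSET` sentinel, and `handle_freshness()` are hypothetical stand-ins for Icinga's real structures and functions. The staleness test is reduced to a plain threshold comparison, and the sketch assumes both attributes must be set before the shortcut applies.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical sentinel meaning "freshness_status was not configured". */
#define FRESHNESS_STATUS_UNSET (-1)

/* Hypothetical, heavily simplified stand-in for Icinga's service object. */
typedef struct {
    const char *description;
    int check_freshness;           /* check_freshness directive (0 or 1) */
    int freshness_threshold;       /* seconds */
    int freshness_status;          /* FRESHNESS_STATUS_UNSET if not configured */
    const char *freshness_message; /* NULL if not configured */
    time_t last_check;             /* time of the last received result */
} sketch_service;

/* Hypothetical, simplified stand-in for a check result. */
typedef struct {
    int return_code;
    char output[256];
    time_t finish_time;
} sketch_check_result;

typedef enum {
    STALE_NONE,        /* result is fresh, nothing to do */
    STALE_SYNTHESIZED, /* a result was synthesized into *out */
    STALE_ACTIVE_CHECK /* caller should fall back to scheduling an active check */
} stale_action;

/* Decide what to do for one service at time `now`. */
stale_action handle_freshness(const sketch_service *svc, time_t now,
                              sketch_check_result *out)
{
    assert(svc != NULL && out != NULL);

    if (!svc->check_freshness)
        return STALE_NONE;
    if (difftime(now, svc->last_check) <= svc->freshness_threshold)
        return STALE_NONE;

    /* Attributes absent: keep the existing behavior for compatibility. */
    if (svc->freshness_status == FRESHNESS_STATUS_UNSET ||
        svc->freshness_message == NULL)
        return STALE_ACTIVE_CHECK;

    /* Attributes present: build the result directly, no check_dummy run. */
    out->return_code = svc->freshness_status;
    snprintf(out->output, sizeof out->output, "%s", svc->freshness_message);
    out->finish_time = now;
    return STALE_SYNTHESIZED;
}
```

In this sketch the caller would append a `STALE_SYNTHESIZED` result to the check results list and treat `STALE_ACTIVE_CHECK` exactly as the unpatched code path does.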

Upon exercising this code in our testbed environment, we noticed that recovery from large-scale outages (~20k stale checks at a time) would cause Icinga to thrash: it would first mark all of the affected services as stale, then process all of the inbound results from our event broker, then mark everything as stale again, and so on. We found it prudent to reap all of the check results on every 1000th stale check, to keep this undesired behavior from occurring.
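The batching idea can be sketched as below. This is an assumption-laden illustration rather than the patch itself: `mark_stale_in_batches()`, the `reap_fn` callback type, and `REAP_EVERY_N_STALE` are hypothetical names, and the callback stands in for whatever routine the core uses to reap queued check results.

```c
#include <assert.h>
#include <stddef.h>

/* Reap interval taken from the description above: every 1000th stale check. */
#define REAP_EVERY_N_STALE 1000

/* Hypothetical callback standing in for the core's result-reaping routine. */
typedef void (*reap_fn)(void *ctx);

/* Walk `stale_count` stale checks and invoke `reap` after every
 * REAP_EVERY_N_STALE of them, so that inbound passive results are
 * processed in between instead of piling up behind the stale results.
 * Returns the number of times `reap` was called. */
size_t mark_stale_in_batches(size_t stale_count, reap_fn reap, void *ctx)
{
    size_t reaps = 0;

    assert(reap != NULL);
    for (size_t i = 1; i <= stale_count; i++) {
        /* Synthesizing and queueing the stale result for check i
         * would happen here. */
        if (i % REAP_EVERY_N_STALE == 0) {
            reap(ctx);
            reaps++;
        }
    }
    return reaps;
}

/* Example callback that only counts how often it was invoked. */
void count_reap(void *ctx)
{
    (*(size_t *)ctx)++;
}
```

With roughly 20k stale checks, as in the testbed scenario, this loop would trigger the reap callback 20 times during the freshness pass.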

Note that for configurations that don't specify these attributes, the current schedule-an-active-check-run behavior persists, to preserve backwards compatibility.

Also note that earlier versions of this patch (as discussed in #icinga-devel and between dnsmichi and me, iamjameshunt, on Twitter) made problematic changes to the add_host() and add_service() functions. This version of the patch does not suffer from this problem, at the expense of not allowing event brokers to pass the freshness_status / freshness_message attributes via those functions.

Attachments

icinga-fresh.patch jrhunt - 2014-12-24 21:53:54 +00:00

Relations:

relates #7071

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-01-24 12:59:37 +00:00

Sorry for the delayed answer, January does not seem to be a good month.

While I like the initial idea, I'm holding off on adding new configuration options and state to Icinga 1.x. As you already mentioned, the event broker modules won't receive these attributes and their values, but they probably should, so that they can be written to the IDOUtils database and shown in Icinga Web and so on. The same issue applies to Classic UI. Livestatus is out of scope, since it is not maintained by Icinga in 1.x.

We do have engineering work to do on Icinga 2 with passive checks and freshness (#7071), combined with API ideas for feeding passive checks into the core. I will take your ideas and patch design into account once we get there.

Curious what others think though :)

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-04-03 16:30:09 +00:00

Relates set to 7071

icinga-migration commented 9 years ago

Updated by berk on 2015-05-18 12:18:15 +00:00

Target Version set to Backlog

Icinga / icinga-core

[dev.icinga.com #8141] Optimized Freshness Checking #1537