Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

[dev.icinga.com #7287] Re-checks scheduling w/ retry_interval #2009

Closed: icinga-migration closed this issue 8 years ago

icinga-migration commented 10 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/7287

Created by wimmer on 2014-09-20 15:16:39 +00:00

Assignee: mfriedrich
Status: Resolved (closed on 2016-01-22 18:12:39 +00:00)
Target Version: 2.4.2
Last Update: 2016-03-09 10:36:02 +00:00 (in Redmine)

Icinga Version: 2.1.1

I found strange behavior in re-check scheduling. I use the Icinga1 and Icinga2 Debian packages, with the same attributes for checks in both:

check_interval = 5m
retry_interval = 1m
max_check_attempts = 3

When some service changes its status, Icinga1 works as expected:

Current Status  Current Attempt         Last Check Time             Next Scheduled Active Check
OK                     1/3  (HARD state)       20-09-2014 15:00:00     20-09-2014 15:05:00
CRITICAL        1/3  (SOFT state)        20-09-2014 15:05:00     20-09-2014 15:06:00
CRITICAL        2/3  (SOFT state)        20-09-2014 15:06:00     20-09-2014 15:07:00
CRITICAL        3/3  (HARD state)       20-09-2014 15:07:00     20-09-2014 15:12:00
CRITICAL        3/3  (HARD state)       20-09-2014 15:12:00     20-09-2014 15:17:00

But Icinga2 uses strange attempt numbering [there is no 1/3 (SOFT state)] and schedules the first re-check at the wrong time (15:10 instead of 15:06). It looks like Icinga2 skips the "1/3 SOFT state" phase. The scheduling in CRITICAL/HARD states is wrong too (compare with the Icinga1 times):

Current Status  Current Attempt          Last Check Time            Next Scheduled Active Check
OK                     1/3  (HARD state)       09-20-2014 15:00:10     09-20-2014 15:05:10
CRITICAL         2/3  (SOFT state)       09-20-2014 15:05:10     09-20-2014 15:10:10
CRITICAL         3/3  (SOFT state)       09-20-2014 15:10:10     09-20-2014 15:11:10
CRITICAL         1/3  (HARD state)       09-20-2014 15:11:10     09-20-2014 15:12:10
CRITICAL         1/3  (HARD state)       09-20-2014 15:12:10     09-20-2014 15:15:10
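
For reference, the rule the reporter expects (and which Icinga1 implements) is: while a checkable is in a SOFT NOT-OK state, the next check is scheduled after retry_interval; once the state is OK or HARD, check_interval applies. A minimal standalone C++ sketch of that rule, with hypothetical names (this is not the icinga2 source):

// Illustrative sketch of the expected scheduling rule; hypothetical names.
#include <iostream>

enum class StateType { Soft, Hard };

// retry_interval while a problem is still SOFT, check_interval otherwise.
double NextCheckDelay(StateType stateType, bool isOk,
                      double checkInterval, double retryInterval)
{
    if (!isOk && stateType == StateType::Soft)
        return retryInterval;   // re-check quickly until the state hardens
    return checkInterval;       // OK or HARD: back to the normal interval
}

int main()
{
    const double checkInterval = 5 * 60;  // check_interval = 5m
    const double retryInterval = 60;      // retry_interval = 1m

    // First CRITICAL result (1/3 SOFT): next check after retry_interval.
    std::cout << NextCheckDelay(StateType::Soft, false, checkInterval, retryInterval)
              << "s\n";  // 60s, i.e. 15:05 -> 15:06 in the Icinga1 table
    // HARD CRITICAL (3/3): next check after check_interval.
    std::cout << NextCheckDelay(StateType::Hard, false, checkInterval, retryInterval)
              << "s\n";  // 300s, i.e. 15:07 -> 15:12
}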

Attachments

Changesets

2015-03-11 15:33:36 +00:00 by mfriedrich 05c237c780d4cdbce4ad844febec0d012cf7796f

Don't increment check attempt counter on OK->NOT-OK transition

refs #7287

Signed-off-by: Michael Friedrich <michael.friedrich@netways.de>

2015-08-21 08:24:49 +00:00 by mfriedrich 6f252bb5c5ac143577a3a19db64428d013480e7d

Don't increment check attempt counter on OK->NOT-OK transition

This fixes the problem that the first SOFT state is actually considered
the second state.

refs #7287
fixes #9897

Signed-off-by: Michael Friedrich <michael.friedrich@netways.de>

2016-01-20 15:29:01 +00:00 by mfriedrich a51e647cc760bd5f7c4de6182961a477478c11a9

Fix check scheduling w/ retry_interval

fixes #7287

2016-01-22 17:40:14 +00:00 by mfriedrich 2a11b27972e4325bf80e9abc9017eab7dd03e712

Properly set the next check time for active and passive checks

fixes #7287
refs #11019

2016-02-23 08:24:57 +00:00 by mfriedrich 9ca7245c188c2cb57d0773024e339ad59ad08548

Properly set the next check time for active and passive checks

fixes #7287
refs #11019

2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61820756066dae5ba9f1e727afdc2d4801c

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-05 17:16:49 +00:00 by mfriedrich ef532f20eb4c168dc532d7f9532d88409a821902

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:03 +00:00 by mfriedrich 8344f749cdcbcb5fc94cdb6436a9004e0a54d6a2

Revert "Properly set the next check time for active and passive checks"

This reverts commit 2a11b27972e4325bf80e9abc9017eab7dd03e712.

This patch does not properly work and breaks the check_interval setting
for passive checks. Requires a proper patch.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

2016-03-11 14:55:14 +00:00 by mfriedrich f99feab08c8929b8e2db8a49689b8f09bd3c3448

Revert "Fix check scheduling w/ retry_interval"

This reverts commit a51e647cc760bd5f7c4de6182961a477478c11a9.

This patch causes trouble with check results received
1) passively 2) throughout the cluster. A proper patch
for setting the retry_interval on NOT-OK state changes
is required.

refs #11248
refs #11257
refs #11273

(the old issue)
refs #7287

Relations:

icinga-migration commented 10 years ago

Updated by mfriedrich on 2014-10-11 17:18:26 +00:00

icinga-migration commented 9 years ago

Updated by wimmer on 2014-11-23 13:33:59 +00:00

After 2 months I tried the scheduling again, using the Debian packages of version 2.2. The scheduler is still confused. Here are my findings:

Current Status  Current Attempt            Last Check Time            Next Scheduled Active Check
OK                      1/3  (HARD state)        11-23-2014 10:00:40     11-23-2014 10:05:40
CRITICAL          2/3  (SOFT state)        11-23-2014 10:05:40      11-23-2014 10:10:40
CRITICAL          3/3  (SOFT state)        11-23-2014 10:10:40      11-23-2014 10:11:40
CRITICAL          1/3  (HARD state)       11-23-2014 10:11:40      11-23-2014 10:12:40
CRITICAL          1/3  (HARD state)       11-23-2014 10:12:40      11-23-2014 10:15:40

Can somebody please confirm this bug?

I would like to use Icinga2 on my production systems, but that is impossible with this strange behavior. Thanks for your work.

icinga-migration commented 9 years ago

Updated by mfriedrich on 2014-11-24 13:38:29 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-01-26 15:05:56 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-02-20 20:32:16 +00:00

Please test the attached patch; it is untested.

icinga-migration commented 9 years ago

Updated by schirrmie on 2015-03-05 07:25:22 +00:00

Hi, there are two issues here. The first is that the check attempt is counted incorrectly when the state changes from soft to hard. Your attached patch worked fine for us and solves this issue, thank you! The second issue is that the retry_interval is not handled correctly; we see the same bug with icinga 2.2.4. Here is an example from us:

max_check_attempts = 3
check_interval = 2m
retry_interval = 1m

state       last        next        attempt
ok          07:49       07:51       1/3 H
critical    07:51       07:53       1/3 S   # wrong
critical    07:53       07:54       2/3 S
critical    07:54       07:55       3/3 S
critical    07:55       07:56       1/3 H   # wrong
critical    07:56       07:57       1/3 H   # wrong
critical    07:57       07:59       1/3 H
ok          07:59       08:01       1/3 H

Your patch does not change the behaviour of the retry_interval. I hope this helps.

Regards schirrmie
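
The first of these two issues corresponds to the "Don't increment check attempt counter on OK->NOT-OK transition" changesets listed above: the first NOT-OK result after an OK must count as attempt 1/3 (SOFT), not 2/3. A simplified standalone sketch of that counting rule, with hypothetical names (not the actual icinga2 code):

// Simplified, hypothetical model of the attempt-counting fix; not icinga2 source.
#include <iostream>

struct Attempts {
    int attempt = 1;
    int maxAttempts = 3;
    bool lastOk = true;
    bool hard = true;

    void Process(bool ok) {
        if (ok) {
            attempt = 1; hard = true;        // recovery resets the counter
        } else if (lastOk) {
            attempt = 1; hard = false;       // OK->NOT-OK: stay at 1/3 SOFT
        } else if (attempt < maxAttempts) {
            attempt++;                       // NOT-OK->NOT-OK: count up
            hard = (attempt == maxAttempts); // 3/3 becomes HARD
        }                                    // already HARD: stays 3/3 HARD
        lastOk = ok;
    }
};

int main() {
    Attempts a;
    for (int i = 0; i < 3; i++) {
        a.Process(false);
        std::cout << a.attempt << "/" << a.maxAttempts
                  << (a.hard ? " HARD\n" : " SOFT\n"); // 1/3 SOFT, 2/3 SOFT, 3/3 HARD
    }
}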

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-03-07 15:35:40 +00:00

Ok, thanks for testing. Then I'll leave this issue open for 2.4 and won't apply the patch for 2.3 for now. I don't want to break the release.

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-03-11 15:34:03 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-04-08 14:28:41 +00:00

Ok, found the issue we've been discussing yesterday evening before the SIG NOC meeting :)

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-05-18 13:11:00 +00:00

I'll need to dig into this a little deeper. I'm not sure if this might happen for 2.4.0 - feel free to submit any patches helping resolve the issue.

icinga-migration commented 9 years ago

Updated by zearan on 2015-05-26 12:03:02 +00:00

The problem seems to lie in lib/icinga/checkable-check.cpp. The function Checkable::ExecuteCheck() (line 521) calculates the next check before running the check itself:

void Checkable::ExecuteCheck()
{
    /* ... */
    UpdateNextCheck();
    /* ... */
    GetCheckCommand()->Execute(this, cr, NULL, false);
    /* ... */
}

The time of the next check does not get re-evaluated by Checkable::ProcessCheckResult() (line 237) when the result is available. I tried to fix it with a one-liner, but I cannot say whether this causes any unwanted side effects. At least it did not crash anything when I tried it.
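
A standalone model of the call order described here, with hypothetical names (not the actual icinga2 source): because ExecuteCheck() schedules the next check before the plugin result is known, a fresh SOFT state cannot pull the next check in to retry_interval. Re-evaluating the schedule at the end of ProcessCheckResult(), once the state type is known, is the idea behind the one-line fix:

// Hypothetical, simplified model of the ordering problem; not icinga2 source.
#include <iostream>

struct Checkable {
    double checkInterval = 300;  // check_interval = 5m
    double retryInterval = 60;   // retry_interval = 1m
    bool softState = false;      // only known after a result was processed
    double nextCheckDelay = 0;

    void UpdateNextCheck() {
        // Called from ExecuteCheck() before the plugin runs, so the
        // SOFT/HARD outcome of the current check cannot influence it.
        nextCheckDelay = softState ? retryInterval : checkInterval;
    }

    void ProcessCheckResult(bool ok, bool reachedMaxAttempts) {
        softState = !ok && !reachedMaxAttempts;
        // The one-line idea: re-evaluate the schedule now that the
        // state type for this result is known.
        UpdateNextCheck();
    }
};

int main() {
    Checkable c;
    c.UpdateNextCheck();                 // scheduled with check_interval (300s)
    c.ProcessCheckResult(false, false);  // first CRITICAL arrives: SOFT state
    std::cout << c.nextCheckDelay << "s\n"; // now 60s, i.e. retry_interval
}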

icinga-migration commented 9 years ago

Updated by wimmer on 2015-06-19 17:34:18 +00:00

I applied the patch to 2.3.5 and it works great. Please merge it into the master branch. Thank you very much for your work.

Here is my setup:

object Host "hhh" {
  import "generic-host"
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 3
  address = "hhh"
}

object Service "SMTP" {
  import "generic-service"
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 3
  host_name = "hhh"
  check_command = "smtp"
}

and results:

Service SMTP:
==============
Current Status: Current Attempt:        Last Check Time:        Next Scheduled Active Check:
OK                     1/3  (HARD state)       06-19-2015 18:27:43     06-19-2015 18:32:43
CRITICAL        2/3  (SOFT state)       06-19-2015 18:32:43     06-19-2015 18:33:43
CRITICAL        3/3  (SOFT state)       06-19-2015 18:33:43     06-19-2015 18:34:43
CRITICAL        1/3  (HARD state)       06-19-2015 18:34:43     06-19-2015 18:37:43
CRITICAL        1/3  (HARD state)       06-19-2015 18:37:43     06-19-2015 18:42:43
OK                     1/3  (HARD state)       06-19-2015 18:42:43     06-19-2015 18:47:43

Host:
==========
Current Status: Current Attempt:        Last Check Time:        Next Scheduled Active Check:
UP                     1/3  (HARD state)       06-19-2015 18:54:35     06-19-2015 18:59:31     
DOWN              2/3  (SOFT state)       06-19-2015 18:59:34     06-19-2015 19:00:31
DOWN              3/3  (SOFT state)       06-19-2015 19:00:34     06-19-2015 19:01:31
DOWN              1/3  (HARD state)       06-19-2015 19:01:34     06-19-2015 19:04:31
DOWN              1/3  (HARD state)       06-19-2015 19:04:34     06-19-2015 19:09:31
UP                     1/3  (HARD state)       06-19-2015 19:09:35     06-19-2015 19:14:31

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-06-23 13:26:15 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-08-17 18:47:40 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-08-18 15:01:14 +00:00

Generally speaking I'd like to split this issue, as it consists of two issues: the check attempt counter being incremented incorrectly on the OK->NOT-OK transition, and the retry_interval not being applied when rescheduling after a NOT-OK result.

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-08-18 16:14:00 +00:00

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-08-18 16:48:33 +00:00

object Host NodeName + "3" {
  import NodeName
  check_command = "dummy"
  vars.dummy_text = "Last check: $host.last_check$ Last state change: $host.last_state_change$"
  vars.dummy_state = 0

  check_interval = 20m
  retry_interval = 5s
}

Apparently the fix does not work this way; host.next_check still stays at the old scheduled timestamp.

icinga-migration commented 9 years ago

Updated by mfriedrich on 2015-09-28 13:09:10 +00:00

icinga-migration commented 8 years ago

Updated by akrus on 2015-12-22 07:21:51 +00:00

I'm also experiencing the same issue here.

apply Service "WHOIS" {
  import "generic-service"

  zone = "master"

  check_command = "check_whois"

  groups += [ "whois" ]
  check_interval = 24h
  retry_interval = 1m

  vars.notification.mail.groups = [ "admins" ]
  vars.notification.line.groups = [ "line" ]

  assign where host.vars.domain
}

This results in checks being executed every 24 hours; the retry interval is not applied, even though Icinga Web 2 displays it correctly.

icinga-migration commented 8 years ago

Updated by Peter_Story on 2016-01-14 16:02:47 +00:00

I'm also encountering this bug. I have a service with:

check_interval = 10m
retry_interval = 30s

After the first soft failure, the subsequent check is scheduled to run after 10 minutes, instead of 30 seconds.

icinga-migration commented 8 years ago

Updated by arothauf on 2016-01-20 09:34:21 +00:00

I can confirm this is still happening. Shouldn't this bug have a higher priority, given that notifications depend on the state turning hard?

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-20 15:22:12 +00:00

It probably should, but we're all normal people, not machines.

I found another issue while testing the original patch; it took me quite a while to figure out why it was happening.

Tests using the external command pipe

object Host "7287-host" {

  address = "1.2.3.4"
  check_command = "ping4"
  check_interval = 30m
  retry_interval = 10s
}

Now process check results using Icinga Web 2, from DOWN to UP and vice versa.

Calling UpdateNextCheck() works as it should: it computes the next check time from the correct GetRetryInterval() value and also sends it to the database.

There's one subsequent SetNextCheck() call afterwards which then overrides the value inside the core (and thus in the API and external interfaces).

[2016-01-20 15:43:42 +0100] warning/Checkable: Using next_check '1453301030.300000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453301030) WHERE host_object_id = 132255
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'ping4',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '1',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 1,  event_handler = '',  event_handler_enabled = '1',  execution_time = '0',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 132255,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1453301022),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1453298041),  last_state_change = FROM_UNIXTIME(1453301022),  last_time_down = FROM_UNIXTIME(1453301022),  last_time_up = FROM_UNIXTIME(1453300989),  latency = '0',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1453301030),  normal_check_interval = '30',  notifications_enabled = '1',  original_attributes = 'null',  output = 'sfsdfsfs',  passive_checks_enabled = '1',  percent_state_change = '0',  perfdata = '',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.166667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1453301022) WHERE host_object_id = 132255
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453302822) WHERE host_object_id = 132255

We deliberately reschedule a check when processing a check result from the external command pipe or API actions. This is because incoming passive check results require freshness checks based on the check_interval (see #7071 for better proposals).

So this is not a good method for testing anyway. Temporarily removing those code blocks from externalcommandlistener.cpp proves that the patch works.

API tests

[2016-01-20 16:18:46 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'ping4',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '1',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 1,  event_handler = '',  event_handler_enabled = '1',  execution_time = '10.018226861953735',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 132255,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1453303125),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1453303093),  last_state_change = FROM_UNIXTIME(1453303125),  last_time_down = FROM_UNIXTIME(1453303125),  last_time_up = FROM_UNIXTIME(1453303093),  latency = '0',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1453303132),  normal_check_interval = '30',  notifications_enabled = '1',  original_attributes = '{\"address\":\"8.8.8.8\"}',  output = 'CRITICAL - Plugin timed out after 10 seconds',  passive_checks_enabled = '1',  percent_state_change = '8',  perfdata = '',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.166667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1453303125) WHERE host_object_id = 132255
[2016-01-20 16:18:52 +0100] warning/Checkable: Using next_check '1453303142.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:18:53 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303142) WHERE host_object_id = 132255
[2016-01-20 16:19:02 +0100] warning/Checkable: Using next_check '1453303152.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:19:02 +0100] warning/Checkable: Using next_check '1453303152.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303152) WHERE host_object_id = 132255
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303152) WHERE host_object_id = 132255
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = 'ping4',  check_source = 'mbmif.int.netways.de',  check_type = '0',  current_check_attempt = '2',  current_notification_number = '0',  current_state = '1',  endpoint_object_id = 1,  event_handler = '',  event_handler_enabled = '1',  execution_time = '10.010746002197266',  flap_detection_enabled = '0',  has_been_checked = '1',  host_object_id = 132255,  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = FROM_UNIXTIME(1453303142),  last_hard_state = '0',  last_hard_state_change = FROM_UNIXTIME(1453303093),  last_state_change = FROM_UNIXTIME(1453303125),  last_time_down = FROM_UNIXTIME(1453303142),  last_time_up = FROM_UNIXTIME(1453303093),  latency = '0',  long_output = '',  max_check_attempts = '3',  next_check = FROM_UNIXTIME(1453303152),  normal_check_interval = '30',  notifications_enabled = '1',  original_attributes = '{\"address\":\"8.8.8.8\"}',  output = 'CRITICAL - Plugin timed out after 10 seconds',  passive_checks_enabled = '1',  percent_state_change = '8',  perfdata = '',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '0.166667',  scheduled_downtime_depth = '0',  should_be_scheduled = '1',  state_type = '0',  status_update_time = FROM_UNIXTIME(1453303142) WHERE host_object_id = 132255

$ curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/hosts/7287-host' -H "accept: application/json" -X POST -d '{ "attrs" : { "address": "1.2.3.4" }}'

Manual recheck will cause ProcessCheckResult() to immediately update the next check using the retry interval.
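
For completeness, passive check results like the ones used in these tests can also be injected via the API's process-check-result action. An example sketch; the host name and credentials match the test setup above and may need adjusting:

$ curl -k -s -u root:icinga 'https://localhost:5665/v1/actions/process-check-result?host=7287-host' -H "accept: application/json" -X POST -d '{ "exit_status": 1, "plugin_output": "DOWN - simulated passive result" }'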

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-20 15:30:03 +00:00

Applied in changeset a51e647cc760bd5f7c4de6182961a477478c11a9.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-22 16:12:08 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-22 16:31:21 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-22 17:37:02 +00:00

There's still one issue with passive check results: calling UpdateNextCheck() will first set a (probably smaller) next check time, and after invoking ProcessCheckResult() a separate call to SetNextCheck() is required to ensure the check_interval is used for passively received checks and their freshness.

This affects all checks where we do not distinguish between active and passive.

It is even worse for active checks: the next check time is updated to check_interval whenever a result is sent in as a passive check result via the API or UI.

Moving that task inside ProcessCheckResult() ensures that subsequent calls to SetNextCheck() happen with the correct check interval for passive check results. Active checks are then rescheduled based on their current state, especially when a submitted check result enforces a state change, e.g. HARD OK, or SOFT NOT-OK where the retry_interval is required.

Furthermore, updating the next check time before sending a DB IDO update for OnNewCheckResult also helps reduce duplicate next_check queries (see the linked #11019).
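
In short, the policy moved into ProcessCheckResult() can be summarized as follows; a hedged, standalone C++ sketch with hypothetical names (not the actual implementation):

// Hypothetical summary of the rescheduling policy described above.
double NextCheckDelay(bool activeCheck, bool ok, bool softState,
                      double checkInterval, double retryInterval)
{
    // Passive results are rescheduled with check_interval so that
    // freshness checking keeps working.
    if (!activeCheck)
        return checkInterval;

    // Active SOFT NOT-OK results retry quickly.
    if (!ok && softState)
        return retryInterval;

    // Active OK or HARD results use the normal interval.
    return checkInterval;
}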

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-22 18:12:39 +00:00

Applied in changeset 2a11b27972e4325bf80e9abc9017eab7dd03e712.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-02-18 10:24:53 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-02-25 13:59:06 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-09 10:34:46 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-09 10:36:02 +00:00

The patch was broken in many ways, causing problems with actively scheduled checks, passive checks, etc. The original fix has been reverted, and a proper fix is discussed in #11336.