Updated by mfriedrich on 2014-10-11 17:18:26 +00:00
Updated by wimmer on 2014-11-23 13:33:59 +00:00
After 2 months I tried scheduling again. I use the Debian packages, version 2.2. The scheduler is still confused. Here are my findings:
Current Status  Current Attempt     Last Check Time       Next Scheduled Active Check
OK              1/3 (HARD state)    11-23-2014 10:00:40   11-23-2014 10:05:40
CRITICAL        2/3 (SOFT state)    11-23-2014 10:05:40   11-23-2014 10:10:40
CRITICAL        3/3 (SOFT state)    11-23-2014 10:10:40   11-23-2014 10:11:40
CRITICAL        1/3 (HARD state)    11-23-2014 10:11:40   11-23-2014 10:12:40
CRITICAL        1/3 (HARD state)    11-23-2014 10:12:40   11-23-2014 10:15:40
Can somebody please confirm this bug?
I would like to use Icinga2 on my production systems, but that is impossible with this strange behavior. Thanks for your work.
Updated by mfriedrich on 2014-11-24 13:38:29 +00:00
Updated by mfriedrich on 2015-01-26 15:05:56 +00:00
Updated by mfriedrich on 2015-02-20 20:32:16 +00:00
Please test the attached patch; it is untested.
Updated by schirrmie on 2015-03-05 07:25:22 +00:00
Hi, there are two issues here. The first is that the check attempt is counted incorrectly when the state changes from soft to hard. Your attached patch worked fine for us and solves this issue -> thank you! The second issue is that the retry_interval is not handled correctly; we see the same bug with icinga 2.2.4. Here is an example from our setup:
max_check_attempts = 3
check_interval = 2m
retry_interval = 1m
state last next attempt
ok 07:49 07:51 1/3 H
critical 07:51 07:53 1/3 S # wrong
critical 07:53 07:54 2/3 S
critical 07:54 07:55 3/3 S
critical 07:55 07:56 1/3 H # wrong
critical 07:56 07:57 1/3 H # wrong
critical 07:57 07:59 1/3 H
ok 07:59 08:01 1/3 H
Your patch does not change the behaviour of the retry_interval. I hope this helps.
Regards schirrmie
Updated by mfriedrich on 2015-03-07 15:35:40 +00:00
Ok, thanks for testing. Then I'll leave this issue open for 2.4 and won't apply the patch for 2.3 for now. I don't want to break the release.
Updated by mfriedrich on 2015-03-11 15:34:03 +00:00
Updated by mfriedrich on 2015-04-08 14:28:41 +00:00
Ok, found the issue we've been discussing yesterday evening before the SIG NOC meeting :)
Updated by mfriedrich on 2015-05-18 13:11:00 +00:00
I'll need to dig into this a little deeper. I'm not sure if this might happen for 2.4.0 - feel free to submit any patches helping resolve the issue.
Updated by zearan on 2015-05-26 12:03:02 +00:00
The problem seems to lie in file libs/icinga/checkable-check.cpp. The function Checkable::ExecuteCheck() (line 521) executes the calculation of the next check before running the check itself:
void Checkable::ExecuteCheck()
{
    [...]
    UpdateNextCheck();
    [...]
    GetCheckCommand()->Execute(this, cr, NULL, false);
    [...]
}
The time of the next check does not get re-evaluated by Checkable::ProcessCheckResult() (line 237) when the result is available. I have tried to fix it with a one-liner, but I cannot say whether this causes any unwanted side effects. At least it did not crash anything when I tried it.
Updated by wimmer on 2015-06-19 17:34:18 +00:00
I applied the patch to 2.3.5 and it works great. Please merge it into the master branch. Thank you very much for your work.
Here is my setup:
object Host "hhh" {
  import "generic-host"
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 3
  address = "hhh"
}
object Service "SMTP" {
  import "generic-service"
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 3
  host_name = "hhh"
  check_command = "smtp"
}
and results:
Service SMTP:
==============
Current Status  Current Attempt     Last Check Time       Next Scheduled Active Check
OK              1/3 (HARD state)    06-19-2015 18:27:43   06-19-2015 18:32:43
CRITICAL        2/3 (SOFT state)    06-19-2015 18:32:43   06-19-2015 18:33:43
CRITICAL        3/3 (SOFT state)    06-19-2015 18:33:43   06-19-2015 18:34:43
CRITICAL        1/3 (HARD state)    06-19-2015 18:34:43   06-19-2015 18:37:43
CRITICAL        1/3 (HARD state)    06-19-2015 18:37:43   06-19-2015 18:42:43
OK              1/3 (HARD state)    06-19-2015 18:42:43   06-19-2015 18:47:43
Host:
==========
Current Status  Current Attempt     Last Check Time       Next Scheduled Active Check
UP              1/3 (HARD state)    06-19-2015 18:54:35   06-19-2015 18:59:31
DOWN            2/3 (SOFT state)    06-19-2015 18:59:34   06-19-2015 19:00:31
DOWN            3/3 (SOFT state)    06-19-2015 19:00:34   06-19-2015 19:01:31
DOWN            1/3 (HARD state)    06-19-2015 19:01:34   06-19-2015 19:04:31
DOWN            1/3 (HARD state)    06-19-2015 19:04:34   06-19-2015 19:09:31
UP              1/3 (HARD state)    06-19-2015 19:09:35   06-19-2015 19:14:31
Updated by mfriedrich on 2015-06-23 13:26:15 +00:00
Updated by mfriedrich on 2015-08-17 18:47:40 +00:00
Updated by mfriedrich on 2015-08-18 15:01:14 +00:00
Generally speaking I'd like to split this issue, as it consists of two separate problems.
Updated by mfriedrich on 2015-08-18 16:14:00 +00:00
Updated by mfriedrich on 2015-08-18 16:48:33 +00:00
object Host NodeName + "3" {
  import NodeName
  check_command = "dummy"
  vars.dummy_text = "Last check: $host.last_check$ Last state change: $host.last_state_change$"
  vars.dummy_state = 0
  check_interval = 20m
  retry_interval = 5s
}
Apparently the fix does not work this way, host.next_check still stays at the old scheduled timestamp.
Updated by mfriedrich on 2015-09-28 13:09:10 +00:00
Updated by akrus on 2015-12-22 07:21:51 +00:00
I'm also experiencing the same issue here.
apply Service "WHOIS" {
  import "generic-service"
  zone = "master"
  check_command = "check_whois"
  groups += [ "whois" ]
  check_interval = 24h
  retry_interval = 1m
  vars.notification.mail.groups = [ "admins" ]
  vars.notification.line.groups = [ "line" ]
  assign where host.vars.domain
}
This results in checks being executed every 24 hours (the retry interval doesn't work, although Icinga Web 2 displays it correctly).
Updated by Peter_Story on 2016-01-14 16:02:47 +00:00
I'm also encountering this bug. I have a service with:
check_interval = 10m
retry_interval = 30s
After the first soft failure, the subsequent check is scheduled to run after 10 minutes, instead of 30 seconds.
Updated by arothauf on 2016-01-20 09:34:21 +00:00
I can confirm this is still happening. Shouldn't this bug have a higher priority, given that notifications depend on the transition to a hard state?
Updated by mfriedrich on 2016-01-20 15:22:12 +00:00
It arguably should, but we're all normal people, not machines.
I found another issue while testing the original patch which took me quite a while to figure out why it was happening.
object Host "7287-host" {
  address = "1.2.3.4"
  check_command = "ping4"
  check_interval = 30m
  retry_interval = 10s
}
Now process check results using Icinga Web 2, from DOWN to UP and vice versa.
Calling UpdateNextCheck() works as it should: it computes the correct next check time from GetRetryInterval() and also sends that value to the database.
However, there is one subsequent SetNextCheck() call afterwards which then overrides the value inside the core (and thus in the API and external interfaces).
[2016-01-20 15:43:42 +0100] warning/Checkable: Using next_check '1453301030.300000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453301030) WHERE host_object_id = 132255
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0', active_checks_enabled = '1', check_command = 'ping4', check_source = 'mbmif.int.netways.de', check_type = '0', current_check_attempt = '1', current_notification_number = '0', current_state = '1', endpoint_object_id = 1, event_handler = '', event_handler_enabled = '1', execution_time = '0', flap_detection_enabled = '0', has_been_checked = '1', host_object_id = 132255, instance_id = 1, is_flapping = '0', is_reachable = '1', last_check = FROM_UNIXTIME(1453301022), last_hard_state = '0', last_hard_state_change = FROM_UNIXTIME(1453298041), last_state_change = FROM_UNIXTIME(1453301022), last_time_down = FROM_UNIXTIME(1453301022), last_time_up = FROM_UNIXTIME(1453300989), latency = '0', long_output = '', max_check_attempts = '3', next_check = FROM_UNIXTIME(1453301030), normal_check_interval = '30', notifications_enabled = '1', original_attributes = 'null', output = 'sfsdfsfs', passive_checks_enabled = '1', percent_state_change = '0', perfdata = '', problem_has_been_acknowledged = '0', process_performance_data = '1', retry_check_interval = '0.166667', scheduled_downtime_depth = '0', should_be_scheduled = '1', state_type = '0', status_update_time = FROM_UNIXTIME(1453301022) WHERE host_object_id = 132255
[2016-01-20 15:43:43 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453302822) WHERE host_object_id = 132255
We do ensure that a check is rescheduled when processing a check result from the external command pipe or API actions. This is because we require freshness checks using the check_interval (see #7071 for better proposals) for incoming passive check results.
So this is not a good method for testing anyway. Temporarily removing the code blocks from externalcommandlistener.cpp proves that the patch works.
[2016-01-20 16:18:46 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0', active_checks_enabled = '1', check_command = 'ping4', check_source = 'mbmif.int.netways.de', check_type = '0', current_check_attempt = '1', current_notification_number = '0', current_state = '1', endpoint_object_id = 1, event_handler = '', event_handler_enabled = '1', execution_time = '10.018226861953735', flap_detection_enabled = '0', has_been_checked = '1', host_object_id = 132255, instance_id = 1, is_flapping = '0', is_reachable = '1', last_check = FROM_UNIXTIME(1453303125), last_hard_state = '0', last_hard_state_change = FROM_UNIXTIME(1453303093), last_state_change = FROM_UNIXTIME(1453303125), last_time_down = FROM_UNIXTIME(1453303125), last_time_up = FROM_UNIXTIME(1453303093), latency = '0', long_output = '', max_check_attempts = '3', next_check = FROM_UNIXTIME(1453303132), normal_check_interval = '30', notifications_enabled = '1', original_attributes = '{\"address\":\"8.8.8.8\"}', output = 'CRITICAL - Plugin timed out after 10 seconds', passive_checks_enabled = '1', percent_state_change = '8', perfdata = '', problem_has_been_acknowledged = '0', process_performance_data = '1', retry_check_interval = '0.166667', scheduled_downtime_depth = '0', should_be_scheduled = '1', state_type = '0', status_update_time = FROM_UNIXTIME(1453303125) WHERE host_object_id = 132255
[2016-01-20 16:18:52 +0100] warning/Checkable: Using next_check '1453303142.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:18:53 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303142) WHERE host_object_id = 132255
[2016-01-20 16:19:02 +0100] warning/Checkable: Using next_check '1453303152.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:19:02 +0100] warning/Checkable: Using next_check '1453303152.940000' for checkable '7287-host' in state type '0' and state '2'.
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303152) WHERE host_object_id = 132255
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET next_check = FROM_UNIXTIME(1453303152) WHERE host_object_id = 132255
[2016-01-20 16:19:03 +0100] debug/IdoMysqlConnection: Query: UPDATE icinga_hoststatus SET acknowledgement_type = '0', active_checks_enabled = '1', check_command = 'ping4', check_source = 'mbmif.int.netways.de', check_type = '0', current_check_attempt = '2', current_notification_number = '0', current_state = '1', endpoint_object_id = 1, event_handler = '', event_handler_enabled = '1', execution_time = '10.010746002197266', flap_detection_enabled = '0', has_been_checked = '1', host_object_id = 132255, instance_id = 1, is_flapping = '0', is_reachable = '1', last_check = FROM_UNIXTIME(1453303142), last_hard_state = '0', last_hard_state_change = FROM_UNIXTIME(1453303093), last_state_change = FROM_UNIXTIME(1453303125), last_time_down = FROM_UNIXTIME(1453303142), last_time_up = FROM_UNIXTIME(1453303093), latency = '0', long_output = '', max_check_attempts = '3', next_check = FROM_UNIXTIME(1453303152), normal_check_interval = '30', notifications_enabled = '1', original_attributes = '{\"address\":\"8.8.8.8\"}', output = 'CRITICAL - Plugin timed out after 10 seconds', passive_checks_enabled = '1', percent_state_change = '8', perfdata = '', problem_has_been_acknowledged = '0', process_performance_data = '1', retry_check_interval = '0.166667', scheduled_downtime_depth = '0', should_be_scheduled = '1', state_type = '0', status_update_time = FROM_UNIXTIME(1453303142) WHERE host_object_id = 132255
$ curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/hosts/7287-host' -H "accept: application/json" -X POST -d '{ "attrs" : { "address": "1.2.3.4" }}'
Manual recheck will cause ProcessCheckResult() to immediately update the next check using the retry interval.
Updated by mfriedrich on 2016-01-20 15:30:03 +00:00
Applied in changeset a51e647cc760bd5f7c4de6182961a477478c11a9.
Updated by mfriedrich on 2016-01-22 16:12:08 +00:00
Updated by mfriedrich on 2016-01-22 16:31:21 +00:00
Updated by mfriedrich on 2016-01-22 17:37:02 +00:00
There's still one issue with passive check results: calling UpdateNextCheck() will first set a (probably smaller) next check time, and after invoking ProcessCheckResult() a separate call to SetNextCheck() is required to ensure the check_interval is used for the freshness of passively received check results.
This affects all checks where we do not care whether they are active or passive.
It is even worse for active checks: the next check time is reset to check_interval when a passive check result is submitted via the API / UI.
Moving that task inside ProcessCheckResult() ensures that subsequent calls to SetNextCheck() happen with the correct check interval for passive check results. Active checks are then rescheduled based on their current state, especially if a submitted check result enforces a state change, e.g. HARD OK, or SOFT NOT-OK where the retry_interval is required.
Furthermore, updating the next check time before sending a DB IDO update for OnNewCheckResult will also help reduce duplicated next_check queries (see linked #11019).
Updated by mfriedrich on 2016-01-22 18:12:39 +00:00
Applied in changeset 2a11b27972e4325bf80e9abc9017eab7dd03e712.
Updated by mfriedrich on 2016-02-18 10:24:53 +00:00
Updated by mfriedrich on 2016-02-25 13:59:06 +00:00
Updated by mfriedrich on 2016-03-09 10:34:46 +00:00
Updated by mfriedrich on 2016-03-09 10:36:02 +00:00
The patch was broken in many ways, causing problems with actively scheduled checks, passive checks, etc. The original fix has been reverted, and a proper fix is discussed in #11336.
This issue has been migrated from Redmine: https://dev.icinga.com/issues/7287
Created by wimmer on 2014-09-20 15:16:39 +00:00
Assignee: mfriedrich
Status: Resolved (closed on 2016-01-22 18:12:39 +00:00)
Target Version: 2.4.2
Last Update: 2016-03-09 10:36:02 +00:00 (in Redmine)
I found strange behavior in re-check scheduling. I use the Icinga1 and Icinga2 Debian packages and the same attributes for the checks:
When some service changes its status, Icinga1 works as expected:
But Icinga2 uses strange attempt numbering [there is no 1/3 (SOFT state)] and a wrong time for the first scheduled re-check (15:10 instead of 15:06). It looks like Icinga2 skips the "1/3 SOFT state" phase. The scheduling in CRITICAL/HARD states is wrong too (compare with the Icinga1 times):
Changesets
2015-03-11 15:33:36 +00:00 by mfriedrich 05c237c780d4cdbce4ad844febec0d012cf7796f
2015-08-21 08:24:49 +00:00 by mfriedrich 6f252bb5c5ac143577a3a19db64428d013480e7d
2016-01-20 15:29:01 +00:00 by mfriedrich a51e647cc760bd5f7c4de6182961a477478c11a9
2016-01-22 17:40:14 +00:00 by mfriedrich 2a11b27972e4325bf80e9abc9017eab7dd03e712
2016-02-23 08:24:57 +00:00 by mfriedrich 9ca7245c188c2cb57d0773024e339ad59ad08548
2016-03-05 17:15:03 +00:00 by mfriedrich b8e3d61820756066dae5ba9f1e727afdc2d4801c
2016-03-05 17:16:49 +00:00 by mfriedrich ef532f20eb4c168dc532d7f9532d88409a821902
2016-03-11 14:55:03 +00:00 by mfriedrich 8344f749cdcbcb5fc94cdb6436a9004e0a54d6a2
2016-03-11 14:55:14 +00:00 by mfriedrich f99feab08c8929b8e2db8a49689b8f09bd3c3448