Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

[dev.icinga.com #11215] Scheduled flexible service downtime is not applied #3965

Closed icinga-migration closed 8 years ago

icinga-migration commented 8 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11215

Created by greatexpectations on 2016-02-24 10:35:51 +00:00

Assignee: (none) Status: Closed (closed on 2016-06-23 14:59:27 +00:00) Target Version: (none) Last Update: 2016-06-23 14:59:38 +00:00 (in Redmine)

Icinga Version: 2.4.1
Backport?: Not yet backported
Include in Changelog: 1

Greetings,

we generate some scheduled downtime objects by way of apply rules. For a particular service, a scheduled downtime object is generated like this (output of icinga2 object list -t scheduleddowntime -n 'myhostname!*'):

Object 'myhostname!myservice!Tägl. Neustart' of type 'ScheduledDowntime':
  * __name = "myhostname!myservice!Tägl. Neustart"
  * author = "myuser"
  * comment = "Neustart Service während Sicherung, siehe Ticket #xxxxxx"
  * duration = 5400
  * fixed = false
  * host_name = "myhostname"
  * name = "Tägl. Neustart"
  * package = "_etc"
  * ranges
    * friday = "00:00-04:00"
    * monday = "00:00-04:00"
    * saturday = "00:00-04:00"
    * sunday = "00:00-04:00"
    * thursday = "00:00-04:00"
    * tuesday = "00:00-04:00"
    * wednesday = "00:00-04:00"
  * service_name = "myservice"
  * templates = [ "Tägl. Neustart" ]
  * type = "ScheduledDowntime"
  * vars (omitted for brevity)
  * zone = "myzone"

However, the next day I found that despite of all of this, alerts were still issued for the service. There were no prior outages of the service in the downtime window.

Excerpt from icinga2.log:

...
[2016-02-23 02:13:51 +0100] information/Checkable: Checking for configured notifications for object 'myhostname!myservice'
[2016-02-23 02:13:51 +0100] information/Notification: Sending notification 'myhostname!myservice!helpdesk-Service-Notification' for user 'helpdesk'
[2016-02-23 02:13:51 +0100] information/Notification: Completed sending notification 'myhostname!myservice!helpdesk-Service-Notification' for checkable 'myhostname!myservice'
...

A database query shows that the downtime was not started:

select entry_time, is_fixed, duration, scheduled_start_time, scheduled_end_time, was_started from icinga_downtimehistory where downtimehistory_id=222;

entry_time => 2016-02-23 09:43:57
is_fixed => 0
duration => 5400
scheduled_start_time => 2016-02-24 00:00:00
scheduled_end_time => 2016-02-24 04:00:00
was_started => 0

With kind regards

icinga-migration commented 8 years ago

Updated by greatexpectations on 2016-02-24 10:40:52 +00:00

Oops, I included the wrong excerpt from icinga2.log (from yesterday). The log looks the same for today, though...

...
[2016-02-24 02:13:13 +0100] information/Checkable: Checking for configured notifications for object 'myhostname!myservice'
[2016-02-24 02:13:13 +0100] information/Notification: Sending notification 'myhostname!myservice!helpdesk-Service-Notification' for user 'helpdesk'
[2016-02-24 02:13:13 +0100] information/Notification: Completed sending notification 'myhostname!myservice!helpdesk-Service-Notification' for checkable 'myhostname!myservice'
...
icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-02-24 19:27:56 +00:00

Could please you enable the debug log and extract 1) the surrounding context for triggering the downtime 2) the notification is sent (it is evaluated against all possible filters).

icinga-migration commented 8 years ago

Updated by jeunito on 2016-02-24 21:48:39 +00:00

+1 Getting this issue too.

icinga-migration commented 8 years ago

Updated by jeunito on 2016-02-24 23:56:52 +00:00

In my case, I use the external command pipe on a master node. The downtime does make it to the master node but does not get propagated to the satellite nodes.

icinga-migration commented 8 years ago

Updated by greatexpectations on 2016-02-25 08:01:21 +00:00

Thank you for looking into this so quickly.

I programmatically enabled the debug log for today from 02:00:00 to 04:00:00 and collected 1.2 GB of logging data (lol).

Unfortunately I have some difficulty in interpreting the log contents, maybe you can shed some light on this? All of the log entries below are presented in order.

First log entry is the first failed service check at 02:05:

[2016-02-25 02:05:05 +0100] notice/Process: PID 27127 ('/redacted/path/check_snmp_win.pl' '-C' 'REDACTED' '-H' '192.168.97.80' '-n' '^.*Redacted Service Name.*$' '-t' '30') terminated with exit code 2

Then it tells me that a downtime was triggered. I guess the next "Not triggering downtime..." message refers to the scheduleddowntime object for the day before or after, but I could be wrong.

[2016-02-25 02:05:05 +0100] notice/Downtime: Triggering downtime 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!redacted.host.name-1456268415-0'.
[2016-02-25 02:05:05 +0100] debug/DbEvents: add log entry history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:05:05 +0100] debug/Downtime: Not triggering downtime 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!redacted.host.name-1456354848-0': current time is outside downtime window.

The DB update log entries seem to indicate that the downtime was triggered correctly.

[2016-02-25 02:05:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_scheduleddowntime SET actual_start_time = TO_TIMESTAMP(1456362305),  actual_start_time_usec = E'889934',  instance_id = 1,  is_in_effect = E'1',  trigger_time = TO_TIMESTAMP(1456362185),  was_started = E'1' WHERE internal_downtime_id = E'13' AND object_id = 5682
[2016-02-25 02:05:05 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2016-02-25 02:05:05 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2016-02-25 02:05:05 +0100] debug/DbEvents: add log entry history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:05:05 +0100] debug/DbEvents: add checkable check history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:05:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_downtimehistory SET actual_start_time = TO_TIMESTAMP(1456362305),  actual_start_time_usec = E'889934',  is_in_effect = E'1',  trigger_time = TO_TIMESTAMP(1456362185),  was_started = E'1' WHERE entry_time = TO_TIMESTAMP(1456268415) AND instance_id = 1 AND internal_downtime_id = E'13' AND scheduled_end_time = TO_TIMESTAMP(1456369200) AND scheduled_start_time = TO_TIMESTAMP(1456354800)
[2016-02-25 02:05:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET scheduled_downtime_depth = E'0' WHERE instance_id = 1 AND service_object_id = 5682
[2016-02-25 02:05:05 +0100] debug/IdoPgsqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (1, TO_TIMESTAMP(1456362305), E'889987', 1, E'SERVICE DOWNTIME ALERT: redacted.host.name;Service-SNMP-Windows-Service-Redacted Service Name;STARTED; Service has entered a period of scheduled downtime.', TO_TIMESTAMP(1456362305), E'262144',5682)

The strange part comes with the next check iteration at 02:06.

It seems that Icinga2 reschedules the downtime (why?). It tells me that the hard service state has changed from CRITICAL to CRITICAL, and promptly begins to issue a notification.

[2016-02-25 02:06:05 +0100] notice/Process: PID 27373 ('/redacted/path//plugins/check_snmp_win.pl' '-C' 'public' '-H' '192.168.97.80' '-n' '^.*Redacted Service Name.*$' '-t' '30') terminated with exit code 2
[2016-02-25 02:06:05 +0100] notice/Downtime: Triggering downtime 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!redacted.host.name-1456268415-0'.
[2016-02-25 02:06:05 +0100] debug/DbEvents: add log entry history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_scheduleddowntime SET actual_start_time = TO_TIMESTAMP(1456362365),  actual_start_time_usec = E'900192',  instance_id = 1,  is_in_effect = E'1',  trigger_time = TO_TIMESTAMP(1456362185),  was_started = E'1' WHERE internal_downtime_id = E'13' AND object_id = 5682
[2016-02-25 02:06:05 +0100] debug/Downtime: Not triggering downtime 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!redacted.host.name-1456354848-0': current time is outside downtime window.
[2016-02-25 02:06:05 +0100] debug/DbEvents: add log entry history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] debug/DbEvents: add checkable check history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] debug/DbEvents: add state change history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] notice/Checkable: State Change: Checkable redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name hard state change from CRITICAL to CRITICAL detected.
[2016-02-25 02:06:05 +0100] information/Checkable: Checking for configured notifications for object 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] debug/Checkable: Checkable 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name' has 1 notification(s).
[2016-02-25 02:06:05 +0100] notice/Notification: Attempting to send notifications for notification object 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!OTRS-Service-Notification'.
[2016-02-25 02:06:05 +0100] debug/Notification: Type 'Problem', TypeFilter OK, Warning, Unknown, Up, Down, DowntimeEnd, DowntimeRemoved, Custom, Acknowledgement, Problem, Recovery, FlappingStart and FlappingEnd (FType=32, TypeFilter=-1)
[2016-02-25 02:06:05 +0100] debug/Notification: State 'Critical', StateFilter OK, Warning, Unknown, Up, Down, DowntimeEnd, DowntimeRemoved, Custom, Acknowledgement, Problem, Recovery, FlappingStart and FlappingEnd (FState=4, StateFilter=-1)
[2016-02-25 02:06:05 +0100] debug/Notification: User notification, Type 'Problem', TypeFilter Unknown, Up, Down and FlappingEnd (FType=32, TypeFilter=-1)
[2016-02-25 02:06:05 +0100] debug/Notification: User notification, State 'Critical', StateFilter OK, Warning, Unknown, Up, Down, DowntimeEnd, DowntimeRemoved, Custom, Acknowledgement, Problem, Recovery, FlappingStart and FlappingEnd (FState=4, StateFilter=-1)
[2016-02-25 02:06:05 +0100] information/Notification: Sending notification 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!OTRS-Service-Notification' for user 'OTRS'
[2016-02-25 02:06:05 +0100] debug/DbEvents: add notification history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2016-02-25 02:06:05 +0100] debug/DbEvents: add contact notification history for service 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name' and user 'OTRS'.
[2016-02-25 02:06:05 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2016-02-25 02:06:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_downtimehistory SET actual_start_time = TO_TIMESTAMP(1456362365),  actual_start_time_usec = E'900192',  is_in_effect = E'1',  trigger_time = TO_TIMESTAMP(1456362185),  was_started = E'1' WHERE entry_time = TO_TIMESTAMP(1456268415) AND instance_id = 1 AND internal_downtime_id = E'13' AND scheduled_end_time = TO_TIMESTAMP(1456369200) AND scheduled_start_time = TO_TIMESTAMP(1456354800)
[2016-02-25 02:06:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET scheduled_downtime_depth = E'0' WHERE instance_id = 1 AND service_object_id = 5682
[2016-02-25 02:06:05 +0100] debug/IdoPgsqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (1, TO_TIMESTAMP(1456362365), E'900265', 1, E'SERVICE DOWNTIME ALERT: redacted.host.name;Service-SNMP-Windows-Service-Redacted Service Name;STARTED; Service has entered a period of scheduled downtime.', TO_TIMESTAMP(1456362365), E'262144', 5682)
[2016-02-25 02:06:05 +0100] debug/IdoPgsqlConnection: Query: INSERT INTO icinga_logentries (endpoint_object_id, entry_time, entry_time_usec, instance_id, logentry_data, logentry_time, logentry_type, object_id) VALUES (1, TO_TIMESTAMP(1456362365), E'900386', 1, E'SERVICE ALERT: redacted.host.name;Service-SNMP-Windows-Service-Redacted Service Name;CRITICAL;HARD;1;No services matching "^.*Redacted Service Name.*$" found : CRITICAL', TO_TIMESTAMP(1456362365), E'65536', 5682)
[2016-02-25 02:06:05 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'Command-SNMP-Manubolon-Windows-Service',  check_source = E'redacted.host.name',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'2',  endpoint_object_id = 1,  event_handler = E'',  event_handler_enabled = E'1',  execution_time = E'0.18781900405883789',  flap_detection_enabled = E'0',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1456362365),  last_hard_state = E'2',  last_hard_state_change = TO_TIMESTAMP(1456362365),  last_notification = TO_TIMESTAMP(1456282873),  last_state_change = TO_TIMESTAMP(1456362185),  last_time_critical = TO_TIMESTAMP(1456362365),  last_time_ok = TO_TIMESTAMP(1456361748),  last_time_unknown = TO_TIMESTAMP(1455909756),  latency = E'0',  long_output = E'',  max_check_attempts = E'3',  next_check = TO_TIMESTAMP(1456362485),  next_notification = TO_TIMESTAMP(1456362362),  normal_check_interval = E'5',  notifications_enabled = E'1',  original_attributes = E'null',  output = E'No services matching "^.*Redacted Service Name.*$" found : CRITICAL',  passive_checks_enabled = E'1',  percent_state_change = E'0',  perfdata = E'',  problem_has_been_acknowledged = E'0',  process_performance_data = E'1',  retry_check_interval = E'1',  scheduled_downtime_depth = E'0',  service_object_id = 5682,  should_be_scheduled = E'1',  state_type = E'1',  status_update_time = TO_TIMESTAMP(1456362365) WHERE service_object_id = 5682
[2016-02-25 02:06:05 +0100] notice/Process: Running command '/etc/icinga2/scripts/notify-otrs': PID 27374
[2016-02-25 02:06:05 +0100] debug/DbEvents: add log entry history for 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'
[2016-02-25 02:06:05 +0100] information/Notification: Completed sending notification 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!OTRS-Service-Notification' for checkable 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name'

Can you make some sense of all that?

With kind regards

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-18 11:11:58 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-04-22 08:38:46 +00:00

ScheduledDowntime work in the way that it only schedules one Downtime per day. Once the downtime is expired Icinga 2 will schedule one for the next day.

That is somewhat strange though.

[2016-02-25 02:06:05 +0100] debug/Downtime: Not triggering downtime 'redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!redacted.host.name-1456354848-0': current time is outside downtime window.

It would be interesting how that Downtime object looks like at that time (e.g. query the REST API for "/v1/objects/downtimes/redacted.host.name!Service-SNMP-Windows-Service-Redacted Service Name!redacted.host.name-1456354848-0")

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-05-02 13:40:55 +00:00

icinga-migration commented 8 years ago

Updated by greatexpectations on 2016-06-23 14:46:55 +00:00

Sorry for the late response. This problem has been fixed with commit https://dev.icinga.org/projects/i2/repository/revisions/0b95be7b9904c9a092fd256c0d07b1057870ebc5

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-06-23 14:59:27 +00:00

Ah yes, thanks for the feedback :)

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-06-23 14:59:38 +00:00