Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

Icinga statehistory not always updated with recovery #5702

Closed NielsH closed 6 years ago

NielsH commented 7 years ago

Hello!

Originally posted here: https://monitoring-portal.org/index.php?thread/42230-icinga-statehistory-not-always-updated-with-recovery/

I was directed to create an issue here. I will re-post the issue I have:

We want to read the icinga_statehistory directly from the icinga ido database into our own application. However, we are noticing an issue here, we do not always see a hard state recovery.

+-----------------+-------------+---------------------+-----------------+-----------+--------------+-------+------------+-----------------------+--------------------+------------+-----------------+-------------------------------------------+-------------+------------------------------+--------------------+
| statehistory_id | instance_id | state_time          | state_time_usec | object_id | state_change | state | state_type | current_check_attempt | max_check_attempts | last_state | last_hard_state | output                                    | long_output | check_source              | endpoint_object_id |
+-----------------+-------------+---------------------+-----------------+-----------+--------------+-------+------------+-----------------------+--------------------+------------+-----------------+-------------------------------------------+-------------+------------------------------+--------------------+
|           40016 |           4 | 2017-10-18 17:48:36 |           27322 |     18947 |            1 |     0 |          1 |                     1 |                  3 |          3 |               0 | PING OK - Packet loss = 0%, RTA = 0.90 ms |             | icinga01 |                616 |
|          111520 |           4 | 2017-10-25 16:16:55 |          527600 |     18947 |            1 |     2 |          0 |                     1 |                  3 |          0 |               0 | PING CRITICAL - Packet loss = 100%        |             | icinga02 |                610 |
|          111854 |           4 | 2017-10-25 16:17:15 |          277616 |     18947 |            1 |     2 |          0 |                     2 |                  3 |          2 |               0 | PING CRITICAL - Packet loss = 100%        |             | icinga02 |                610 |
|          112142 |           4 | 2017-10-25 16:17:35 |          273171 |     18947 |            1 |     2 |          1 |                     1 |                  3 |          2 |               2 | PING CRITICAL - Packet loss = 100%        |             | icinga02 |                610 |
+-----------------+-------------+---------------------+-----------------+-----------+--------------+-------+------------+-----------------------+--------------------+------------+-----------------+-------------------------------------------+-------------+------------------------------+--------------------+

Please see above for the ping4 check for example. We do see the hard state change for the ping check, but while the check currently reports OK again, it is never logged into the icinga_statehistory table.

This is my query:

select * from icinga_statehistory WHERE `object_id` = '18947';

The object_id is retriev like this:

select * from icinga_objects WHERE `object_id` = '18947';
+-----------+-------------+---------------+---------------+-------+-----------+
| object_id | instance_id | objecttype_id | name1         | name2 | is_active |
+-----------+-------------+---------------+---------------+-------+-----------+
|     18947 |           4 |             2 | <hidden>      | ping4 |         1 |
+-----------+-------------+---------------+---------------+-------+-----------+

Please see below a screenshot of Icingaweb. As you can see, even though there definitely was a hard state change, it currently shows the service as "Up", and it also says "Ok since 18 oct" which is incorrect:

image

I am not sure what exactly causes it sadly, it does not happen all the time; when we tested a manual downtime things work as expected. Of course, when something goes down unexpectedly things are a bit different, and this is how it is reported now...

Your Environment

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/) License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

Application information: Installation root: /usr Sysconf directory: /etc Run directory: /run Local state directory: /var Package data directory: /usr/share/icinga2 State path: /var/lib/icinga2/icinga2.state Modified attributes path: /var/lib/icinga2/modified-attributes.conf Objects path: /var/cache/icinga2/icinga2.debug Vars path: /var/cache/icinga2/icinga2.vars PID path: /run/icinga2/icinga2.pid

System information: Platform: Debian GNU/Linux Platform version: 9 (stretch) Kernel: Linux Kernel version: 4.9.0-3-amd64 Architecture: x86_64

Build information: Compiler: GNU 6.3.0 Build host: f42b8d72881d

* Operating System and version: Debian 9 64-bit
* Enabled features (`icinga2 feature list`):

Disabled features: command compatlog debuglog gelf graphite influxdb livestatus opentsdb perfdata statusdata syslog Enabled features: api checker ido-mysql mainlog notification


* Icinga Web 2 version and modules (System - About): 2.4.2 - git commit 84d2cd05618146132064e11c268642fd3c24de63
Monitoring module loaded

* Config validation (`icinga2 daemon -C`):

information/cli: Icinga application loader (version: r2.7.1-1) information/cli: Loading configuration file(s). information/ConfigItem: Committing config item(s). information/ApiListener: My API identity: warning/ApplyRule: Apply rule 'apt' (in /etc/icinga2/zones.d/global-templates/apt.conf: 3:1-3:19) for type 'Service' does not match anywhere! information/ConfigItem: Instantiated 1 ApiUser. information/ConfigItem: Instantiated 1 ApiListener. information/ConfigItem: Instantiated 254 Zones. information/ConfigItem: Instantiated 1 FileLogger. information/ConfigItem: Instantiated 252 Endpoints. information/ConfigItem: Instantiated 2 NotificationCommands. information/ConfigItem: Instantiated 12043 Notifications. information/ConfigItem: Instantiated 219 CheckCommands. information/ConfigItem: Instantiated 252 Hosts. information/ConfigItem: Instantiated 1 IcingaApplication. information/ConfigItem: Instantiated 3 TimePeriods. information/ConfigItem: Instantiated 2 Users. information/ConfigItem: Instantiated 5838 Services. information/ConfigItem: Instantiated 1 CheckerComponent. information/ConfigItem: Instantiated 1 NotificationComponent. information/ConfigItem: Instantiated 1 IdoMysqlConnection. information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars' information/cli: Finished validating the configuration file(s).


* If you run multiple Icinga 2 instances, the `zones.conf` file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes.
Well, this has over 5000 lines... However, see below for the one that I am currently reporting the issue for:

Object '' of type 'Endpoint': % declared in '/etc/icinga2/zones.d/master/.conf', lines 10:1-10:31

Object '<hidden>' of type 'Zone':
  % declared in '/etc/icinga2/zones.d/master/<hidden>.conf', lines 4:1-4:27
  * __name = "<hidden>"
  * endpoints = [ "<hidden>" ]
    % = modified in '/etc/icinga2/zones.d/master/<hidden>.conf', lines 5:3-5:32
  * global = false
    % = modified in '/etc/icinga2/zones.d/master/<hidden>.conf', lines 6:3-6:16
  * name = "<hidden>"
  * package = "_etc"
  * parent = "master"
    % = modified in '/etc/icinga2/zones.d/master/<hidden>.conf', lines 7:3-7:19
  * source_location
    * first_column = 1
    * first_line = 4
    * last_column = 27
    * last_line = 4
    * path = "/etc/icinga2/zones.d/master/<hidden>.conf"
  * templates = [ "<hidden>" ]
    % = modified in '/etc/icinga2/zones.d/master/<hidden>.conf', lines 4:1-4:27
  * type = "Zone"
  * zone = "master"

Thank you!

NielsH commented 7 years ago

We still have this problem, but I'm having difficulties getting reproducible steps.

We have a setup with 2 icinga master nodes monitoring multiple servers across 2 DCs. We noticed that this happens when an entire DC is having some issues, a DC where an Icinga master node is in as well. So during the time things are down, one of the icinga master nodes is down as well.

Could it be caused by a scenario like this? :

  1. DC1 is having issues
  2. monitor2 sees this and takes over the DB-IDO connection from monitor1.
  3. monitor2 sees client1 going down and sends a notification
  4. DC1 comes back up and takes over the DB-IDO connection from monitor2
  5. the first check performed by monitor1 to client1. The state is "UP" again. Because monitor1 does not know monitor2 sent a notification (because the communication between monitor1 / monitor 2 was down as well) no notification is sent out that the service is UP again.

Is something like that possible? Or is the entire notification logic/ state kept in the IDO DB and not within the icinga2 daemon itself communicated through the API connection between them?

Cheers!

dnsmichi commented 6 years ago

The Icinga Web 2 history tab combines many events, not only those from the statehistory table. The notifications table also is taken into account.

Still, I have no idea why the service state would be marked as UP. Unfortunately the screenshot hides the content, but I would say this entry is of the type "Notification", and not an OK recovery.

In terms of the recovery entry in the statehistory table - is there a corresponding icinga2.log entry for such a state change which would trigger a notification later on? This might need the debuglog feature enabled. I don't care about the notification fired afterwards, only the state change detection is important.

Anyways. I'd suggest to upgrade to 2.5.x for Icinga Web 2 and 2.8.x for Icinga 2 and test again.

dnsmichi commented 6 years ago

Did you find anything? Otherwise I'll close here.