Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Multi-master multi-zone sync loop causing active checks to go stale #6422

Closed Z3po closed 6 years ago

Z3po commented 6 years ago

When working with a multi-master setup spanning multiple zones, a config-sync loop (hiccup?) will result in stale checks that go unnoticed.

Our Setup:

| Zone 1 | Zone 2 | Zone 3 |
|--------|--------|--------|
| Master1, Master2 | Master3, Master4 | |
| Zone1Satellite1, Zone1Satellite2 | Zone2Satellite1, Zone2Satellite2 | Zone3Satellite1, Zone3Satellite2 |
| Node1, Node2, Node3... | Node100, Node101, Node103... | Node201, Node202, Node203... |

We are using this setup to monitor ~110k services on ~3.5k hosts. The master zone replicates its data to a PostgreSQL database, and the results are visualized via Icinga Web 2. The HA flag is enabled (all masters share the same database). On top of that setup, we are using Icinga Director to create new hosts and services.
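
For reference, the HA flag mentioned above is the enable_ha attribute on the IDO connection; a minimal sketch of such an object (host and credentials are hypothetical):

object IdoPgsqlConnection "ido-pgsql" {
    host = "pgsql.server.lan"      // hypothetical database host
    user = "icinga"
    password = "secret"
    database = "icinga"
    enable_ha = true               // with HA enabled, only one master in the zone actively writes to the database
}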

The issue appeared when we acknowledged service problems on ~400 hosts in bulk on 2018-06-26 at ~17:00.

After recognizing that something was wrong, the masters were restarted on 2018-06-26 at ~00:26, and one of the satellite daemons was restarted on 2018-06-27 at 12:00.

Expected Behaviour

The acknowledgments and comments are synced smoothly, and checks continue to be executed on time (as they should).

Current Behavior

The config-sync down to the satellite zones seems to be stuck in a loop. There is no indication of what is going wrong, just log lines like the following over and over again:

[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.

In numbers:

# zgrep -h 'Triggering Start signal for config items' icinga2.log* | cut -c -14 | sort | uniq -c
     70 [2018-06-25 10
     63 [2018-06-25 15
      5 [2018-06-26 09
    131 [2018-06-26 11
     35 [2018-06-26 13
    107 [2018-06-26 14
    111 [2018-06-26 15
      4 [2018-06-26 16
  99452 [2018-06-26 17
 113210 [2018-06-26 18
 112075 [2018-06-26 19
 113379 [2018-06-26 20
 113253 [2018-06-26 21
 113955 [2018-06-26 22
 114771 [2018-06-26 23
  96737 [2018-06-27 00
  62995 [2018-06-27 01
  63233 [2018-06-27 02
  63006 [2018-06-27 03
  63074 [2018-06-27 04
  62947 [2018-06-27 05
  63091 [2018-06-27 06
  62842 [2018-06-27 07
  62739 [2018-06-27 08
  63225 [2018-06-27 09
  63145 [2018-06-27 10
  63013 [2018-06-27 11
    693 [2018-06-27 12
     31 [2018-06-27 13
    147 [2018-06-27 14
     53 [2018-06-27 15
      1 [2018-06-27 17
      1 [2018-06-28 10
    115 [2018-06-28 13
     39 [2018-06-28 14
      3 [2018-06-28 15
      2 [2018-06-28 18
     35 [2018-06-29 09
      1 [2018-06-29 17
    219 [2018-07-02 10

When checking the filesystem, it turned out that the comments had only partly been synced. Deleting one of the existing comments just caused it to re-appear during the config-sync loop.
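
One way to see the partial sync, sketched under the assumption that the comment files live in the _api package path quoted under "Possible Solution" below (the directory name varies per endpoint): list the synced comment files on a master and on a satellite, then compare the sets:

# run on each node; the package directory name differs per endpoint
ls /var/lib/icinga2/api/packages/_api/*/conf.d/comments/ | sort > /tmp/comments.$(hostname -s)
# copy both lists to one host, then:
diff /tmp/comments.master1 /tmp/comments.satellite1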

Additionally, all checks depending on the satellites (the Icinga agents are connected to the satellites in their specific zone) go stale and stay in an old state. There is no indication that anything is going wrong; the Icinga Web 2 instance only tells me that checks should have been executed a day or more ago.

Possible Solution

The issue was partly solved by restarting the satellites in that specific zone. Deleting the already-synced comments and restarting the daemon again fully solved it (at least the comments locally available in /var/lib/icinga2/api/packages/_api/Zone2Satellite1-12345678-1/conf.d/comments then matched the ones from the master instances).
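
A rough sketch of that recovery, assuming systemd and the package path quoted above:

# on the affected satellite
systemctl stop icinga2
rm -f /var/lib/icinga2/api/packages/_api/Zone2Satellite1-12345678-1/conf.d/comments/*.conf
systemctl start icinga2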

Steps to Reproduce (for bugs)

I've not (yet) been able to reproduce the issue. It happens from time to time.

Context

As stated under "Current Behavior", checks go stale without further notice because the satellite daemons are stuck in a config-sync loop and no longer schedule checks. The Icinga cluster and icinga service checks do not indicate an issue, and I'm not aware of any other health check that exposes this erroneous behaviour.

Your Environment

Anything Else?

I have definitely missed some important information. Bear with me and let me know what's missing so I can improve my bug report.

I'm not sure but that might be related to issue #5794 and might be easier to debug if issue #5213 would have been resolved :)

I'd be happy about any idea on how I can monitor this situation and at least get notified about an issue inside the cluster.
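
One idea (a sketch, not a verified solution; credentials, the master hostname, and the staleness threshold are assumptions) would be to query the REST API on a master for services whose last check is much older than their check interval:

curl -k -s -u root:icinga \
    -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST \
    'https://mmonitor-master-eu001.server.lan:5665/v1/objects/services' \
    -d '{ "filter": "service.last_check < get_time() - 2 * service.check_interval", "attrs": [ "last_check", "check_interval" ] }'

A result set containing more than a handful of services would suggest that some zone has stopped scheduling checks.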

dnsmichi commented 6 years ago

More than 2 endpoints in a zone won't work.

Zone1Node1, Zone1Node2, Zone1Node3....

isn't supported atm and is probably the culprit behind such a loop.
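
In other words, the supported shape of an HA zone is at most two endpoints, e.g. (hostnames hypothetical):

object Zone "master" {
    endpoints = [ "master1.server.lan", "master2.server.lan" ]    // at most two endpoints per zone
}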

Z3po commented 6 years ago

Sorry, my table above doesn't show the architecture very well...

All of the nodes are in their own zone, as they have "useagent" set. Nevertheless, the parent zone is the very same satellite zone; that's what I wanted to show with it. But we have 4 hosts in the master zone...
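
For illustration, each agent node gets a zone of roughly this shape (hostname hypothetical), with the satellite zone as its parent:

object Endpoint "node1.server.lan" { }

object Zone "node1.server.lan" {
    endpoints = [ "node1.server.lan" ]
    parent = "Zone1"    // the satellite zone, not the master zone
}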

I've read about such issues, but I couldn't find the documentation stating that one shouldn't have more than 2 masters in the master zone. Can you please point me to the correct location?

So is that issue caused by using 4 master nodes?

EDIT: I've updated my description above to show the nodes are not in the corresponding parent-zone.

dnsmichi commented 6 years ago

Can you share the zones.conf from one of the masters? The table above doesn't show 4 masters in a single zone.

Z3po commented 6 years ago

Sure I can:

# icingacli director zone show master
object Zone "master" {
    endpoints = [
        "mmonitor-master-eu001.server.lan",
        "mmonitor-master-eu002.server.lan",
        "mmonitor-master-eu101.server.lan",
        "mmonitor-master-eu102.server.lan"
    ]
}

and then the zones below it:

# icingacli director zone show Zone1
object Zone "Zone1" {
    parent = "master"
    endpoints = [
        "mmonitor-zoneworker-eu001.server.lan",
        "mmonitor-zoneworker-eu002.server.lan"
    ]
}

# icingacli director zone show Zone2
object Zone "Zone2" {
    parent = "master"
    endpoints = [
        "mmonitor-zoneworker-eu101.server.lan",
        "mmonitor-zoneworker-eu102.server.lan"
    ]
}

# icingacli director zone show Zone3
object Zone "Zone3" {
    parent = "master"
    endpoints = [
        "mmonitor-zoneworker-us001.server.lan",
        "mmonitor-zoneworker-us002.server.lan"
    ]
}

The table should have shown that the masters are split across 2 "physical" zones, meaning datacenters. Therefore 4 masters, but in different zones... yeah, that table is probably crap. Sorry.

dnsmichi commented 6 years ago

Ah ok. That's not gonna work, #3533 is the reason.

Z3po commented 6 years ago

Ah, ok, thanks! So we're not supposed to have more than 2 nodes in any zone, right?

Could you perhaps add this information to the cluster documentation (https://www.icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/)? AFAICS it's not mentioned there... or am I wrong?

dnsmichi commented 6 years ago

You can add it yourself: just fork this repo and edit the corresponding file on GitHub. AFAIK it used to be there; it might have been accidentally removed.

Z3po commented 6 years ago

That's right... will look into it :)

Is there any way I can detect this misbehaviour in my cluster setup? The Icinga Service checks and cluster checks did not show any problem....

dnsmichi commented 6 years ago

The config validation should log a warning if your zone has more than two endpoints in it. https://github.com/Icinga/icinga2/blob/master/lib/remote/zone.cpp#L151
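
Assuming that warning fires during validation, it should be visible when validating the config (the exact wording may differ):

# validate the configuration and look for the endpoint-count warning
icinga2 daemon -C 2>&1 | grep -i 'more than two endpoints'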

dnsmichi commented 6 years ago

AFAIK I've added something to the docs in a different PR.