Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Multi-master multi-zone sync loop causing active checks to go stale #6422

Closed Z3po closed 6 years ago

Z3po commented 6 years ago

When working with a multi-master setup spanning multiple zones, a config-sync loop (hiccup?) will result in stale checks that go unnoticed.

Our Setup:

| Zone 1 | Zone 2 | Zone 3 |
|--------|--------|--------|
| Master1, Master2 | Master3, Master4 | |
| Zone1Satellite1, Zone1Satellite2 | Zone2Satellite1, Zone2Satellite2 | Zone3Satellite1, Zone3Satellite2 |
| Node1, Node2, Node3... | Node100, Node101, Node103... | Node201, Node202, Node203... |

We are using this setup to monitor ~110k services on ~3.5k hosts. The master zone replicates its data to a PostgreSQL database, and the results are visualized via Icinga Web 2. The HA flag is enabled (all masters share the same database). On top of that setup, we are using Icinga Director to create new hosts and services.
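
For reference, the HA flag mentioned above is the enable_ha attribute on the IDO connection; a minimal sketch of such an object (host and credentials are hypothetical):

object IdoPgsqlConnection "ido-pgsql" {
    host = "pgsql.server.lan"      // hypothetical database host
    user = "icinga"
    password = "secret"
    database = "icinga"
    enable_ha = true               // with HA enabled, only one master in the zone actively writes to the database
}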

The issue appeared when we acknowledged service problems on ~400 hosts in bulk on 2018-06-26 at ~17:00.

After recognizing that something was wrong, the masters were restarted on 2018-06-26 at ~00:26, and one of the satellite daemons was restarted on 2018-06-27 at 12:00.

Expected Behaviour

The acknowledgments and comments are synced smoothly, and checks continue to be executed on time (as they should).

Current Behavior

The config-sync down to the satellite zones seems to be stuck in a loop. There is no indication of what is going wrong, just log lines like the following over and over again:

[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Triggering Start signal for config items
[2018-06-26 20:24:31 +0200] information/ConfigItem: Activated all objects.
[2018-06-26 20:24:31 +0200] information/ConfigItem: Committing config item(s).
[2018-06-26 20:24:31 +0200] information/ConfigItem: Instantiated 1 Comment.

In numbers:

# zgrep -h 'Triggering Start signal for config items' icinga2.log* | cut -c -14 | sort | uniq -c
     70 [2018-06-25 10
     63 [2018-06-25 15
      5 [2018-06-26 09
    131 [2018-06-26 11
     35 [2018-06-26 13
    107 [2018-06-26 14
    111 [2018-06-26 15
      4 [2018-06-26 16
  99452 [2018-06-26 17
 113210 [2018-06-26 18
 112075 [2018-06-26 19
 113379 [2018-06-26 20
 113253 [2018-06-26 21
 113955 [2018-06-26 22
 114771 [2018-06-26 23
  96737 [2018-06-27 00
  62995 [2018-06-27 01
  63233 [2018-06-27 02
  63006 [2018-06-27 03
  63074 [2018-06-27 04
  62947 [2018-06-27 05
  63091 [2018-06-27 06
  62842 [2018-06-27 07
  62739 [2018-06-27 08
  63225 [2018-06-27 09
  63145 [2018-06-27 10
  63013 [2018-06-27 11
    693 [2018-06-27 12
     31 [2018-06-27 13
    147 [2018-06-27 14
     53 [2018-06-27 15
      1 [2018-06-27 17
      1 [2018-06-28 10
    115 [2018-06-28 13
     39 [2018-06-28 14
      3 [2018-06-28 15
      2 [2018-06-28 18
     35 [2018-06-29 09
      1 [2018-06-29 17
    219 [2018-07-02 10

When checking the filesystem, it turned out that the comments had only partly been synced. Deleting one of the existing comments just caused it to re-appear during the config-sync loop.
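
One way to see the partial sync, sketched under the assumption that the comment files live in the _api package path quoted under "Possible Solution" below (the directory name varies per endpoint): list the synced comment files on a master and on a satellite, then compare the sets:

# run on each node; the package directory name differs per endpoint
ls /var/lib/icinga2/api/packages/_api/*/conf.d/comments/ | sort > /tmp/comments.$(hostname -s)
# copy both lists to one host, then:
diff /tmp/comments.master1 /tmp/comments.satellite1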

Additionally, all checks depending on the satellites (the Icinga agents are connected to the satellites in their specific zone) go stale and stay in an old state. There is no indication that anything is going wrong; the Icinga Web 2 instance only tells me that checks should have been executed a day or more ago.

Possible Solution

The issue was partly solved by restarting the satellites in that specific zone. Deleting the already-synced comments and restarting the daemon again fully solved it (at least the comments locally available in /var/lib/icinga2/api/packages/_api/Zone2Satellite1-12345678-1/conf.d/comments then matched the ones from the master instances).
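
A rough sketch of that recovery, assuming systemd and the package path quoted above:

# on the affected satellite
systemctl stop icinga2
rm -f /var/lib/icinga2/api/packages/_api/Zone2Satellite1-12345678-1/conf.d/comments/*.conf
systemctl start icinga2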

Steps to Reproduce (for bugs)

I've not (yet) been able to reproduce the issue. It happens from time to time.

Context

As stated under "Current Behavior", checks go stale without further notice because the satellite daemons are stuck in a config-sync loop and no longer schedule checks. The Icinga cluster and icinga service checks do not indicate an issue, and I'm not aware of any other health check that exposes this erroneous behaviour.

Your Environment

Anything Else?

I have definitely missed some important information. Bear with me and let me know what's missing so I can improve my bug report.

I'm not sure but that might be related to issue #5794 and might be easier to debug if issue #5213 would have been resolved :)

I'd be happy about any idea on how I can monitor this situation and at least get notified about an issue inside the cluster.
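
One idea (a sketch, not a verified solution; credentials, the master hostname, and the staleness threshold are assumptions) would be to query the REST API on a master for services whose last check is much older than their check interval:

curl -k -s -u root:icinga \
    -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST \
    'https://mmonitor-master-eu001.server.lan:5665/v1/objects/services' \
    -d '{ "filter": "service.last_check < get_time() - 2 * service.check_interval", "attrs": [ "last_check", "check_interval" ] }'

A result set containing more than a handful of services would suggest that some zone has stopped scheduling checks.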

dnsmichi commented 6 years ago

More than 2 endpoints in a zone won't work.

Zone1Node1, Zone1Node2, Zone1Node3....

isn't supported atm and is probably the culprit behind such a loop.
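
In other words, the supported shape of an HA zone is at most two endpoints, e.g. (hostnames hypothetical):

object Zone "master" {
    endpoints = [ "master1.server.lan", "master2.server.lan" ]    // at most two endpoints per zone
}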

Z3po commented 6 years ago

Sorry, my table above doesn't show the architecture very well...

All of the nodes are in their own zone, as they have "useagent" set. Nevertheless, the parent zone is the very same satellite zone; that's what I wanted to show with it. But we have 4 hosts in the master zone...
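
For illustration, each agent node gets a zone of roughly this shape (hostname hypothetical), with the satellite zone as its parent:

object Endpoint "node1.server.lan" { }

object Zone "node1.server.lan" {
    endpoints = [ "node1.server.lan" ]
    parent = "Zone1"    // the satellite zone, not the master zone
}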

I've read about such issues, but I couldn't find the documentation stating that one shouldn't have more than 2 masters in the master zone. Can you please point me to the correct location?

So is that issue caused by using 4 master nodes?

EDIT: I've updated my description above to show the nodes are not in the corresponding parent-zone.

dnsmichi commented 6 years ago

Can you share the zones.conf from one of the masters? The table above doesn't show 4 masters in a single zone.

Z3po commented 6 years ago

Sure I can:

# icingacli director zone show master
object Zone "master" {
    endpoints = [
        "mmonitor-master-eu001.server.lan",
        "mmonitor-master-eu002.server.lan",
        "mmonitor-master-eu101.server.lan",
        "mmonitor-master-eu102.server.lan"
    ]
}

and then the zones below it:

# icingacli director zone show Zone1
object Zone "Zone1" {
    parent = "master"
    endpoints = [
        "mmonitor-zoneworker-eu001.server.lan",
        "mmonitor-zoneworker-eu002.server.lan"
    ]
}

# icingacli director zone show Zone2
object Zone "Zone2" {
    parent = "master"
    endpoints = [
        "mmonitor-zoneworker-eu101.server.lan",
        "mmonitor-zoneworker-eu102.server.lan"
    ]
}

# icingacli director zone show Zone3
object Zone "Zone3" {
    parent = "master"
    endpoints = [
        "mmonitor-zoneworker-us001.server.lan",
        "mmonitor-zoneworker-us002.server.lan"
    ]
}

The table should have shown that the masters are split across 2 "physical" zones, meaning datacenters. Therefore 4 masters, but in different zones... yeah, that table is probably crap. Sorry.

dnsmichi commented 6 years ago

Ah ok. That's not gonna work, #3533 is the reason.

Z3po commented 6 years ago

Ah, ok, thanks! So we're not supposed to have more than 2 nodes in any zone, right?

Could you perhaps add this information to the cluster documentation (https://www.icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/)? AFAICS it's not mentioned there... or am I wrong?

dnsmichi commented 6 years ago

You can add it yourself: just fork this repo and edit the corresponding file on GitHub. AFAIK it used to be there; it might have been accidentally removed.

Z3po commented 6 years ago

That's right... will look into it :)

Is there any way I can detect this misbehaviour in my cluster setup? The Icinga Service checks and cluster checks did not show any problem....

dnsmichi commented 6 years ago

The config validation should log a warning if your zone has more than two endpoints in it. https://github.com/Icinga/icinga2/blob/master/lib/remote/zone.cpp#L151
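
Assuming that warning fires during validation, it should be visible when validating the config (the exact wording may differ):

# validate the configuration and look for the endpoint-count warning
icinga2 daemon -C 2>&1 | grep -i 'more than two endpoints'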

dnsmichi commented 6 years ago

AFAIK I've added something to the docs in a different PR.