Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Checks run on all nodes in a zone, not on only one #5591

Closed bardahlm closed 6 years ago

bardahlm commented 7 years ago

Expected Behavior

Checks should run once every check_interval on one of the nodes in a zone.

Current Behavior

Checks seem to run every check_interval on every node in the zone, effectively halving the check_interval when a zone has two nodes.

Notice the graph below:

[Graph: icinga scheduler, check results coloured by executing node]

The colour indicates which node performed the check. The service has a check_interval of 300 seconds (5 minutes). As can be seen, the nodes alternate checking the service, resulting in an effective check_interval of around 150 seconds.

Context

We have four zones, each with two nodes, checking around 5,000 hosts and around 40,000 services. The current behaviour means we use twice the CPU and produce double the amount of performance data.

Your Environment

Master zones.conf:

```
object Endpoint "icinga-app01.domain.tld" { host = "icinga-app01.domain.tld" }

object Zone "icinga-app01.domain.tld" { endpoints = [ "icinga-app01.domain.tld" ] }

object Zone "global-templates" { global = true }

object Endpoint "icinga-app02.domain.tld" { host = "icinga-app02.domain.tld" }
object Endpoint "icinga-app03.domain.tld" { host = "icinga-app03.domain.tld" }
object Endpoint "icinga-app04.domain.tld" { host = "icinga-app04.domain.tld" }
object Endpoint "icinga-app05.domain.tld" { host = "icinga-app05.domain.tld" }
object Endpoint "icinga-app06.domain.tld" { host = "icinga-app06.domain.tld" }
object Endpoint "icinga-app07.domain.tld" { host = "icinga-app07.domain.tld" }
object Endpoint "icinga-app08.domain.tld" { host = "icinga-app08.domain.tld" }
object Endpoint "icinga-app09.domain.tld" { host = "icinga-app09.domain.tld" }

object Zone "loadbalanced1" { endpoints = [ "icinga-app02.domain.tld", "icinga-app03.domain.tld" ] parent = "icinga-app01.domain.tld" }
object Zone "loadbalanced2" { endpoints = [ "icinga-app04.domain.tld", "icinga-app05.domain.tld" ] parent = "icinga-app01.domain.tld" }
object Zone "loadbalanced3" { endpoints = [ "icinga-app06.domain.tld", "icinga-app07.domain.tld" ] parent = "icinga-app01.domain.tld" }
object Zone "loadbalanced4" { endpoints = [ "icinga-app08.domain.tld", "icinga-app09.domain.tld" ] parent = "icinga-app01.domain.tld" }
```

Satellite zones.conf:

```
object Endpoint "icinga-app01.domain.tld" { }

object Zone "master" { endpoints = [ "icinga-app01.domain.tld" ]; }

object Endpoint NodeName { }

object Zone "loadbalanced4" { endpoints = [ NodeName ]; parent = "master"; }
```


dnsmichi commented 7 years ago

This could indicate a problem with your endpoint connections. If both endpoints end up in a split-brain scenario, they will each start to execute checks on their own.

You should also trace the check_source attribute for these checks, e.g. by connecting to the API event streams on the master and filtering for a selected service.
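Such a subscription could look like this (a minimal sketch; the ApiUser "root" with password "icinga", the queue name, and the concrete service name are assumptions):

```
# Subscribe to CheckResult events for one service and watch the check_source field
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
  'https://localhost:5665/v1/events?queue=debug&types=CheckResult&filter=event.service==%22ping4%22'
```

Every CheckResult event carries a check_source attribute naming the endpoint that actually executed the check.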

bardahlm commented 7 years ago

I include check_source as a tag when sending data to InfluxDB, as indicated in the graph ("Checker: ").

```
service_template = {
  measurement = "$service.check_command$"
  tags = {
    hostname = "$host.name$"
    service = "$service.name$"
    checker = "$service.check_source$"
  }
}
```
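For context, this template sits inside the InfluxdbWriter feature object; a minimal sketch with assumed connection values:

```
object InfluxdbWriter "influxdb" {
  host = "127.0.0.1"      // assumed InfluxDB host
  port = 8086             // assumed InfluxDB port
  database = "icinga2"    // assumed database name
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
      checker = "$service.check_source$"
    }
  }
}
```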

How can one see if the endpoints are in a split-brain scenario?

dnsmichi commented 7 years ago

Cluster health checks, for example.
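The built-in cluster check command reports which endpoints a node is connected to; a minimal sketch, assuming a Host object exists for the master:

```
// Alerts when any configured endpoint is disconnected from this node.
object Service "cluster" {
  check_command = "cluster"
  host_name = "icinga-app01.domain.tld"
}
```

Scheduling the same check inside each satellite zone would also reveal whether the two zone members actually see each other.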

bardahlm commented 7 years ago

Icinga 2 Cluster is running: Connected Endpoints: 8 (icinga-app02.domain.tld, icinga-app03.domain.tld, icinga-app04.domain.tld, icinga-app05.domain.tld, icinga-app07.domain.tld, icinga-app06.domain.tld, icinga-app08.domain.tld, icinga-app09.domain.tld).

I get an alarm immediately if I shut down one of the endpoints, so everything seems to be OK.

dnsmichi commented 7 years ago

You could use the methods explained here to analyse the object authority on all involved endpoints.

https://www.icinga.com/2016/08/11/analyse-icinga-2-problems-using-the-console-api/

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#late-check-results-in-distributed-environments
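In short: the console API lets you check, on each endpoint, whether that node considers itself authoritative for an object. A sketch, reusing the assumed API credentials from above; the paused attribute is true on the node that has ceded the object to its zone peer:

```
$ icinga2 console --connect 'https://root:icinga@localhost:5665/'
<1> => get_service("switch.domain.tld", "ping4").paused
false
```

If both zone members report paused = false for the same object at the same time, both are executing its checks, which would match the graph above.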

bardahlm commented 7 years ago

I ran the following command:

```
$ curl -H "Accept: application/json" -k -s -u $ICINGA2_API_USERNAME:$ICINGA2_API_PASSWORD -X POST \
  'https://localhost:5665/v1/events?queue=debugchecks&types=CheckResult&filter=match%28%22ping4*%22,event.service%29' \
  | grep switch.domain.tld
```

It returned the following CheckResult events (one JSON object per line):

```
{"check_result":{"active":true,"check_source":"icinga-app08.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986427.3915150166,"execution_start":1505986418.3382890224,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 8.78 ms","performance_data":["rta=8.780000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986427.3915560246,"schedule_start":1505986418.3380548954,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986427.4008340836,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app09.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986557.3339450359,"execution_start":1505986548.2796299458,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 11.55 ms","performance_data":["rta=11.554000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986557.3339819908,"schedule_start":1505986548.2794408798,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986557.3372840881,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app08.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986728.3194990158,"execution_start":1505986719.2618420124,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 11.60 ms","performance_data":["rta=11.603000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986728.3195469379,"schedule_start":1505986719.2615737915,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986728.3284609318,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app09.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986862.7052030563,"execution_start":1505986853.6542370319,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 8.32 ms","performance_data":["rta=8.316000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986862.7052431107,"schedule_start":1505986853.6540000439,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986862.7090001106,"type":"CheckResult"}
```

The checks are run at the following times:

```
endpoint       execution_start         relative time (s)
icinga-app08   1505986418.3382890224     0
icinga-app09   1505986548.2796299458   130
icinga-app08   1505986719.2618420124   301
icinga-app09   1505986853.6542370319   435
```

Each endpoint re-runs the check roughly every 300 seconds, but offset from its peer, so the service is effectively checked about every 150 seconds.

dnsmichi commented 6 years ago

To me it sounds like the two zone endpoints are not connected to each other. Furthermore, you are using 2.7.0, which is no longer supported. Please upgrade to 2.8.0 and ensure that your cluster is intact.

Cheers, Michael

bardahlm commented 6 years ago

How are zone endpoints supposed to be connected to each other? How do I verify this?

dnsmichi commented 6 years ago

E.g. by setting up health checks; there is more in the docs. Or dive into the troubleshooting sections for the cluster and for check results.
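Connectivity can also be queried directly from the REST API (a sketch, reusing the assumed credentials from above; exact field names can vary between versions):

```
# Run on each zone member; the output lists connected and disconnected endpoints
curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener'
```

On a healthy HA zone, each member should list its zone peer among the connected endpoints.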

bardahlm commented 6 years ago

I finally figured out what I had done wrong. The config on the satellite nodes was missing a reference to the other node.

I added/changed the configuration on icinga-app08 as follows (with similar changes on icinga-app09):

object Endpoint "icinga-app08.domain.tld" {
}

object Endpoint "icinga-app09.domain.tld" {
  host = "icinga-app09.domain.tld"
}

object Zone "loadbalanced4" {
  endpoints = [ "icinga-app08.domain.tld", "icinga-app09.domain.tld" ];
  parent = "master";
}

Now checks seem to execute as they should.
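With both endpoints listed in the zone and able to reach each other, the two nodes split the checkable objects between them; the console sketch above should then show paused = true on exactly one of the two nodes for any given object.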