Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Checks run on all nodes in a zone, not on only one #5591

Closed bardahlm closed 6 years ago

bardahlm commented 7 years ago

Expected Behavior

Checks should run once every check_interval on one of the nodes in a zone.

Current Behavior

Checks seem to run every check_interval on every node in the zone, effectively halving the check_interval when a zone has two nodes.

Notice the graph below:

[Graph: icinga scheduler, check results coloured by executing node]

The colour indicates which node performed the check. The service has a check_interval of 300 seconds (5 minutes). As can be seen, the nodes alternate checking the service, resulting in an effective check_interval of around 150 seconds.

Context

We have four zones, each with two nodes, checking around 5,000 hosts and around 40,000 services. The current behaviour means we use twice the CPU and produce double the amount of performance data.

Your Environment

Master zones.conf:

```
object Endpoint "icinga-app01.domain.tld" { host = "icinga-app01.domain.tld" }

object Zone "icinga-app01.domain.tld" { endpoints = [ "icinga-app01.domain.tld" ] }

object Zone "global-templates" { global = true }

object Endpoint "icinga-app02.domain.tld" { host = "icinga-app02.domain.tld" }
object Endpoint "icinga-app03.domain.tld" { host = "icinga-app03.domain.tld" }
object Endpoint "icinga-app04.domain.tld" { host = "icinga-app04.domain.tld" }
object Endpoint "icinga-app05.domain.tld" { host = "icinga-app05.domain.tld" }
object Endpoint "icinga-app06.domain.tld" { host = "icinga-app06.domain.tld" }
object Endpoint "icinga-app07.domain.tld" { host = "icinga-app07.domain.tld" }
object Endpoint "icinga-app08.domain.tld" { host = "icinga-app08.domain.tld" }
object Endpoint "icinga-app09.domain.tld" { host = "icinga-app09.domain.tld" }

object Zone "loadbalanced1" { endpoints = [ "icinga-app02.domain.tld", "icinga-app03.domain.tld" ] parent = "icinga-app01.domain.tld" }
object Zone "loadbalanced2" { endpoints = [ "icinga-app04.domain.tld", "icinga-app05.domain.tld" ] parent = "icinga-app01.domain.tld" }
object Zone "loadbalanced3" { endpoints = [ "icinga-app06.domain.tld", "icinga-app07.domain.tld" ] parent = "icinga-app01.domain.tld" }
object Zone "loadbalanced4" { endpoints = [ "icinga-app08.domain.tld", "icinga-app09.domain.tld" ] parent = "icinga-app01.domain.tld" }
```

Satellite zones.conf:

```
object Endpoint "icinga-app01.domain.tld" { }

object Zone "master" { endpoints = [ "icinga-app01.domain.tld" ]; }

object Endpoint NodeName { }

object Zone "loadbalanced4" { endpoints = [ NodeName ]; parent = "master"; }
```


dnsmichi commented 7 years ago

This could indicate a problem with your endpoint connections. If both endpoints end up in a split-brain scenario, they will each start to execute checks on their own.

You should also trace the check_source attribute for these checks, e.g. by connecting to the API event streams on the master and filtering for a selected service.
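Such a subscription could look like this (a minimal sketch; the ApiUser "root" with password "icinga", the queue name, and the concrete service name are assumptions):

```
# Subscribe to CheckResult events for one service and watch the check_source field
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
  'https://localhost:5665/v1/events?queue=debug&types=CheckResult&filter=event.service==%22ping4%22'
```

Every CheckResult event carries a check_source attribute naming the endpoint that actually executed the check.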

bardahlm commented 7 years ago

I include check_source as a tag when sending data to InfluxDB, as indicated in the graph ("Checker: ").

```
service_template = {
  measurement = "$service.check_command$"
  tags = {
    hostname = "$host.name$"
    service = "$service.name$"
    checker = "$service.check_source$"
  }
}
```
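For context, this template sits inside the InfluxdbWriter feature object; a minimal sketch with assumed connection values:

```
object InfluxdbWriter "influxdb" {
  host = "127.0.0.1"      // assumed InfluxDB host
  port = 8086             // assumed InfluxDB port
  database = "icinga2"    // assumed database name
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
      checker = "$service.check_source$"
    }
  }
}
```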

How can one see if the endpoints are in a split-brain scenario?

dnsmichi commented 7 years ago

Cluster health checks, for example.
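The built-in cluster check command reports which endpoints a node is connected to; a minimal sketch, assuming a Host object exists for the master:

```
// Alerts when any configured endpoint is disconnected from this node.
object Service "cluster" {
  check_command = "cluster"
  host_name = "icinga-app01.domain.tld"
}
```

Scheduling the same check inside each satellite zone would also reveal whether the two zone members actually see each other.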

bardahlm commented 7 years ago

Icinga 2 Cluster is running: Connected Endpoints: 8 (icinga-app02.domain.tld, icinga-app03.domain.tld, icinga-app04.domain.tld, icinga-app05.domain.tld, icinga-app07.domain.tld, icinga-app06.domain.tld, icinga-app08.domain.tld, icinga-app09.domain.tld).

I get an alarm immediately if I shut down one of the endpoints, so everything seems to be OK.

dnsmichi commented 7 years ago

You could use the methods explained here to analyse the object authority on all involved endpoints.

https://www.icinga.com/2016/08/11/analyse-icinga-2-problems-using-the-console-api/

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#late-check-results-in-distributed-environments
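In short: the console API lets you check, on each endpoint, whether that node considers itself authoritative for an object. A sketch, reusing the assumed API credentials from above; the paused attribute is true on the node that has ceded the object to its zone peer:

```
$ icinga2 console --connect 'https://root:icinga@localhost:5665/'
<1> => get_service("switch.domain.tld", "ping4").paused
false
```

If both zone members report paused = false for the same object at the same time, both are executing its checks, which would match the graph above.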

bardahlm commented 7 years ago

I ran the following command:

```
$ curl -H "Accept: application/json" -k -s -u $ICINGA2_API_USERNAME:$ICINGA2_API_PASSWORD -X POST \
  'https://localhost:5665/v1/events?queue=debugchecks&types=CheckResult&filter=match%28%22ping4*%22,event.service%29' \
  | grep switch.domain.tld
```

It returned the following CheckResult events (one JSON object per line):

```
{"check_result":{"active":true,"check_source":"icinga-app08.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986427.3915150166,"execution_start":1505986418.3382890224,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 8.78 ms","performance_data":["rta=8.780000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986427.3915560246,"schedule_start":1505986418.3380548954,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986427.4008340836,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app09.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986557.3339450359,"execution_start":1505986548.2796299458,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 11.55 ms","performance_data":["rta=11.554000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986557.3339819908,"schedule_start":1505986548.2794408798,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986557.3372840881,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app08.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986728.3194990158,"execution_start":1505986719.2618420124,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 11.60 ms","performance_data":["rta=11.603000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986728.3195469379,"schedule_start":1505986719.2615737915,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986728.3284609318,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app09.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986862.7052030563,"execution_start":1505986853.6542370319,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 8.32 ms","performance_data":["rta=8.316000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986862.7052431107,"schedule_start":1505986853.6540000439,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986862.7090001106,"type":"CheckResult"}
```

The checks are run at the following times:

```
endpoint       execution_start         relative time (s)
icinga-app08   1505986418.3382890224     0
icinga-app09   1505986548.2796299458   130
icinga-app08   1505986719.2618420124   301
icinga-app09   1505986853.6542370319   435
```

Each endpoint re-runs the check roughly every 300 seconds, but offset from its peer, so the service is effectively checked about every 150 seconds.

dnsmichi commented 6 years ago

To me it sounds like the two zone endpoints are not connected to each other. Furthermore, you are using 2.7.0, which is no longer supported. Please upgrade to 2.8.0 and ensure that your cluster is intact.

Cheers, Michael

bardahlm commented 6 years ago

How are zone endpoints supposed to be connected to each other? How do I verify this?

dnsmichi commented 6 years ago

E.g. by setting up health checks; there is more in the docs. Or dive into the troubleshooting sections for the cluster and for check results.
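Connectivity can also be queried directly from the REST API (a sketch, reusing the assumed credentials from above; exact field names can vary between versions):

```
# Run on each zone member; the output lists connected and disconnected endpoints
curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener'
```

On a healthy HA zone, each member should list its zone peer among the connected endpoints.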

bardahlm commented 6 years ago

I finally figured out what I had done wrong. The config on the satellite nodes was missing a reference to the other node.

I added/changed the configuration on icinga-app08 as follows (with similar changes on icinga-app09):

object Endpoint "icinga-app08.domain.tld" {
}

object Endpoint "icinga-app09.domain.tld" {
  host = "icinga-app09.domain.tld"
}

object Zone "loadbalanced4" {
  endpoints = [ "icinga-app08.domain.tld", "icinga-app09.domain.tld" ];
  parent = "master";
}

Now checks seem to execute as they should.
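With both endpoints listed in the zone and able to reach each other, the two nodes split the checkable objects between them; the console sketch above should then show paused = true on exactly one of the two nodes for any given object.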