Closed (bardahlm closed this issue 6 years ago)
This could indicate a problem with your endpoint connections. If both endpoints remain in a split-brain scenario, they would start to execute checks on their own.
You should also trace the check_source attribute for these kinds of checks, e.g. by connecting to the API event streams on the master and filtering for a selected service.
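As a rough illustration of the event stream suggestion, the sketch below only builds the request URL for such a filtered stream. The `/v1/events` path and the `queue`, `types` and `filter` parameters are the ones used in the curl command further down in this thread; the helper name `event_stream_url` is made up for this example, and the host, port and queue name are simply copied from that command.

```python
from urllib.parse import quote, unquote


def event_stream_url(base_url, queue, service_pattern, types="CheckResult"):
    """Build an event stream URL that filters check results by service name.

    Hypothetical helper: it only assembles the query string, it does not
    open a connection to the API.
    """
    # Same filter expression as in the curl command quoted later in the thread.
    filter_expr = 'match("{}",event.service)'.format(service_pattern)
    return "{}/v1/events?queue={}&types={}&filter={}".format(
        base_url.rstrip("/"),
        quote(queue, safe=""),
        quote(types, safe=""),
        quote(filter_expr, safe=""),
    )


url = event_stream_url("https://localhost:5665", "debugchecks", "ping4*")
print(url)
# Decoding the filter again shows the expression that is evaluated server-side.
print(unquote(url.split("filter=", 1)[1]))
```

The resulting URL can be passed to curl with `-X POST` and API credentials, as in the command shown later. Each line of the response is one JSON event that carries `check_result.check_source`.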
I include check_source as a tag when sending data to influxdb, as indicated in the graph ("Checker: ").
service_template = {
  measurement = "$service.check_command$"
  tags = {
    hostname = "$host.name$"
    service = "$service.name$"
    checker = "$service.check_source$"
  }
}
How can one see if the endpoints are in a split brain scenario?
Cluster health checks for example.
Icinga 2 Cluster is running: Connected Endpoints: 8 (icinga-app02.domain.tld, icinga-app03.domain.tld, icinga-app04.domain.tld, icinga-app05.domain.tld, icinga-app07.domain.tld, icinga-app06.domain.tld, icinga-app08.domain.tld, icinga-app09.domain.tld).
I get an alarm immediately if I shut down one of the endpoints, so everything seems to be OK.
You could use the methods explained here to analyse the object authority on all involved endpoints.
https://www.icinga.com/2016/08/11/analyse-icinga-2-problems-using-the-console-api/
I ran the following command:
`$ curl -H "Accept: application/json" -k -s -u $ICINGA2_API_USERNAME:$ICINGA2_API_PASSWORD -X POST 'https://localhost:5665/v1/events?queue=debugchecks&types=CheckResult&filter=match%28%22ping4*%22,event.service%29' | grep switch.domain.tld
{"check_result":{"active":true,"check_source":"icinga-app08.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986427.3915150166,"execution_start":1505986418.3382890224,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 8.78 ms","performance_data":["rta=8.780000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986427.3915560246,"schedule_start":1505986418.3380548954,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986427.4008340836,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app09.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986557.3339450359,"execution_start":1505986548.2796299458,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 11.55 ms","performance_data":["rta=11.554000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986557.3339819908,"schedule_start":1505986548.2794408798,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986557.3372840881,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app08.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986728.3194990158,"execution_start":1505986719.2618420124,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 11.60 ms","performance_data":["rta=11.603000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986728.3195469379,"schedule_start":1505986719.2615737915,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986728.3284609318,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icinga-app09.domain.tld","command":["/usr/lib64/nagios/plugins/check_ping","-4","-H","10.220.241.31","-c","500,25%","-p","10","-w","300,20%"],"execution_end":1505986862.7052030563,"execution_start":1505986853.6542370319,"exit_status":0.0,"output":"PING OK - Packet loss = 0%, RTA = 8.32 ms","performance_data":["rta=8.316000ms;300.000000;500.000000;0.000000","pl=0%;20;25;0"],"schedule_end":1505986862.7052431107,"schedule_start":1505986853.6540000439,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"switch.domain.tld","service":"ping4","timestamp":1505986862.7090001106,"type":"CheckResult"}
`
The checks are run at the following times:
| endpoint | execution_start | relative time (s) |
|---|---|---|
| icinga-app08 | 1505986418.3382890224 | 0 |
| icinga-app09 | 1505986548.2796299458 | 130 |
| icinga-app08 | 1505986719.2618420124 | 301 |
| icinga-app09 | 1505986853.6542370319 | 435 |
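The relative times listed above can be reproduced from the event stream output with a few lines of Python. This is only a sketch: `summarize_checks` is a name made up for this example, and the sample events are trimmed to the fields the function reads, with values copied from the output above.

```python
import json
from collections import Counter


def summarize_checks(event_lines):
    """Summarize which endpoint ran each check and when.

    Returns (rows, sources). Each row is a tuple of
    (check_source, execution_start, seconds since first check,
     seconds since previous check or None for the first row).
    """
    events = [json.loads(line) for line in event_lines if line.strip()]
    events.sort(key=lambda e: e["check_result"]["execution_start"])
    rows = []
    first = prev = None
    for event in events:
        result = event["check_result"]
        start = result["execution_start"]
        if first is None:
            first = start
        gap = None if prev is None else start - prev
        rows.append((result["check_source"], start, start - first, gap))
        prev = start
    sources = Counter(row[0] for row in rows)
    return rows, sources


# Trimmed events: only check_source and execution_start are kept.
sample = [
    '{"check_result":{"check_source":"icinga-app08.domain.tld","execution_start":1505986418.3382890224},"host":"switch.domain.tld","service":"ping4"}',
    '{"check_result":{"check_source":"icinga-app09.domain.tld","execution_start":1505986548.2796299458},"host":"switch.domain.tld","service":"ping4"}',
    '{"check_result":{"check_source":"icinga-app08.domain.tld","execution_start":1505986719.2618420124},"host":"switch.domain.tld","service":"ping4"}',
    '{"check_result":{"check_source":"icinga-app09.domain.tld","execution_start":1505986853.6542370319},"host":"switch.domain.tld","service":"ping4"}',
]

rows, sources = summarize_checks(sample)
for source, start, relative, gap in rows:
    print(source, round(relative), None if gap is None else round(gap))
print(dict(sources))
```

For this sample the gaps between consecutive checks come out at roughly 130, 171 and 134 seconds, and the two endpoints take turns, which is well below the configured 300 seconds.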
To me it sounds like the two zone endpoints are not connected to each other. Furthermore, you are using 2.7.0, which is no longer supported. Please upgrade to 2.8.0 and ensure that your cluster is intact.
Cheers, Michael
How are zone endpoints supposed to be connected to each other? How do I verify this?
E.g. by setting up health checks; there is more on this in the docs. Or dive into the troubleshooting sections for cluster and check results.
I finally figured out what I had done wrong. The config on the satellite nodes was missing a reference to the other node in the zone.
I changed the configuration on icinga-app08 as follows (with similar changes on icinga-app09):
object Endpoint "icinga-app08.domain.tld" {
}

object Endpoint "icinga-app09.domain.tld" {
  host = "icinga-app09.domain.tld"
}

object Zone "loadbalanced4" {
  endpoints = [ "icinga-app08.domain.tld", "icinga-app09.domain.tld" ];
  parent = "master";
}
Now checks seem to execute as they should.
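The mistake above (each satellite only listing itself in its zone) is the kind of thing a small consistency check can catch. The sketch below is not an Icinga tool. It assumes the zone membership of each node has already been transcribed by hand into dictionaries, for example from `icinga2 object list --type Zone`, and that `NodeName` resolves to the node's own FQDN. The function name `missing_zone_members` is hypothetical.

```python
def missing_zone_members(reference_zones, node_views):
    """Report zone members that a node's own zones.conf does not list.

    reference_zones: zone name -> endpoint names, as seen by the master.
    node_views: node name -> (zone name -> endpoint names), as configured
                locally on that node.
    """
    problems = {}
    for node, local_zones in node_views.items():
        for zone, expected in reference_zones.items():
            if node not in expected:
                continue  # only check zones the node is a member of
            missing = sorted(set(expected) - set(local_zones.get(zone, [])))
            if missing:
                problems.setdefault(node, {})[zone] = missing
    return problems


APP08 = "icinga-app08.domain.tld"
APP09 = "icinga-app09.domain.tld"

# Zone as defined on the master.
master_view = {"loadbalanced4": [APP08, APP09]}

# Satellite configs before the fix: endpoints = [ NodeName ] on each node.
before_fix = {
    APP08: {"loadbalanced4": [APP08]},
    APP09: {"loadbalanced4": [APP09]},
}

# Satellite configs after the fix: both endpoints listed on each node.
after_fix = {
    APP08: {"loadbalanced4": [APP08, APP09]},
    APP09: {"loadbalanced4": [APP08, APP09]},
}

print(missing_zone_members(master_view, before_fix))
print(missing_zone_members(master_view, after_fix))
```

With the pre-fix data each satellite is reported as missing its partner, which matches the symptom in this thread: neither node knew about the other, so both scheduled every check.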
Expected Behavior
Checks should run once every check_interval on one of the nodes in a zone.
Current Behavior
Checks seem to run every check_interval on every node in the zone, which in effect halves the check_interval when the zone has two nodes.
Notice the graph below:
The colour indicates which node performed the check. The service has a check_interval of 300 seconds (5 minutes). As can be seen, the nodes alternate checking the service, resulting in an effective check interval of around 150 seconds.
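The arithmetic behind the "around 150 seconds" figure can be written out as follows. This is a simplification that assumes every node in the zone schedules the check on its own and that their schedules are spread evenly. The observed gaps are the rounded differences between the execution_start values posted in the comments of this issue.

```python
def effective_interval(check_interval, independent_nodes):
    """Average time between checks if every node schedules the check itself."""
    return check_interval / independent_nodes


check_interval = 300   # seconds, as configured for the service
nodes_in_zone = 2      # icinga-app08 and icinga-app09

expected = effective_interval(check_interval, nodes_in_zone)
print(expected)        # 150.0

# Rounded gaps between consecutive checks taken from the event stream output.
observed_gaps = [130, 171, 134]
observed_mean = sum(observed_gaps) / len(observed_gaps)
print(observed_mean)   # 145.0

# Check executions per hour: intended versus actual with two independent nodes.
print(3600 / check_interval, 3600 / expected)
```

The same factor applies to the load: with two nodes per zone running every check independently, the number of check executions and performance data points doubles.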
Context
We have four zones, each with two nodes, checking around 5000 hosts and around 40000 services. The current behaviour means we are using twice the CPU and doubling the amount of performance data.
Your Environment
- Version used (`icinga2 --version`): r2.7.0-1
- Operating System and version: rhel 7.4
- Enabled features (`icinga2 feature list`): api checker command ido-mysql influxdb livestatus mainlog notification
- Icinga Web 2 version and modules (System - About): 2.4.1
- Config validation (`icinga2 daemon -C`):
information/cli: Icinga application loader (version: r2.7.0-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
information/ApiListener: My API identity:
information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 24997, rate: 2519.62/s (151177/min 151177/5min 151177/15min);
information/WorkQueue: #7 (InfluxdbWriter, influxdb) items: 0, rate: 0/s (0/min 0/5min 0/15min);
information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate: 0/s (0/min 0/5min 0/15min);
warning/ApplyRule: Apply rule 'backup-downtime' (in /etc/icinga2/zones.d/global-templates/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
warning/ApplyRule: Apply rule 'ComputeBlade-' (in /etc/icinga2/zones.d/global-templates/services/ucs_compute_blade.conf: 1:0-1:89) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'ComputeRackUnit-' (in /etc/icinga2/zones.d/global-templates/services/ucs_compute_rackunit.conf: 1:0-1:98) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'EquipmentChassis-' (in /etc/icinga2/zones.d/global-templates/services/ucs_equipment_chassis.conf: 1:0-1:101) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'EquipmentIOCard-' (in /etc/icinga2/zones.d/global-templates/services/ucs_equipment_iocard.conf: 1:0-1:98) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'EquipmentPsu-' (in /etc/icinga2/zones.d/global-templates/services/ucs_equipment_psu.conf: 1:0-1:97) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'NetworkElement-' (in /etc/icinga2/zones.d/global-templates/services/ucs_network_element.conf: 1:0-1:95) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'ProcessorUnit-' (in /etc/icinga2/zones.d/global-templates/services/ucs_processor_unit.conf: 1:0-1:90) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'ProcessorUnit-' (in /etc/icinga2/zones.d/global-templates/services/ucs_processor_unit.conf: 10:1-10:97) for type 'Service' does not match anywhere!
information/ConfigItem: Instantiated 5 ApiUsers.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 6 Zones.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 9 Endpoints.
information/ConfigItem: Instantiated 1 LivestatusListener.
information/ConfigItem: Instantiated 4 UserGroups.
information/ConfigItem: Instantiated 16003 Notifications.
information/ConfigItem: Instantiated 6 NotificationCommands.
information/ConfigItem: Instantiated 118 CheckCommands.
information/ConfigItem: Instantiated 65 Downtimes.
information/ConfigItem: Instantiated 199 HostGroups.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 1 EventCommand.
information/ConfigItem: Instantiated 5039 Hosts.
information/ConfigItem: Instantiated 23 Comments.
information/ConfigItem: Instantiated 115 Dependencies.
information/ConfigItem: Instantiated 18 Users.
information/ConfigItem: Instantiated 5 TimePeriods.
information/ConfigItem: Instantiated 40842 Services.
information/ConfigItem: Instantiated 10 ServiceGroups.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 IdoMysqlConnection.
information/ConfigItem: Instantiated 1 InfluxdbWriter.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
- If you run multiple Icinga 2 instances, the `zones.conf` file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes.
Master zones.conf:
object Endpoint "icinga-app01.domain.tld" { host = "icinga-app01.domain.tld" }
object Zone "icinga-app01.domain.tld" { endpoints = [ "icinga-app01.domain.tld" ] }
object Zone "global-templates" { global = true }
object Endpoint "icinga-app02.domain.tld" { host = "icinga-app02.domain.tld" }
object Endpoint "icinga-app03.domain.tld" { host = "icinga-app03.domain.tld" }
object Endpoint "icinga-app04.domain.tld" { host = "icinga-app04.domain.tld" }
object Endpoint "icinga-app05.domain.tld" { host = "icinga-app05.domain.tld" }
object Endpoint "icinga-app06.domain.tld" { host = "icinga-app06.domain.tld" }
object Endpoint "icinga-app07.domain.tld" { host = "icinga-app07.domain.tld" }
object Endpoint "icinga-app08.domain.tld" { host = "icinga-app08.domain.tld" }
object Endpoint "icinga-app09.domain.tld" { host = "icinga-app09.domain.tld" }
object Zone "loadbalanced1" {
  endpoints = [ "icinga-app02.domain.tld", "icinga-app03.domain.tld" ]
  parent = "icinga-app01.domain.tld"
}
object Zone "loadbalanced2" {
  endpoints = [ "icinga-app04.domain.tld", "icinga-app05.domain.tld" ]
  parent = "icinga-app01.domain.tld"
}
object Zone "loadbalanced3" {
  endpoints = [ "icinga-app06.domain.tld", "icinga-app07.domain.tld" ]
  parent = "icinga-app01.domain.tld"
}
object Zone "loadbalanced4" {
  endpoints = [ "icinga-app08.domain.tld", "icinga-app09.domain.tld" ]
  parent = "icinga-app01.domain.tld"
}
Satellite zones.conf:
object Endpoint "icinga-app01.domain.tld" { }
object Zone "master" { endpoints = [ "icinga-app01.domain.tld" ]; }
object Endpoint NodeName { }
object Zone "loadbalanced4" { endpoints = [ NodeName ]; parent = "master"; }
/*