Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

[dev.icinga.com #11041] Status update storm on three node cluster when using "command_endpoint" on services #3879

Closed icinga-migration closed 7 years ago

icinga-migration commented 8 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11041

Created by carljohnston on 2016-01-27 06:13:36 +00:00

Assignee: (none)
Status: Closed (closed on 2016-12-07 21:53:36 +00:00)
Target Version: (none)
Last Update: 2016-12-07 21:53:36 +00:00 (in Redmine)

Icinga Version: 2.4.1-1
Backport?: Not yet backported
Include in Changelog: 1

Hi Devs,

When configuring a three-node HA master zone (which will eventually have two-node HA satellite zones attached), I have come across an issue where API status updates are stormed (hundreds per second) to all three nodes.

My setup is:

What I've found is that:

Debug log has countless entries that look like:

[2016-01-27 01:07:43 -0500] notice/JsonRpcConnection: Received 'event::CheckResult' message from 'master2'
[2016-01-27 01:07:43 -0500] debug/Checkable: command_endpoint found for object 'master1!icinga', setting master1 as check_source.
[2016-01-27 01:07:43 -0500] debug/DbEvents: add checkable check history for 'master1!icinga'
[2016-01-27 01:07:43 -0500] notice/ApiListener: Relaying 'event::CheckResult' message
[2016-01-27 01:07:43 -0500] notice/ApiListener: Sending message to 'master3'
[2016-01-27 01:07:43 -0500] debug/DbObject: Endpoint node: 'master1' status update for 'master1!icinga'
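For anyone trying to reproduce this: the entries above come from the debug log. A minimal sketch for enabling it on a standard packaged install managed by systemd (the default log path is an assumption, adjust as needed):

# enable the debuglog feature and restart so the debug log is written
icinga2 feature enable debuglog
systemctl restart icinga2

# follow the relayed check result messages in real time
tail -f /var/log/icinga2/debug.log | grep 'event::CheckResult'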

This appears to affect:

Thanks for any help you can provide,

Carl

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-29 09:23:35 +00:00

That is probably the same issue we've been debugging and fixing at a customer lately. Check #11014 for details. Can you test the snapshot packages please?

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-01-29 09:23:43 +00:00

icinga-migration commented 8 years ago

Updated by carljohnston on 2016-02-01 22:48:26 +00:00

dnsmichi wrote:

That is probably the same issue we've been debugging and fixing at a customer lately. Check #11014 for details. Can you test the snapshot packages please?

Hi dnsmichi,

Thanks for taking the time to look at this.

I tried a few snapshots yesterday - these were snapshot201601292015 and earlier - which came along with some API corruption bugs. They seemed to fix the update storm, but broke core functionality: the non-zone-master cluster members didn't appear to be connected to the cluster correctly, and command_endpoint-enabled checks returned "Unknown" because those members weren't contactable.

I've tried today's snapshots (snapshot201602012014; version v2.4.1-159-gec050dd): the storm still exists with three nodes connected, but core functionality is restored.

Can I provide you with any other information to assist with resolving this?

Thank you,

Carl

icinga-migration commented 8 years ago

Updated by carljohnston on 2016-02-01 23:25:43 +00:00

I've tried today's snapshots (snapshot201602012014; version v2.4.1-159-gec050dd): the storm still exists with three nodes connected, but core functionality is restored.

Hi dnsmichi,

I've just tried a few more snapshots; snapshot201602011352 does not have the update storm, but core functionality is broken (endpoints appear disconnected in icingaweb2); snapshot201602011404 re-introduces the update storm but fixes the disconnected endpoints.

Thank you,

Carl

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-02-24 20:12:08 +00:00

Does the issue still exist with 2.4.3?

icinga-migration commented 8 years ago

Updated by karambol on 2016-02-25 11:17:40 +00:00

I have had this issue since version 2.4.2, and it is still present on 2.4.3.

icinga-migration commented 8 years ago

Updated by mnardin on 2016-02-26 14:57:51 +00:00

Hi, we are experiencing similar behavior on our two-master setup plus two zones (one satellite per zone), running icinga v2.4.2. I've noticed that the pnp4nagios instance on the master was complaining about an "illegal attempt to update". See the following /var/log/pnp4nagios/perfdata.log snippet:

2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/lxload_load_1_min.rrd 1456497904:0.16
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497904.000000 when last update time is 1456497904.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/lxload_load_5_min.rrd 1456497904:0.09
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497904.000000 when last update time is 1456497904.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/lxload_load_15_min.rrd 1456497904:0.08
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497904.000000 when last update time is 1456497904.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_proc.rrd 1456497920:1
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_mem.rrd 1456497920:5.1
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_cpu.rrd 1456497920:6.4
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_rss.rrd 1456497920:201724.0
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam01-p/check-icinga2-proc_vsz.rrd 1456497920:1273904.0
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/memory-linux-icinga_ram_used.rrd 1456497920:1332304
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/memory-linux-icinga_swap_used.rrd 1456497920:17116
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497920.000000 when last update time is 1456497920.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/grafcrypbal01-t/memory-linux_ram_used.rrd 1456497926:561956
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497926.000000 when last update time is 1456497926.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/grafcrypbal01-t/memory-linux_swap_used.rrd 1456497926:0
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497926.000000 when last update time is 1456497926.000000 (minimum one second step)
2016-02-26 15:45:39 [31793] [0] RRDs::update /var/lib/pnp4nagios/icingam02-p/CPU_cpu_prct_used.rrd 1456497929:6.5
2016-02-26 15:45:39 [31793] [0] RRDs::update ERROR rrdcached: illegal attempt to update using time 1456497929.000000 when last update time is 1456497929.000000 (minimum one second step)

Most of the RRDs affected by this problem belong to service objects that have the command_endpoint property set to a specific icinga2 instance.
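As a rough way to quantify the duplicates, here is a small sketch against the perfdata.log format shown above (field positions and the log path are assumptions taken from that snippet):

# count how many "illegal attempt to update" errors rrdcached reported
grep -c 'illegal attempt to update' /var/log/pnp4nagios/perfdata.log

# count duplicate updates per RRD file and timestamp
# ($6 is the rrd path, $7 is "timestamp:value" in the update lines above)
grep 'RRDs::update /' /var/log/pnp4nagios/perfdata.log \
  | awk '{ split($7, ts, ":"); print $6, ts[1] }' \
  | sort | uniq -c | sort -rn | head

Anything with a count greater than 1 was written to the same RRD with the same timestamp, which is exactly what rrdcached complains about.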

Hope this helps.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-02-26 15:03:36 +00:00

Please add the following details.

icinga-migration commented 8 years ago

Updated by mnardin on 2016-02-26 15:40:54 +00:00

I've used ".fqdn" to stand in for our internal FQDN. I don't know if I can share stuff like this, but I think it shouldn't be a problem in this case. I've also attached all the object attributes for the following example:

object Host "icingam02-p.fqdn" {
  import "standard-host"

  display_name = "icingam02-p"
  address = "10.139.163.131"
  command_endpoint = "icingam01-p.fqdn"

  vars.os = "linux"
  vars.zone = "master"
  vars.checker = "icingam01-p.fqdn"
}

apply Service "CPU" {
  import "standard-service"
  display_name = "CPU State"
  check_command = "snmp-load"
  vars.check_protocol = "snmp"
  check_interval = 1m
  retry_interval = 30s

  if (host.vars.checker) {
    command_endpoint = host.vars.checker
  }

  assign where host.vars.os == "windows" || host.vars.os == "linux"
}

All zones.conf:

[root@icingam01-p ~]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
        host = "icingam01-p.fqdn"
}

object Endpoint "icingam02-p.fqdn" {
        host = "icingam02-p.fqdn"
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-p.fqdn" {
        host = "icingas01-p.fqdn"
}

object Zone "RZ" {
        endpoints = [ "icingas01-p.fqdn" ]
        parent = "master"
}

object Endpoint "icingas01-d.fqdn" {
        host = "icingas01-d.fqdn"
}

object Zone "DR" {
        endpoints = [ "icingas01-d.fqdn" ]
        parent = "master"
}
----------------------------------------
[root@icingam02-p ~]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
        host = "icingam01-p.fqdn"
}

object Endpoint "icingam02-p.fqdn" {
        host = "icingam02-p.fqdn"
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-p.fqdn" {
        host = "icingas01-p.fqdn"
}

object Zone "RZ" {
        endpoints = [ "icingas01-p.fqdn" ]
        parent = "master"
}

object Endpoint "icingas01-d.fqdn" {
        host = "icingas01-d.fqdn"
}

object Zone "DR" {
        endpoints = [ "icingas01-d.fqdn" ]
        parent = "master"
}
----------------------------------------
[root@icingas01-p zones]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
}

object Endpoint "icingam02-p.fqdn" {
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-p.fqdn" {
        host = "icingas01-p.fqdn"
}

object Zone "RZ" {
        endpoints = [ "icingas01-p.fqdn" ]
        parent = "master"
}
----------------------------------------
[root@icingas01-d ~]# cat /etc/icinga2/zones.conf
object Zone "director-global" {
        global = true
}

object Zone "ris-global" {
        global = true
}

object Endpoint "icingam01-p.fqdn" {
}

object Endpoint "icingam02-p.fqdn" {
}

object Zone "master" {
        endpoints = [ "icingam01-p.fqdn", "icingam02-p.fqdn" ]
}

object Endpoint "icingas01-d.fqdn" {
        host = "icingas01-d.fqdn"
}

object Zone "DR" {
        endpoints = [ "icingas01-d.fqdn" ]
        parent = "master"
}

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-02-26 18:02:42 +00:00

Hm, so you are trying to pin the check inside the same zone on a specific host (master01 should check master02 via command_endpoint and vice versa).

Since you're already using the Director and the API, can you connect to the /v1/events endpoint, add a queue with type CheckResult and a filter for your host/service name? I would guess there are multiple check results involved, causing these update loops.
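Something along these lines should work (a sketch, assuming the default API port 5665 and an API user root/icinga; the queue name is arbitrary and the host name is taken from the objects posted above):

# stream CheckResult events for the pinned host; each event is printed as one JSON line,
# duplicates show up as nearly identical events with slightly different timestamps
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
  'https://localhost:5665/v1/events?queue=dupes&types=CheckResult&filter=event.host==%22icingam02-p.fqdn%22'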

icinga-migration commented 8 years ago

Updated by mnardin on 2016-02-29 15:25:24 +00:00

I was able to get some data. These are the events seen through the API on both masters. I get two events per check execution on the first master, icingam01-p, which the check is pinned to:

{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758891.0741410255,"execution_start":1456758891.0006890297,"exit_status":0.0,"output":"2 CPU, average load 8.5% < 85% : OK ","performance_data":["cpu_prct_used=8.5%;85;95"],"schedule_end":1456758891.0742049217,"schedule_start":1456758951.0,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758891.0743079185,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758891.0741410255,"execution_start":1456758891.0006890297,"exit_status":0.0,"output":"2 CPU, average load 8.5% < 85% : OK ","performance_data":["cpu_prct_used=8.5%;85;95"],"schedule_end":1456758891.0742049217,"schedule_start":1456758951.0,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758891.1166100502,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758951.0681591034,"execution_start":1456758951.0007469654,"exit_status":0.0,"output":"2 CPU, average load 6.5% < 85% : OK ","performance_data":["cpu_prct_used=6.5%;85;95"],"schedule_end":1456758951.0682110786,"schedule_start":1456759011.0000002384,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758951.0683379173,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758951.0681591034,"execution_start":1456758951.0007469654,"exit_status":0.0,"output":"2 CPU, average load 6.5% < 85% : OK ","performance_data":["cpu_prct_used=6.5%;85;95"],"schedule_end":1456758951.0682110786,"schedule_start":1456759011.0000002384,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758951.0710589886,"type":"CheckResult"}

icingam02-p:

{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758891.0741410255,"execution_start":1456758891.0006890297,"exit_status":0.0,"output":"2 CPU, average load 8.5% < 85% : OK ","performance_data":["cpu_prct_used=8.5%;85;95"],"schedule_end":1456758891.0742049217,"schedule_start":1456758951.0,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758891.0760920048,"type":"CheckResult"}
{"check_result":{"active":true,"check_source":"icingam01-p.fqdn","command":["/opt/monitoring-plugins/manubulon/plugins/check_snmp_load.pl","-C","public","-H","10.139.163.131","-T","stand","-c","95","-f","-t","5","-w","85"],"execution_end":1456758951.0681591034,"execution_start":1456758951.0007469654,"exit_status":0.0,"output":"2 CPU, average load 6.5% < 85% : OK ","performance_data":["cpu_prct_used=6.5%;85;95"],"schedule_end":1456758951.0682110786,"schedule_start":1456759011.0000002384,"state":0.0,"type":"CheckResult","vars_after":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0},"vars_before":{"attempt":1.0,"reachable":true,"state":0.0,"state_type":1.0}},"host":"icingam02-p.fqdn","service":"CPU","timestamp":1456758951.0696389675,"type":"CheckResult"}

icinga-migration commented 8 years ago

Updated by mnardin on 2016-02-29 16:58:45 +00:00

I was trying to gather data about another problem we are experiencing right now: high load peaks on the satellites after pushing a config. In /var/log/pnp4nagios/perfdata.log I've noticed that on every config reload both masters receive two check results (hostalive) for almost every host in our config. This could be the culprit for our load problems.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-04 15:54:18 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-18 11:19:46 +00:00

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-03-18 14:27:42 +00:00

There was a bug in 2.4.2 which caused multiple check updates. Can you re-test with 2.4.4 to see whether the issue you're having still persists, please?

icinga-migration commented 8 years ago

Updated by mnardin on 2016-04-05 17:11:14 +00:00

The issue is still present with 2.4.4. I've tested with the same object as above.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-04-06 16:01:36 +00:00

Ok, we'll have to look into that. Thanks for your feedback.

Kind regards, Michael

icinga-migration commented 8 years ago

Updated by gbeutner on 2016-07-25 07:45:59 +00:00

Can you please test whether this problem still occurs with the current snapshot packages? As far as I can see this should have been fixed as part of #12179.

icinga-migration commented 8 years ago

Updated by gbeutner on 2016-07-25 07:46:16 +00:00

icinga-migration commented 8 years ago

Updated by mnardin on 2016-09-14 10:01:56 +00:00

This problem seems to be fixed in 2.5.4.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2016-11-09 14:52:18 +00:00

icinga-migration commented 7 years ago

Updated by mfriedrich on 2016-12-07 21:53:36 +00:00

Cool thanks.