Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.02k stars 577 forks source link

Satellite in HA mode dies when master is updated #5021

Closed deric closed 7 years ago

deric commented 7 years ago

I have a HA setup with 2 masters and 2 satellites. After updating configuration master, one satellite dies other stays in running state.

[2017-02-22 14:12:39 +0000] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/system.conf'
[2017-02-22 14:12:39 +0000] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/global-templates/_etc/templates.conf'
[2017-02-22 14:12:39 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp
[2017-02-22 14:12:39 +0000] information/ApiListener: Restarting after configuration change.
[2017-02-22 14:12:39 +0000] notice/ThreadPool: Pool #1: Pending tasks: 0; Average latency: 0ms; Threads: 4; Pool utilization: 0.00131848%
[2017-02-22 14:12:41 +0000] notice/JsonRpcConnection: Received 'event::Heartbeat' message from 'master01.example.net'
[2017-02-22 14:12:41 +0000] notice/WorkQueue: Spawning WorkQueue threads for 'JsonRpcConnection, #2'
[2017-02-22 14:12:41 +0000] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from 'agent01.example.net'
[2017-02-22 14:12:41 +0000] information/Application: Got reload command: Starting new instance.
[2017-02-22 14:12:41 +0000] notice/Process: Running command '/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '-e' '/var/log/icinga2/error.log' '--reload-internal' '20521': PID 20593
[2017-02-22 14:12:42 +0000] information/Application: Received request to shut down.
[2017-02-22 14:12:42 +0000] warning/TlsStream: TLS stream was disconnected.
[2017-02-22 14:12:42 +0000] warning/JsonRpcConnection: API client disconnected for identity 'dev01.example.net'
[2017-02-22 14:12:42 +0000] warning/ApiListener: Removing API client for endpoint 'dev01.example.net'. 0 API clients left.
[2017-02-22 14:12:42 +0000] information/Application: Shutting down...
[2017-02-22 14:12:42 +0000] information/CheckerComponent: Checker stopped.

parent process becomes a zombie:

nagios   20521 81.2  0.0      0     0 ?        Zsl  14:12   0:13 [icinga2] <defunct>
nagios   20541  0.0  0.0 499864  1436 ?        S    14:12   0:00  \_ /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/error
nagios   20593  1.0  0.0 1065576 16408 ?       SNsl 14:12   0:00      \_ /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/e
nagios   20606  0.0  0.0 510612  1220 ?        SN   14:12   0:00          \_ /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icing

from systemd logs:

icinga2.service: main process exited, code=killed, status=11/SEGV
Unit icinga2.service entered failed state.

I'm running icinga satellite in a LXC container with memory limit of 2G (I hope that should be enough). The system runs on Debian 8 (Jessie) with latest icinga2 version (2.6.2-1~debmon8+1). On the host I've notice segfaults in the statellite's container:

icinga2[6004]: segfault at 7ff5e7d2ff90 ip 00002b9fc3b4d03f sp 00007ff5e7d2ff80 error 6 in libicinga.so.2.6.2[2b9fc3a6b000+314000]
dnsmichi commented 7 years ago

Can you please check /var/log/icinga2 if there's any crash log available? I'd love to see the stack trace the SIGSEGV is causing.

If you are able to reliably reproduce the issue (e.g. by issuing a master config reload with new configuration applied) please add more configuration details (zones.conf, output of icinga2 --version and icinga2 daemon -C on a master and the affected satellite).

deric commented 7 years ago

There's no crash log. Standard log /var/log/icinga2/icinga2.log ends always with following:

[2017-02-23 09:07:40 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp
[2017-02-23 09:07:40 +0000] information/ApiListener: Restarting after configuration change.
[2017-02-23 09:07:40 +0000] information/ApiListener: Finished sending replay log for endpoint 'agent01.example.net'.
[2017-02-23 09:07:43 +0000] information/Application: Got reload command: Starting new instance.
[2017-02-23 09:07:43 +0000] information/Application: Received request to shut down.
[2017-02-23 09:07:43 +0000] information/Application: Shutting down...
[2017-02-23 09:07:43 +0000] information/CheckerComponent: Checker stopped.

icinga2 --version:

icinga2 - The Icinga 2 network monitoring daemon (version: r2.6.2-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Debian GNU/Linux
  Platform version: 8 (jessie)
  Kernel: Linux
  Kernel version: 3.16.0-4-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.9.2
  Build host: smithers

The segfault above was caused by insufficient memory. I've realized that one container was configured only for 1GB, which wasn't enough for configuration reload. Now both satellites are configured with 4GB and the issue seems to be easily reproducible.

For example, when masters are down, satellites are still running. However when master comes online, satellite crashes:

[2017-02-23 09:45:39 +0000] information/ApiListener: Restarting after configuration change.
[2017-02-23 09:45:39 +0000] information/ApiListener: Replayed 3327 messages.
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replayed 3327 messages.
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1487843139
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1487843140
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replayed 0 messages.
[2017-02-23 09:45:39 +0000] information/ApiListener: Finished sending replay log for endpoint 'icinga01.example.net'.
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Executing check for 'w03.example.net!ping4'
[2017-02-23 09:45:39 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:39 +0000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'satellite02.example.net'
[2017-02-23 09:45:39 +0000] notice/Process: Running command '/usr/lib/nagios/plugins/check_ping' '-H' '136.243.73.41' '-c' '200,15%' '-w' '100,5%': PID 11769
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Check finished for object 'w03.example.net!ping4'
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Executing check for 'stats01.example.net!ping4'
[2017-02-23 09:45:39 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:39 +0000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'satellite02.example.net'
[2017-02-23 09:45:39 +0000] notice/Process: Running command '/usr/lib/nagios/plugins/check_ping' '-H' '138.201.20.69' '-c' '200,15%' '-w' '100,5%': PID 11771
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Check finished for object 'stats01.example.net!ping4'
[2017-02-23 09:45:39 +0000] information/Application: Got reload command: Starting new instance.
[2017-02-23 09:45:39 +0000] notice/Process: Running command '/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '-e' '/var/log/icinga2/error.log' '--reload-internal' '9364': PID 11773
[2017-02-23 09:45:39 +0000] information/Application: Received request to shut down.
[2017-02-23 09:45:40 +0000] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from 'satellite02.example.net'
[2017-02-23 09:45:40 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:40 +0000] notice/JsonRpcConnection: Received 'event::CheckResult' message from 'satellite02.example.net'
[2017-02-23 09:45:40 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:40 +0000] notice/ApiListener: Relaying 'event::CheckResult' message
[2017-02-23 09:45:40 +0000] information/Application: Shutting down...
[2017-02-23 09:45:40 +0000] information/CheckerComponent: Checker stopped.

configuration of all nodes seems to be correct

$ service icinga2 checkconfig
[ ok ] checking Icinga2 configuration.

The cluster setup is fairly simple, as I wrote before 2 masters, 2 satellites and 1 agent (I'm omitting endpoints configuration):

object Zone "example.net"  {
  endpoints = [ "satellite01.example.net", "satellite02.example.net", ]
  parent = "master"
}

object Zone "dev01.example.net"  {
  endpoints = [ "dev01.example.net", ]
  parent = "example.net"
}

object Zone "global-templates"  {
  global = true
}

object Zone "master"  {
  endpoints = [ "icinga01.example.net", "icinga02.example.net", ]
}

A workaround seems to be a satellite restart after each master change.

deric commented 7 years ago

I'm managing Icinga2's configuration with Puppet. It looks like the satellites crashes each time the two masters are out of sync. i.e. puppet agent run intervals on both masters would need to be synchronized in order to make this work.

dnsmichi commented 7 years ago

What do you mean by "two masters are out of sync"? You're not using the zones.d/ config sync mechanism are you?

deric commented 7 years ago

Both masters' configuration is (the same) and generated by Puppet. In case that primary master fails, the secondary should kick in. Is that wrong assumption?

Master's configuration looks like this:

├── zones.conf
└── zones.d
    ├── satellite
    │   ├── hosts.conf
    │   ├── services.conf
    │   └── system.conf
    ├── global-templates
    │   ├── groups.conf
    │   ├── hostgroups.conf
    │   ├── system.conf
    │   └── templates.conf
    ├── master

satellites are configured via config sync mechanism.

Puppet agent is configured to run periodically (each 30min) but it's not guaranteed that both masters would be updated at the same time.

dnsmichi commented 7 years ago

That way you'll end up with two config masters in a zone which leads to unreliable effects between both and might also explain why your satellite is going nuts. As explained in the distributed monitoring docs, you need to ensure that there's only one config master inside your master zone. That's done by populating zones.d/ with content. The secondary master must not have the config put there.

deric commented 7 years ago

Sorry, I've overlooked that. I've simplified secondary master configuration - basically it's empty now. Which features are needed for secondary master? Currently I have:

api checker debuglog influxdb mainlog

icinga2 daemon -C:

information/cli: Icinga application loader (version: r2.6.2-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
information/ApiListener: My API identity: icinga01.example.net
information/ConfigItem: Instantiated 1 ApiUser.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 4 Zones.
information/ConfigItem: Instantiated 2 FileLoggers.
information/ConfigItem: Instantiated 5 Endpoints.
information/ConfigItem: Instantiated 67 CheckCommands.
information/ConfigItem: Instantiated 43 Hosts.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 4 HostGroups.
information/ConfigItem: Instantiated 290 Services.
information/ConfigItem: Instantiated 3 ServiceGroups.
information/ConfigItem: Instantiated 1 CompatLogger.
information/ConfigItem: Instantiated 1 StatusDataWriter.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
information/ConfigItem: Instantiated 1 InfluxdbWriter.
information/ConfigItem: Instantiated 1 PerfdataWriter.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).

I've had there recursive group declaration, which probably caused the memory leak.

Right now it seems to be working, just systemd complains about satellite's processes:

icinga2.service: Supervising process 9908 which is not our child. We'll most likely not notice when it exits.
dnsmichi commented 7 years ago

That's the second process forked for executing plugins and commands, right?

deric commented 7 years ago

Yeah, those should be check commands.

Anyway, still after a single master update both satellites crashes and it takes two restarts to bring them up.

deric commented 7 years ago

The problem was, that due to files in zone.d the secondary master thought he was the leader, so it lead to a split-brain situation. Thanks for help!