I have an HA setup with 2 masters and 2 satellites. After updating the configuration on a master, one satellite dies while the other stays in the running state. The parent process becomes a zombie (visible in the systemd logs).
I'm running the Icinga satellite in an LXC container with a memory limit of 2 GB (I hope that should be enough). The system runs on Debian 8 (Jessie) with the latest icinga2 version (2.6.2-1~debmon8+1). On the host I've noticed segfaults in the satellite's container.
Can you please check whether there's any crash log available in /var/log/icinga2? I'd love to see the stack trace the SIGSEGV is causing.
If you are able to reliably reproduce the issue (e.g. by issuing a master config reload with new configuration applied), please add more configuration details (zones.conf, output of icinga2 --version and icinga2 daemon -C on a master and the affected satellite).
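Something along these lines on the master and the affected satellite should collect everything (a sketch; paths assume the default Debian packages):
icinga2 --version                 # version and build information
icinga2 daemon -C                 # validate the configuration
cat /etc/icinga2/zones.conf       # zone/endpoint layout
ls /var/log/icinga2/crash/        # crash reports are usually written here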
There's no crash log. The standard log /var/log/icinga2/icinga2.log always ends with the following:
[2017-02-23 09:07:40 +0000] information/ApiListener: Updating configuration file: /var/lib/icinga2/api/zones/global-templates//.timestamp
[2017-02-23 09:07:40 +0000] information/ApiListener: Restarting after configuration change.
[2017-02-23 09:07:40 +0000] information/ApiListener: Finished sending replay log for endpoint 'agent01.example.net'.
[2017-02-23 09:07:43 +0000] information/Application: Got reload command: Starting new instance.
[2017-02-23 09:07:43 +0000] information/Application: Received request to shut down.
[2017-02-23 09:07:43 +0000] information/Application: Shutting down...
[2017-02-23 09:07:43 +0000] information/CheckerComponent: Checker stopped.
icinga2 --version:
icinga2 - The Icinga 2 network monitoring daemon (version: r2.6.2-1)
Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Application information:
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
System information:
Platform: Debian GNU/Linux
Platform version: 8 (jessie)
Kernel: Linux
Kernel version: 3.16.0-4-amd64
Architecture: x86_64
Build information:
Compiler: GNU 4.9.2
Build host: smithers
The segfault above was caused by insufficient memory. I realized that one container was configured with only 1 GB, which wasn't enough for a configuration reload. Now both satellites are configured with 4 GB, and the issue still seems to be easily reproducible.
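For completeness, the container memory limit is set in the LXC config, roughly like this (a sketch; the container name and path are placeholders, and the exact cgroup key depends on the LXC version):
# /var/lib/lxc/satellite01/config
lxc.cgroup.memory.limit_in_bytes = 4G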
For example, when the masters are down, the satellites keep running. However, when a master comes back online, the satellite crashes:
[2017-02-23 09:45:39 +0000] information/ApiListener: Restarting after configuration change.
[2017-02-23 09:45:39 +0000] information/ApiListener: Replayed 3327 messages.
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replayed 3327 messages.
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1487843139
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replaying log: /var/lib/icinga2/api/log/1487843140
[2017-02-23 09:45:39 +0000] notice/ApiListener: Replayed 0 messages.
[2017-02-23 09:45:39 +0000] information/ApiListener: Finished sending replay log for endpoint 'icinga01.example.net'.
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Executing check for 'w03.example.net!ping4'
[2017-02-23 09:45:39 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:39 +0000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'satellite02.example.net'
[2017-02-23 09:45:39 +0000] notice/Process: Running command '/usr/lib/nagios/plugins/check_ping' '-H' '136.243.73.41' '-c' '200,15%' '-w' '100,5%': PID 11769
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Check finished for object 'w03.example.net!ping4'
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Executing check for 'stats01.example.net!ping4'
[2017-02-23 09:45:39 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:39 +0000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'satellite02.example.net'
[2017-02-23 09:45:39 +0000] notice/Process: Running command '/usr/lib/nagios/plugins/check_ping' '-H' '138.201.20.69' '-c' '200,15%' '-w' '100,5%': PID 11771
[2017-02-23 09:45:39 +0000] debug/CheckerComponent: Check finished for object 'stats01.example.net!ping4'
[2017-02-23 09:45:39 +0000] information/Application: Got reload command: Starting new instance.
[2017-02-23 09:45:39 +0000] notice/Process: Running command '/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '-e' '/var/log/icinga2/error.log' '--reload-internal' '9364': PID 11773
[2017-02-23 09:45:39 +0000] information/Application: Received request to shut down.
[2017-02-23 09:45:40 +0000] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from 'satellite02.example.net'
[2017-02-23 09:45:40 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:40 +0000] notice/JsonRpcConnection: Received 'event::CheckResult' message from 'satellite02.example.net'
[2017-02-23 09:45:40 +0000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2017-02-23 09:45:40 +0000] notice/ApiListener: Relaying 'event::CheckResult' message
[2017-02-23 09:45:40 +0000] information/Application: Shutting down...
[2017-02-23 09:45:40 +0000] information/CheckerComponent: Checker stopped.
The configuration of all nodes seems to be correct:
$ service icinga2 checkconfig
[ ok ] checking Icinga2 configuration.
The cluster setup is fairly simple, as I wrote before: 2 masters, 2 satellites and 1 agent (I'm omitting the endpoint configuration):
object Zone "example.net" {
  endpoints = [ "satellite01.example.net", "satellite02.example.net", ]
  parent = "master"
}

object Zone "dev01.example.net" {
  endpoints = [ "dev01.example.net", ]
  parent = "example.net"
}

object Zone "global-templates" {
  global = true
}

object Zone "master" {
  endpoints = [ "icinga01.example.net", "icinga02.example.net", ]
}
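The omitted Endpoint objects are plain definitions along these lines (a sketch; the host/port attributes are assumptions, the real objects are generated by Puppet):
object Endpoint "satellite01.example.net" {
  host = "satellite01.example.net"
  port = 5665
}

object Endpoint "satellite02.example.net" {
  host = "satellite02.example.net"
  port = 5665
}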
A workaround seems to be a satellite restart after each master change.
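Roughly what I do after each config change on a master (a sketch; assumes SSH access to the satellites):
# restart both satellites once the masters have reloaded
for sat in satellite01.example.net satellite02.example.net; do
  ssh root@"$sat" systemctl restart icinga2
done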
I'm managing Icinga2's configuration with Puppet. It looks like the satellites crash each time the two masters are out of sync, i.e. the puppet agent run intervals on both masters would need to be synchronized in order to make this work.
What do you mean by "two masters are out of sync"? You're not using the zones.d/ config sync mechanism, are you?
Both masters' configuration is the same and is generated by Puppet. In case the primary master fails, the secondary should kick in. Is that a wrong assumption?
The master's configuration looks like this:
├── zones.conf
└── zones.d
├── satellite
│ ├── hosts.conf
│ ├── services.conf
│ └── system.conf
├── global-templates
│ ├── groups.conf
│ ├── hostgroups.conf
│ ├── system.conf
│ └── templates.conf
├── master
The satellites are configured via the config sync mechanism.
The Puppet agent is configured to run periodically (every 30 minutes), but it's not guaranteed that both masters will be updated at the same time.
That way you'll end up with two config masters in a zone, which leads to unreliable effects between the two and might also explain why your satellite is going nuts. As explained in the distributed monitoring docs, you need to ensure that there's only one config master inside your master zone. That's done by populating zones.d/ with content on that node only; the secondary master must not have the config put there.
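In practice the layout should look roughly like this (a sketch based on your listing above):
# icinga01.example.net - the single config master: zones.d/ carries the config
/etc/icinga2/zones.d/master/
/etc/icinga2/zones.d/satellite/
/etc/icinga2/zones.d/global-templates/

# icinga02.example.net - secondary master: zones.d/ stays empty; it receives the
# zone configuration from icinga01 via the cluster config sync
# (staged under /var/lib/icinga2/api/zones/)
/etc/icinga2/zones.d/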
Sorry, I've overlooked that. I've simplified the secondary master configuration - basically it's empty now. Which features are needed for the secondary master? Currently I have:
api checker debuglog influxdb mainlog
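(For reference, that list is what icinga2 feature list reports as enabled there; toggling works with the usual CLI, e.g. the following sketch.)
icinga2 feature list                  # show enabled/disabled features
icinga2 feature disable debuglog      # example only; not sure which ones a secondary really needs
systemctl reload icinga2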
icinga2 daemon -C:
information/cli: Icinga application loader (version: r2.6.2-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
information/ApiListener: My API identity: icinga01.example.net
information/ConfigItem: Instantiated 1 ApiUser.
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 4 Zones.
information/ConfigItem: Instantiated 2 FileLoggers.
information/ConfigItem: Instantiated 5 Endpoints.
information/ConfigItem: Instantiated 67 CheckCommands.
information/ConfigItem: Instantiated 43 Hosts.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 4 HostGroups.
information/ConfigItem: Instantiated 290 Services.
information/ConfigItem: Instantiated 3 ServiceGroups.
information/ConfigItem: Instantiated 1 CompatLogger.
information/ConfigItem: Instantiated 1 StatusDataWriter.
information/ConfigItem: Instantiated 1 ExternalCommandListener.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
information/ConfigItem: Instantiated 1 InfluxdbWriter.
information/ConfigItem: Instantiated 1 PerfdataWriter.
information/ConfigItem: Instantiated 1 NotificationComponent.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).
I had a recursive group declaration there, which probably caused the memory leak.
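Roughly what it looked like (a sketch with made-up group names): two host groups listing each other as nested groups, which creates a cycle.
object HostGroup "web-servers" {
  groups = [ "all-servers" ]    // "web-servers" nested inside "all-servers"
}

object HostGroup "all-servers" {
  groups = [ "web-servers" ]    // ...and "all-servers" inside "web-servers": a cycle
}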
Right now it seems to be working; systemd just complains about the satellite's processes:
icinga2.service: Supervising process 9908 which is not our child. We'll most likely not notice when it exits.
That's the second process forked for executing plugins and commands, right?
Yeah, those should be check commands.
Anyway, after a single master update both satellites still crash, and it takes two restarts to bring them up.
The problem was that, due to files in zones.d, the secondary master thought it was the leader, which led to a split-brain situation.
Thanks for the help!