Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Icinga2 startup fails, if network stack is not fully loaded. #6758

Closed jschanz closed 5 years ago

jschanz commented 5 years ago

Icinga2 startup fails if the network stack is not fully loaded. I'm not sure whether this is a systemd or an icinga2 problem.

Icinga2 can't determine the FQDN of the host if the startup of the network stack takes longer than usual (e.g. if you use a bridge and several network interfaces).

Icinga2 falls back to, or can only get, the short hostname without the domain. It then fails to start because it cannot load the certs under that name.

2018-11-07T17:08:45.232722+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:45 +0100] critical/SSL: Error on bio X509 AUX reading pem file '/var/lib/icinga2/certs//icinga-01.crt': 33558530, "error:02001002:lib(2):func(1):reason(2)"
2018-11-07T17:08:45.256520+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:45 +0100] critical/config: Error: Cannot get certificate from cert path: '/var/lib/icinga2/certs//icinga-01.crt'.
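
For readers wondering what the number in the critical/SSL line means: a minimal sketch that unpacks it, assuming the OpenSSL 1.x error code layout (8-bit library, 12-bit function, 12-bit reason; OpenSSL 3.x uses a different layout). The helper name `decode_openssl_error` is made up for illustration. Library 2 is the system library, where the reason field carries the errno value.

```python
import errno
import os


def decode_openssl_error(code: int) -> dict:
    """Split a packed OpenSSL 1.x error code into lib, func and reason."""
    return {
        "lib": (code >> 24) & 0xFF,
        "func": (code >> 12) & 0xFFF,
        "reason": code & 0xFFF,
    }


# 33558530 is the decimal value from the log line, i.e. 0x02001002.
parts = decode_openssl_error(33558530)
print(parts)  # {'lib': 2, 'func': 1, 'reason': 2}

# For the system library the reason is an errno; 2 is ENOENT.
print(errno.errorcode[parts["reason"]], os.strerror(parts["reason"]))
```

So the SSL error is really a plain "file not found" for the short-name certificate path, not a problem with the certificate contents.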

The hostname is "icinga-01", the domain is "localdomain.local", and the FQDN is "icinga-01.localdomain.local".

The certs are stored using the FQDN naming scheme:

icinga-01.localdomain.local:/etc/sysconfig/network # ll /var/lib/icinga2/certs/
total 16
-rw-rw---- 1 icinga icinga 1720 Oct 25 06:37 ca.crt
-rw-rw---- 1 icinga icinga 1785 Oct 25 06:37 trusted-master.crt
-rw-rw---- 1 icinga icinga 1777 Oct 25 06:37 icinga-01.localdomain.local.crt
-rw------- 1 icinga icinga 3243 Oct 25 06:37 icinga-01.localdomain.local.key
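
To make the mismatch explicit, here is a small sketch that builds the certificate path the way the error message suggests (cert directory plus node name plus ".crt") and checks it against the listing above. The function names are hypothetical and this is not Icinga 2's actual code.

```python
import posixpath

CERT_DIR = "/var/lib/icinga2/certs"

# File names copied from the directory listing in this report.
PRESENT = {
    "ca.crt",
    "trusted-master.crt",
    "icinga-01.localdomain.local.crt",
    "icinga-01.localdomain.local.key",
}


def expected_cert_path(cert_dir: str, node_name: str) -> str:
    """Path of the certificate that would be looked up for a node name."""
    return posixpath.join(cert_dir, node_name + ".crt")


def cert_is_present(node_name: str, listing=PRESENT) -> bool:
    """True if the certificate for this node name is in the listing."""
    path = expected_cert_path(CERT_DIR, node_name)
    return posixpath.basename(path) in listing


for name in ("icinga-01", "icinga-01.localdomain.local"):
    print(expected_cert_path(CERT_DIR, name), cert_is_present(name))
```

Only the FQDN yields a file that exists, which is why a fallback to the short hostname makes the ApiListener fail.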

Full log of the initialization:

2018-11-07T17:08:44.121362+01:00 icinga-01 systemd[1]: Starting system-network.slice.
2018-11-07T17:08:44.121581+01:00 icinga-01 systemd[1]: Created slice system-network.slice.
2018-11-07T17:08:44.126479+01:00 icinga-01 systemd[1]: Starting ifup managed network interface eth0...
2018-11-07T17:08:44.166293+01:00 icinga-01 ifup[1363]: eth0      device: Intel Corporation 82578DM Gigabit Network Connection (rev 05)
2018-11-07T17:08:44.167068+01:00 icinga-01 ifup[1363]:     eth0      device: Intel Corporation 82578DM Gigabit Network Connection (rev 05)
2018-11-07T17:08:44.437492+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:44 +0100] information/cli: Icinga application loader (version: r2.10.1-1)
2018-11-07T17:08:44.437715+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:44 +0100] information/cli: Loading configuration file(s).
2018-11-07T17:08:44.706756+01:00 icinga-01 kernel: [   24.758137] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
2018-11-07T17:08:44.807769+01:00 icinga-01 kernel: [   24.859013] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X
2018-11-07T17:08:44.807793+01:00 icinga-01 kernel: [   24.859162] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
2018-11-07T17:08:44.948371+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:44 +0100] information/ConfigItem: Committing config item(s).
2018-11-07T17:08:45.232722+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:45 +0100] critical/SSL: Error on bio X509 AUX reading pem file '/var/lib/icinga2/certs//icinga-01.crt': 33558530, "error:02001002:lib(2):func(1):reason(2)"
2018-11-07T17:08:45.256520+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:45 +0100] critical/config: Error: Cannot get certificate from cert path: '/var/lib/icinga2/certs//icinga-01.crt'.
2018-11-07T17:08:45.256679+01:00 icinga-01 icinga2[788]: Location: in /etc/icinga2/icinga2.conf: 29:1-29:24
2018-11-07T17:08:45.256830+01:00 icinga-01 icinga2[788]: /etc/icinga2/icinga2.conf(27): }
2018-11-07T17:08:45.256967+01:00 icinga-01 icinga2[788]: /etc/icinga2/icinga2.conf(28):
2018-11-07T17:08:45.257101+01:00 icinga-01 icinga2[788]: /etc/icinga2/icinga2.conf(29): object ApiListener "api" {
2018-11-07T17:08:45.257237+01:00 icinga-01 icinga2[788]: ^^^^^^^^^^^^^^^^^^^^^^^^
2018-11-07T17:08:45.257372+01:00 icinga-01 icinga2[788]: /etc/icinga2/icinga2.conf(30):   accept_commands = true
2018-11-07T17:08:45.257507+01:00 icinga-01 icinga2[788]: /etc/icinga2/icinga2.conf(31):   accept_config = true
2018-11-07T17:08:45.257649+01:00 icinga-01 icinga2[788]: [2018-11-07 17:08:45 +0100] critical/config: 1 error
2018-11-07T17:08:45.268125+01:00 icinga-01 systemd[1]: icinga2.service: main process exited, code=exited, status=1/FAILURE
2018-11-07T17:08:45.269265+01:00 icinga-01 systemd[1]: Failed to start Icinga host/service/network monitoring system.
2018-11-07T17:08:45.269488+01:00 icinga-01 systemd[1]: Unit icinga2.service entered failed state.
2018-11-07T17:08:46.159716+01:00 icinga-01 kernel: [   26.212521] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
2018-11-07T17:08:46.159734+01:00 icinga-01 kernel: [   26.212634] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
2018-11-07T17:08:46.159735+01:00 icinga-01 kernel: [   26.212670] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
2018-11-07T17:08:46.234230+01:00 icinga-01 systemd[1]: Started ifup managed network interface eth0.
2018-11-07T17:08:46.268042+01:00 icinga-01 systemd[1]: Expecting device sys-subsystem-net-devices-br0.device...
2018-11-07T17:08:46.268271+01:00 icinga-01 systemd[1]: Starting ifup managed network interface br0...
2018-11-07T17:08:46.321031+01:00 icinga-01 ifup[1942]: br0
2018-11-07T17:08:46.321739+01:00 icinga-01 ifup[1942]:     br0
2018-11-07T17:08:46.362713+01:00 icinga-01 kernel: [   26.415058] Bridge firewalling registered
2018-11-07T17:08:46.370519+01:00 icinga-01 ifup[1942]: br0       Ports: [eth0]
2018-11-07T17:08:46.370799+01:00 icinga-01 kernel: [   26.423412] device eth0 entered promiscuous mode
2018-11-07T17:08:46.373718+01:00 icinga-01 kernel: [   26.426703] br0: port 1(eth0) entered forwarding state
2018-11-07T17:08:46.373725+01:00 icinga-01 kernel: [   26.426707] br0: port 1(eth0) entered forwarding state
2018-11-07T17:08:46.374563+01:00 icinga-01 ifup-bridge[2019]:     br0       forwarddelay (see man ifcfg-bridge)
2018-11-07T17:08:46.375138+01:00 icinga-01 ifup[1942]: br0       forwarddelay (see man ifcfg-bridge) ... ready
2018-11-07T17:08:46.375375+01:00 icinga-01 systemd-sysctl[2047]: Overwriting earlier assignment of kernel/sysrq in file '/etc/sysctl.d/99-sysctl.conf'.
2018-11-07T17:08:46.376159+01:00 icinga-01 systemd[1]: Found device /sys/subsystem/net/devices/br0.
2018-11-07T17:08:46.378135+01:00 icinga-01 ifup-bridge[2019]: ... ready
2018-11-07T17:08:46.492138+01:00 icinga-01 systemd[1]: Started ifup managed network interface br0.
2018-11-07T17:08:46.504507+01:00 icinga-01 network[817]: ..done..done..doneSetting up service network  .  .  .  .  .  .  .  .  .  .  .  .  ...done
2018-11-07T17:08:46.505711+01:00 icinga-01 systemd[1]: Started LSB: Configure network interfaces and set up routing.
2018-11-07T17:08:46.505891+01:00 icinga-01 systemd[1]: Starting Network.
2018-11-07T17:08:46.507845+01:00 icinga-01 systemd[1]: Reached target Network.

If you restart the service after the system has fully started, everything works as expected and the service starts.

Expected Behavior

Shouldn't fail

Current Behavior

Fails sometimes, if initialization of network stack is slow

Possible Solution

Steps to Reproduce (for bugs)

Not reproducible every time: sometimes it works, sometimes it doesn't.

Your Environment

Copyright (c) 2012-2018 Icinga Development Team (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: openSUSE
  Platform version: 13.1 (Bottle)
  Kernel: Linux
  Kernel version: 3.11.10-29-desktop
  Architecture: i686

Build information:
  Compiler: GNU 4.8.1
  Build host: server342vmx

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /var/run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /var/run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /var/run/icinga2/icinga2.pid

openSUSE 13.1 (i586)
VERSION = 13.1
CODENAME = Bottle

jschanz commented 5 years ago

I think it has something to do with name resolution. If no entry is set in /etc/hosts, getaddrinfo fails without network. If an entry is set in /etc/hosts, the FQDN is resolved even without network. Maybe this only needs a documentation update recommending an entry in /etc/hosts, which I could also do later.

jschanz commented 5 years ago

I also get messages like these:

2018-11-08T03:05:51.734071+01:00 icinga-01 icinga2[694]: [2018-11-08 03:05:51 +0100] critical/TcpSocket: getaddrinfo() failed with error code -2, "Name or service not known"
2018-11-08T03:05:51.747005+01:00 icinga-01 icinga2[694]: [2018-11-08 03:05:51 +0100] critical/TcpSocket: getaddrinfo() failed with error code -2, "Name or service not known"

Crunsher commented 5 years ago

I looked this up yesterday: at startup Icinga calls getaddrinfo to get the FQDN; if that fails it falls back to the hostname, and if that fails too it uses 'localhost'.

I don't think there is anything we can do about this either, except document it :woman_shrugging:
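
The fallback chain described above can be sketched roughly like this. This is an illustration in Python, not Icinga 2's actual C++ code. The function name `resolve_node_name` and the injectable lookup parameters are assumptions made so the behaviour can be exercised without touching the real resolver. The sketch fetches the hostname first because getaddrinfo needs a name to canonicalise.

```python
import socket
from typing import Callable


def resolve_node_name(
    gethostname: Callable[[], str] = socket.gethostname,
    getaddrinfo: Callable[..., list] = socket.getaddrinfo,
) -> str:
    """Canonical name if resolvable, else the bare hostname, else 'localhost'."""
    try:
        hostname = gethostname()
    except OSError:
        return "localhost"
    if not hostname:
        return "localhost"
    try:
        infos = getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    except socket.gaierror:
        # No resolver answer (e.g. network not up, no /etc/hosts entry).
        return hostname
    for _family, _type, _proto, canonname, _sockaddr in infos:
        if canonname:
            return canonname
    return hostname


def failing_lookup(*args, **kwargs):
    """Stand-in for a resolver that cannot answer during early boot."""
    raise socket.gaierror(-2, "Name or service not known")


def working_lookup(*args, **kwargs):
    """Stand-in for a resolver that knows the FQDN (address is made up)."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6,
             "icinga-01.localdomain.local", ("192.0.2.10", 0))]


print(resolve_node_name(lambda: "icinga-01", failing_lookup))
print(resolve_node_name(lambda: "icinga-01", working_lookup))
```

With the failing lookup the result is the short name, which matches the certificate path seen in the error log.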

dgoetz commented 5 years ago

Just to be sure, @jschanz, can you show the content of the systemd icinga2.service unit?

It should contain After=... network-online.target ..., which should be enough. If it is not enough, as in your case, make sure the wait-online service that corresponds to your network management daemon is enabled (systemctl is-enabled NetworkManager-wait-online.service systemd-networkd-wait-online.service). If that is still not enough, I would say it is a problem with that daemon rather than with Icinga 2.

For further details, see https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
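
As a quick way to check this on a given unit file, here is a minimal sketch that reports whether a target is named in After=, Wants= or Requires=. The helper `unit_dependencies` is hypothetical and deliberately naive (no drop-ins, no line continuations). Keep in mind that After= only orders units; it does not pull the target into the boot transaction, which Wants= or Requires= would do.

```python
def unit_dependencies(unit_text: str, target: str = "network-online.target") -> dict:
    """Report which of After=, Wants= and Requires= name the given target."""
    found = {"After": False, "Wants": False, "Requires": False}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments and section headers such as [Unit].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in found and target in value.split():
            found[key] = True
    return found


# [Unit] section as posted for icinga2.service in this thread.
SAMPLE_UNIT = """\
[Unit]
Description=Icinga host/service/network monitoring system
After=syslog.target network-online.target postgresql.service mariadb.service carbon-cache.service carbon-relay.service
"""

print(unit_dependencies(SAMPLE_UNIT))
```

For that unit the ordering is present but nothing requires the target, so whether the ordering has any effect depends on something else activating network-online.target.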

jschanz commented 5 years ago

@dgoetz

[Unit]
Description=Icinga host/service/network monitoring system
After=syslog.target network-online.target postgresql.service mariadb.service carbon-cache.service carbon-relay.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/icinga2
ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2
ExecStart=/usr/sbin/icinga2 daemon -e /var/log/icinga2/error.log
PIDFile=/var/run/icinga2/icinga2.pid
ExecReload=/usr/lib/icinga2/safe-reload /etc/sysconfig/icinga2
TimeoutStartSec=30m

# Systemd >228 enforces a lower process number for services.
# Depending on the distribution and Systemd version, this must
# be explicitly raised. Packages will set the needed values
# into /etc/systemd/system/icinga2.service.d/limits.conf
#
# Please check the troubleshooting documentation for further details.
# The values below can be used as examples for customized service files.

#TasksMax=infinity
#LimitNPROC=62883

[Install]
WantedBy=multi-user.target

The "Network" target is reached only after icinga2 has been started:

2018-11-07T17:08:46.507845+01:00 icinga-01 systemd[1]: Reached target Network.

but icinga2 had already failed more than a second earlier:

2018-11-07T17:08:45.269265+01:00 icinga-01 systemd[1]: Failed to start Icinga host/service/network monitoring system.

I can reproduce this now. Unplug the network cable and use the following /etc/hosts:

#
# hosts         This file describes a number of hostname-to-address
#               mappings for the TCP/IP subsystem.  It is mostly
#               used at boot time, when no name servers are running.
#               On small systems, this file can be used instead of a
#               "named" name server.
# Syntax:
#    
# IP-Address  Full-Qualified-Hostname  Short-Hostname
#
127.0.0.1   localhost.localdomain localhost

With this setup no address resolution (local, DNS, etc.) is possible. Icinga2 is unable to determine the FQDN with getaddrinfo, fails to find the certs in /var/lib/icinga2/certs/, and therefore won't start.
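
To illustrate why this hosts file gives no FQDN: a simplified sketch of a hosts lookup. It is not glibc's resolver, and the helper name `fqdn_from_hosts` is made up. It treats the first name on a line as the canonical name and the rest as aliases. The second sample adds a hypothetical entry using the documentation address 192.0.2.10.

```python
from typing import Optional


def fqdn_from_hosts(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the canonical name of the first hosts line that lists hostname."""
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) < 2:
            continue
        names = fields[1:]  # fields[0] is the IP address
        if hostname in names:
            return names[0]
    return None


# The only active line of the hosts file posted above.
POSTED_HOSTS = "127.0.0.1   localhost.localdomain localhost\n"

# Hypothetical extra entry; the address is a placeholder.
WITH_ENTRY = POSTED_HOSTS + "192.0.2.10  icinga-01.localdomain.local icinga-01\n"

print(fqdn_from_hosts(POSTED_HOSTS, "icinga-01"))
print(fqdn_from_hosts(WITH_ENTRY, "icinga-01"))
```

Without an entry for the host itself, the lookup has nothing local to answer from and depends entirely on the network being up.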

jschanz commented 5 years ago

Tested on SLES and openSUSE; needs more testing in other environments. Remove the host's own entry from /etc/hosts and reboot. The icinga2 service should now start after successful network initialization.

dgoetz commented 5 years ago

I tried to reproduce this on CentOS 7. With NetworkManager.service and NetworkManager-wait-online.service enabled, Icinga 2 is always started after networking. Enabling the old network.service and disabling NetworkManager.service and NetworkManager-wait-online.service gave me the same problem. Disabling network.service and enabling only NetworkManager.service also did not cause a problem. So it depends entirely on the network management service.

With an additional Requires it also works with network.service alone. Although the Requires can delay system startup, I would say let's add it.