Icinga / icinga-core

Icinga 1.x, the old core (EOL 31.12.2018)
GNU General Public License v2.0

Icinga daemon 'Caught SIGSEGV, shutting down...' post check-mk-livestatus update #1596

chrisweeksnz closed this issue 7 years ago

chrisweeksnz commented 7 years ago

For Ubuntu trusty (and likely other Ubuntu releases), the icinga daemon fails when log rotation occurs.

The log breadcrumb that helped identify the issue was in /var/log/icinga/icinga.log:

[1498824000] Caught SIGSEGV, shutting down...

This behaviour is almost certainly caused by a recent patch to check-mk. The patch was designed to fix exactly this problem for nagios 3.5, which was previously segfaulting at log rotation, but it has now caused the same failure to appear in icinga. Please see the check-mk packaging and patching discussion for details of how the check-mk package was modified: https://bugs.launchpad.net/ubuntu/+source/check-mk/+bug/1372284.

I imagine the issue is a mismatch in downtime.h between check-mk and icinga, as described in the launchpad bug.

For any Ubuntu trusty users who come across this issue, a temporary workaround should be to apt-pin check-mk-livestatus at version 1.2.2p3-1 until a solution is identified.
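For reference, a pin along these lines could go in a file such as /etc/apt/preferences.d/check-mk-livestatus. This is a sketch only: the file name is an arbitrary choice, and the priority of 1001 is an assumption, picked so that apt will also downgrade a package that has already been upgraded. It also assumes version 1.2.2p3-1 is still available from a configured repository.

```
Package: check-mk-livestatus
Pin: version 1.2.2p3-1
Pin-Priority: 1001
```

After adding the file, `apt-cache policy check-mk-livestatus` should show which version apt now considers the candidate.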

dnsmichi commented 7 years ago

Judging by http://git.mathias-kettner.de/git/?p=omd.git;a=blob;f=packages/nagios/patches/0007-fix_downtime_struct.dif;h=af0e245b585e78c372a69d10c5e3b47ab64ad510;hb=HEAD it looks like Nagios 3.5.0 broke the NEB API in a minor version by changing the position of attributes inside structs. API changes without releasing a major version are generally dangerous. We tried that in the past with Icinga, but always rolled back.

The change was introduced in 2013 but wasn't fixed by Nagios itself since 3.x is deprecated.

https://github.com/NagiosEnterprises/nagioscore/commits/5ad639f20755a8f915273437a0c1fd6e72e953e3/include

Essentially this change breaks it: https://github.com/NagiosEnterprises/nagioscore/commit/3e721af9e1b1e1c6332f07799f9441b790b9cb70#diff-0649fd602b3d24e69683d065abfd94bdR49

CheckMK / OMD adopted these changes and is now built against the new NEB API from Nagios 3.5.0, so it will break on older Nagios versions and also on Icinga.

I don't think that Icinga should change anything here. The NEB API changes for 3.5.0 should be reverted, and sanitized for any core using the Nagios 3.x-compatible header files. Since OMD already uses a highly patched Nagios Core, there's a good chance the downtime struct can be fixed there by moving the additional attribute to the end.
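To illustrate why the position of the new attribute matters, here is a minimal C sketch. The struct and member names are hypothetical and heavily simplified; they are not the real definitions from downtime.h. It only shows the general effect: a member inserted in the middle shifts the byte offsets of everything after it, so a module compiled against one header reads the wrong memory when loaded into a core built with the other, whereas a member appended at the end leaves the existing offsets alone.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical layout used by an older core (and by Icinga in this sketch). */
struct downtime_old {
    int    type;
    time_t start_time;
    time_t end_time;
    int    fixed;
};

/* Hypothetical layout after a new member is inserted mid-struct. */
struct downtime_inserted {
    int    type;
    time_t start_time;
    time_t flex_start;   /* new member: pushes end_time and fixed further down */
    time_t end_time;
    int    fixed;
};

/* Hypothetical layout with the same new member appended at the end instead. */
struct downtime_appended {
    int    type;
    time_t start_time;
    time_t end_time;
    int    fixed;
    time_t flex_start;   /* new member: existing members keep their offsets */
};

/* Returns 1 if end_time sits at a different offset once a member is inserted. */
int end_time_moved_when_inserted(void)
{
    return offsetof(struct downtime_old, end_time)
        != offsetof(struct downtime_inserted, end_time);
}

/* Returns 1 if end_time sits at a different offset once a member is appended. */
int end_time_moved_when_appended(void)
{
    return offsetof(struct downtime_old, end_time)
        != offsetof(struct downtime_appended, end_time);
}
```

With the inserted layout, a module built against the old header would read the new member where it expects end_time. That is the kind of mismatch that can end in a SIGSEGV once pointer members are involved.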