Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

[dev.icinga.com #10391] Icinga2 segfault #3505

Closed icinga-migration closed 8 years ago

icinga-migration commented 9 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10391

Created by sudv on 2015-10-19 06:43:12 +00:00

Assignee: sudv
Status: Closed (closed on 2015-12-17 09:36:18 +00:00)
Target Version: (none)
Last Update: 2015-12-17 09:36:18 +00:00 (in Redmine)

Icinga Version: 2.3.10
Backport?: Not yet backported
Include in Changelog: 1

Hi, I often get the following error:

>Oct 16 20:17:54 cmdb6 kernel: icinga2[24047]: segfault at 0 ip 00007fb20df7de24 sp 00007fb20e37b640 error 4 in libboost_thread-mt.so.1.53.0[7fb20df72000+15000]
>Oct 16 20:17:54 cmdb6 systemd: icinga2.service: main process exited, code=killed, status=11/SEGV
>Oct 16 20:17:54 cmdb6 systemd: Unit icinga2.service entered failed state.

CentOS 7, Icinga v2.3.10

[root@cmdb6 log]# uname -a
Linux cmdb6.odusb.so 3.10.0-229.14.1.el7.x86_64 #1 SMP Tue Sep 15 15:05:51 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

The bug was also present in older versions. With 2.3.9:

>Sep 14 00:01:21 cmdb7 kernel: icinga2[2536]: segfault at 0 ip 00007f35e626ce24 sp 00007f35e666a640 error 4 in libboost_thread-mt.so.1.53.0[7f35e6261000+15000]
>Sep 14 00:01:21 cmdb7 systemd: icinga2.service: main process exited, code=killed, status=11/SEGV
>Sep 14 00:01:21 cmdb7 systemd: Unit icinga2.service entered failed state.

With 2.3.8:

>Aug 19 06:02:35 cmdb6 kernel: icinga2[19876]: segfault at 0 ip 00007f5c5b8fde24 sp 00007f5c5bcfb640 error 4 in libboost_thread-mt.so.1.53.0[7f5c5b8f2000+15000]
>Aug 19 06:02:35 cmdb6 systemd: icinga2.service: main process exited, code=killed, status=11/SEGV

If there is anything else I should provide, please let me know.
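The kernel lines quoted above can be decoded a little further. The following is an editorial sketch, not part of the original report: it parses one such line, subtracts the mapping base from the instruction pointer to get the offset inside the library, and interprets the x86 page-fault error code (bit 0: page present, bit 1: write access, bit 2: user mode). The function name `decode_segfault` and the regular expression are assumptions made for this illustration.

```python
import re

# Pattern for the x86 kernel "segfault at ..." message as quoted in this report.
# All numeric fields, including the error code, are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode one kernel segfault line (hypothetical helper for this thread)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "library": m.group("lib"),
        "fault_address": int(m.group("addr"), 16),
        # Offset of the faulting instruction relative to the start of the mapping.
        "offset_in_library": ip - base,
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


if __name__ == "__main__":
    line = ("Oct 16 20:17:54 cmdb6 kernel: icinga2[24047]: segfault at 0 "
            "ip 00007fb20df7de24 sp 00007fb20e37b640 error 4 in "
            "libboost_thread-mt.so.1.53.0[7fb20df72000+15000]")
    info = decode_segfault(line)
    print(hex(info["offset_in_library"]), info["access"], info["mode"])
```

Applied to the segfault lines quoted in this thread, the offset comes out as 0xbe24 each time, and "segfault at 0" with error 4 corresponds to a user-mode read of address 0. That is consistent with a null pointer being dereferenced at the same instruction in libboost_thread on every crash, although it does not say why the pointer was null.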


icinga-migration commented 9 years ago

Updated by gbeutner on 2015-10-19 07:21:06 +00:00

How can this problem be reproduced?

Also, are there any files in /var/log/icinga2/crash?

icinga-migration commented 9 years ago

Updated by sudv on 2015-10-19 07:58:47 +00:00

The problem occurs after a few days of operation (this icinga2 instance was started on October 9 and crashed on October 16).

I have crash reports (in /var/log/icinga2/crash/), but they are from other dates:

> -rw-r--r-- 1 icinga icinga 9618 Oct 2 04:26 report.1443749180.524965
> -rw-r--r-- 1 icinga icinga 9630 Oct 2 05:54 report.1443754455.293607
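The numbers in the crash report file names look like Unix timestamps. A minimal sketch, assuming the first numeric component is seconds since the epoch and that the server runs at UTC+3 (the offset shown in the icinga2.log lines in this thread), converts a name back into a local time. The helper name `crash_report_time` is made up for this example, and the second numeric component is ignored because its meaning is not stated here.

```python
from datetime import datetime, timedelta, timezone


def crash_report_time(filename, utc_offset_hours=3):
    """Return the time encoded in a name like 'report.1443749180.524965'.

    Assumption: the first number is Unix epoch seconds. The second number
    is not interpreted.
    """
    prefix, seconds, _fraction = filename.split(".")
    if prefix != "report":
        raise ValueError("unexpected file name: " + filename)
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(int(seconds), tz)


if __name__ == "__main__":
    for name in ("report.1443749180.524965", "report.1443754455.293607"):
        print(name, "->", crash_report_time(name).isoformat())
```

Under these assumptions both names decode to October 2 at 04:26 and 05:54 local time, which matches the modification times in the listing, so the file name alone is enough to tell which crash a report belongs to.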

/var/log/icinga2/icinga2.log before crash: ...

>[2015-10-16 20:17:27 +0300] information/ApiClient: No messages for identity 'Sungurov-HP.odusb.so' have been received in the last 60 seconds.
>[2015-10-16 20:17:27 +0300] warning/ApiClient: API client disconnected for identity 'Sungurov-HP.odusb.so'
>[2015-10-16 20:17:27 +0300] information/ApiClient: No messages for identity 'Sungurov-HP.odusb.so' have been received in the last 60 seconds.
>[2015-10-16 20:17:27 +0300] warning/ApiClient: API client disconnected for identity 'Sungurov-HP.odusb.so'
>[2015-10-16 20:17:29 +0300] information/ApiListener: New client connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)
>[2015-10-16 20:17:32 +0300] critical/checker: Exception occured while checking 'cmdb4.odusb.so': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
>[2015-10-16 20:17:34 +0300] critical/ApiListener: Cannot accept new connection.
>[2015-10-16 20:17:35 +0300] information/ApiListener: New client connection for identity 'cmdb4.odusb.so'
>[2015-10-16 20:17:35 +0300] information/ApiListener: Syncing global zone 'global-templates'.
>[2015-10-16 20:17:37 +0300] warning/ApiClient: Error while sending JSON-RPC message for identity 'cmdb4.odusb.so'
>[2015-10-16 20:17:37 +0300] warning/ApiClient: API client disconnected for identity 'cmdb4.odusb.so'
>[2015-10-16 20:17:37 +0300] warning/ApiListener: Removing API client for endpoint 'cmdb4.odusb.so'. 1 API clients left.
>[2015-10-16 20:17:39 +0300] information/ApiListener: New client connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)
>[2015-10-16 20:17:42 +0300] information/ApiClient: No messages for identity 'Sungurov-HP.odusb.so' have been received in the last 60 seconds.
>[2015-10-16 20:17:42 +0300] warning/ApiClient: API client disconnected for identity 'Sungurov-HP.odusb.so'
>[2015-10-16 20:17:42 +0300] information/ApiClient: No messages for identity 'Sungurov-HP.odusb.so' have been received in the last 60 seconds.
>[2015-10-16 20:17:42 +0300] warning/ApiClient: API client disconnected for identity 'Sungurov-HP.odusb.so'
>[2015-10-16 20:17:42 +0300] information/ApiClient: No messages for identity 'Sungurov-HP.odusb.so' have been received in the last 60 seconds.
>[2015-10-16 20:17:42 +0300] warning/ApiClient: API client disconnected for identity 'Sungurov-HP.odusb.so'
>[2015-10-16 20:17:44 +0300] information/ApiListener: New client connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)
>[2015-10-16 20:17:47 +0300] warning/ApiClient: Error while sending JSON-RPC message for identity 'proxy-odusb.odusb.so'
>[2015-10-16 20:17:47 +0300] warning/ApiClient: API client disconnected for identity 'proxy-odusb.odusb.so'
>[2015-10-16 20:17:47 +0300] warning/ApiListener: Removing API client for endpoint 'proxy-odusb.odusb.so'. 0 API clients left.
>[2015-10-16 20:17:49 +0300] information/ApiListener: New client connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)
>[2015-10-16 20:17:52 +0300] information/ApiClient: Reconnecting to API endpoint 'proxy-odusb.odusb.so' via host 'proxy-odusb.odusb.so' and port '5665'
>[2015-10-16 20:17:52 +0300] critical/checker: Exception occured while checking 'cmdb6.odusb.so!load': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
>[2015-10-16 20:17:52 +0300] critical/checker: Exception occured while checking 'cmdb7.odusb.so!disk /': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
>[2015-10-16 20:17:52 +0300] critical/checker: Exception occured while checking 'proxy-odusb.odusb.so': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
>[2015-10-16 20:17:52 +0300] information/ApiListener: New client connection for identity 'proxy-odusb.odusb.so'
>[2015-10-16 20:17:52 +0300] information/ApiListener: Syncing global zone 'global-templates'.
>[2015-10-16 20:17:52 +0300] critical/ApiListener: Cannot connect to host 'proxy-odusb.odusb.so' on port '5665'
>[2015-10-16 20:17:52 +0300] information/ApiListener: New client connection for identity 'proxy-odusb.odusb.so'
>[2015-10-16 20:17:52 +0300] information/ApiListener: Syncing global zone 'global-templates'.

icinga-migration commented 9 years ago

Updated by sudv on 2015-10-19 08:04:36 +00:00

On the machines where icinga2 runs as an agent, I restart it every day:

# cat /etc/crontab
...
#icinga
1 0 * * * root /usr/local/etc/rc.d/icinga2 restart

The problem does not occur there.

icinga-migration commented 9 years ago

Updated by sudv on 2015-10-23 03:00:04 +00:00

I updated Icinga to version 2.3.11 and got a different error message.

/var/log/messages

>Oct 23 05:15:28 cmdb6 kernel: icinga2[23039]: segfault at 0 ip 00007f4e98a5de24 sp 00007f4e98e5b640 error 4 in libboost_thread-mt.so.1.53.0[7f4e98a52000+15000]
>Oct 23 05:15:28 cmdb6 systemd: icinga2.service: main process exited, code=killed, status=11/SEGV

/var/log/icinga2/icinga.log

>[2015-10-23 05:15:19 +0300] information/ApiListener: New client connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)
>[2015-10-23 05:15:22 +0300] information/ApiListener: New client connection for identity 'cmdb8.odusb.so'
>[2015-10-23 05:15:22 +0300] information/ApiListener: Syncing global zone 'global-templates'.
>[2015-10-23 05:15:22 +0300] critical/ThreadPool: Exception thrown in event handler:
>Error: boost::thread_resource_error: Resource temporarily unavailable
> (0) libboost_thread-mt.so.1.53.0: void boost::throw_exception(boost::thread_resource_error const&) (+0x161) [0x7f4e98a62531]
> (1) libbase.so: icinga::WorkQueue::Enqueue(boost::function<void ()> const&, bool) (+0x603) [0x7f4e98012213]
> (2) libremote.so: icinga::ApiClient::SendMessage(boost::intrusive_ptr const&) (+0x216) [0x7f4e97740296]
> (3) libremote.so: icinga::ApiListener::SendConfigUpdate(boost::intrusive_ptr const&) (+0x94b) [0x7f4e9774947b]
> (4) libremote.so: icinga::ApiListener::NewClientHandler(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) (+0x2f4) [0x7f4e977545d4]
> (5) libbase.so: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x308) [0x7f4e98035228]
> (6) libboost_thread-mt.so.1.53.0: (+0xd24a) [0x7f4e98a5f24a]
> (7) libpthread.so.0: (+0x7df5) [0x7f4e9568edf5]
> (8) libc.so.6: clone (+0x6d) [0x7f4e95ba11ad]
>
>[2015-10-23 05:15:24 +0300] information/ApiListener: New client connection for identity 'cmdb4.odusb.so'
>[2015-10-23 05:15:24 +0300] information/ApiListener: Syncing global zone 'global-templates'.
>[2015-10-23 05:15:24 +0300] critical/ThreadPool: Exception thrown in event handler:
>Error: boost::thread_resource_error: Resource temporarily unavailable
> (0) libboost_thread-mt.so.1.53.0: void boost::throw_exception(boost::thread_resource_error const&) (+0x161) [0x7f4e98a62531]
> (1) libbase.so: icinga::WorkQueue::Enqueue(boost::function<void ()> const&, bool) (+0x603) [0x7f4e98012213]
> (2) libremote.so: icinga::ApiClient::SendMessage(boost::intrusive_ptr const&) (+0x216) [0x7f4e97740296]
> (3) libremote.so: icinga::ApiListener::SendConfigUpdate(boost::intrusive_ptr const&) (+0x94b) [0x7f4e9774947b]
> (4) libremote.so: icinga::ApiListener::NewClientHandler(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) (+0x2f4) [0x7f4e977545d4]
> (5) libbase.so: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x308) [0x7f4e98035228]
> (6) libboost_thread-mt.so.1.53.0: (+0xd24a) [0x7f4e98a5f24a]
> (7) libpthread.so.0: (+0x7df5) [0x7f4e9568edf5]
> (8) libc.so.6: clone (+0x6d) [0x7f4e95ba11ad]
>
>[2015-10-23 05:15:24 +0300] information/Checkable: Checking for configured notifications for object 'vCSA01-Bur.odusb.so!vCenter VM-tools'
>[2015-10-23 05:15:24 +0300] information/Checkable: Checkable 'vCSA01-Bur.odusb.so!vCenter VM-tools' does not have any notifications.
>[2015-10-23 05:15:24 +0300] information/ApiListener: New client connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)
>[2015-10-23 05:15:24 +0300] critical/ThreadPool: Exception thrown in event handler:
>Error: boost::thread_resource_error: Resource temporarily unavailable
> (0) libboost_thread-mt.so.1.53.0: void boost::throw_exception(boost::thread_resource_error const&) (+0x161) [0x7f4e98a62531]
> (1) libbase.so: icinga::WorkQueue::Enqueue(boost::function<void ()> const&, bool) (+0x603) [0x7f4e98012213]
> (2) libremote.so: icinga::ApiClient::SendMessage(boost::intrusive_ptr const&) (+0x216) [0x7f4e97740296]
> (3) libremote.so: icinga::ApiListener::ApiTimerHandler() (+0x15e0) [0x7f4e9774d790]
> (4) libbase.so: boost::signals2::detail::signal1_impl<void, boost::intrusive_ptr const&, boost::signals2::optional_last_value, int, std::less, boost::function>(boost::intrusive_ptr const&)>, boost::function<void (boost::signals2::connection const&, boost::intrusive_ptr const&)>, boost::signals2::mutex>::operator()(boost::intrusive_ptr >>const&) (+0x1bb) [0x7f4e980844ab]
> (5) libbase.so: icinga::Timer::Call() (+0x34) [0x7f4e98037134]
> (6) libbase.so: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x308) [0x7f4e98035228]
> (7) libboost_thread-mt.so.1.53.0: (+0xd24a) [0x7f4e98a5f24a]
> (8) libpthread.so.0: (+0x7df5) [0x7f4e9568edf5]
> (9) libc.so.6: clone (+0x6d) [0x7f4e95ba11ad]
>
>[2015-10-23 05:15:25 +0300] critical/checker: Exception occured while checking 'cmdb4.odusb.so!disk /': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
>[2015-10-23 05:15:27 +0300] critical/ApiListener: Cannot accept new connection.
>[2015-10-23 05:15:27 +0300] information/ApiListener: New client connection for identity 'cmdb8.odusb.so'
>[2015-10-23 05:15:27 +0300] information/ApiListener: Syncing global zone 'global-templates'.
>[2015-10-23 05:15:27 +0300] critical/ThreadPool: Exception thrown in event handler:
>Error: boost::thread_resource_error: Resource temporarily unavailable
> (0) libboost_thread-mt.so.1.53.0: void boost::throw_exception(boost::thread_resource_error const&) (+0x161) [0x7f4e98a62531]
> (1) libbase.so: icinga::WorkQueue::Enqueue(boost::function<void ()> const&, bool) (+0x603) [0x7f4e98012213]
> (2) libremote.so: icinga::ApiClient::SendMessage(boost::intrusive_ptr const&) (+0x216) [0x7f4e97740296]
> (3) libremote.so: icinga::ApiListener::SendConfigUpdate(boost::intrusive_ptr const&) (+0x94b) [0x7f4e9774947b]
> (4) libremote.so: icinga::ApiListener::NewClientHandler(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) (+0x2f4) [0x7f4e977545d4]
> (5) libbase.so: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x308) [0x7f4e98035228]
> (6) libboost_thread-mt.so.1.53.0: (+0xd24a) [0x7f4e98a5f24a]
> (7) libpthread.so.0: (+0x7df5) [0x7f4e9568edf5]
> (8) libc.so.6: clone (+0x6d) [0x7f4e95ba11ad]

icinga-migration commented 8 years ago

Updated by sulhan on 2015-10-24 09:51:06 +00:00

Just trying to help here, I'm not an expert in Icinga, so most of this is just black-box guessing.

This problem probably occurred because your system ran out of memory, or did not have enough left.

When ApiListener wants to listen for connections, it tries to detach from the main program by using fork (I have no idea why they use fork instead of select(); that is beyond my knowledge).

[2015-10-16 20:17:32 +0300] critical/checker: Exception occured while checking 'cmdb4.odusb.so': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'

A fork is a 1:1 copy of the program, which means that if the program uses N amount of memory, the fork will use the same amount. You can search for "fork error 11".

Since no memory is left, the program cannot create a new thread either:

(1) libbase.so: icinga::WorkQueue::Enqueue(boost::function const&, bool) (+0x603) [0x7f4e98012213]

If this is the case, the Icinga daemon should handle the exception better when creating a fork or a thread fails.

EDIT: which makes me think that there is probably a memory leak in the program, since the reporter said that the problem does not occur on the agent machines, where icinga2 is restarted every day.
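As an editorial aside on "fork error 11": on Linux, errno 11 is EAGAIN, which the fork(2) manual page associates with hitting a limit on the number of processes or threads, whereas a plain memory shortage is reported as ENOMEM. The sketch below only looks up the errno names on the current platform and prints the per-user process limit. The helper names `describe_errno` and `process_limit` are made up for this illustration, and the script says nothing about what actually happened on the reporter's server.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def process_limit():
    """Return the (soft, hard) RLIMIT_NPROC values for the current user."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    # On Linux, 11 is EAGAIN; the numeric value differs on other systems.
    print("errno 11:", describe_errno(11))
    print("ENOMEM:  ", describe_errno(errno.ENOMEM))
    print("RLIMIT_NPROC (soft, hard):", process_limit())
```

If the soft limit printed for the account that runs the daemon is low compared with the number of threads and child processes the daemon uses, that limit is a more likely source of EAGAIN than RAM. System-wide limits on threads and PIDs can have the same effect and are not covered by this sketch.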

icinga-migration commented 8 years ago

Updated by mfriedrich on 2015-10-26 16:02:56 +00:00

ApiListener does not call fork(). The error message originates from a different thread, which the Checker class uses to execute a check.

It seems that there are no more resources available on your system, which would indicate a memory leak.

Can you please run icinga2 using gdb and create a full backtrace? (Details: http://docs.icinga.org/icinga2/latest/doc/module/icinga2/chapter/debug#development-debug-gdb-backtrace)
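One inexpensive way to test the resource-exhaustion theory before the next crash is to record how many threads the daemon has over several days. A minimal sketch, assuming a Linux host where `/proc/<pid>/status` contains a `Threads:` field; the function names are made up for this illustration and this is not an Icinga tool:

```python
def thread_count(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'Threads:' field found")


def read_thread_count(pid):
    """Read the current thread count of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return thread_count(f.read())


if __name__ == "__main__":
    import os
    # Demonstration on the current process; for the daemon, pass its PID.
    print(read_thread_count(os.getpid()))
```

Run periodically against the icinga2 PID, a thread count that keeps climbing would point to leaked threads rather than leaked memory, which would fit both the 'fork' failures and the boost::thread_resource_error exceptions.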

icinga-migration commented 8 years ago

Updated by sudv on 2015-10-27 06:35:14 +00:00

1) I doubled the amount of memory in my computer (8 GB -> 16 GB).
2) I will try to get to grips with the debugger and create a full backtrace.

The configuration possibilities of the program are wonderful. Thank you for Icinga!

icinga-migration commented 8 years ago

Updated by sudv on 2015-10-28 02:29:17 +00:00

Increasing the memory has not solved the problem:

>...
>Oct 27 12:04:01 cmdb6 icinga2: [2015-10-27 12:04:01 +0300] information/cli: Icinga application loader (version: v2.3.11)
>Oct 27 12:04:01 cmdb6 icinga2: [2015-10-27 12:04:01 +0300] information/cli: Loading application type: icinga/IcingaApplication
>Oct 27 12:04:01 cmdb6 icinga2: [2015-10-27 12:04:01 +0300] information/Utility: Loading library 'libicinga.so'
>Oct 27 12:04:01 cmdb6 icinga2: [2015-10-27 12:04:01 +0300] information/ApiListener: My API identity: cmdb6.odusb.so
>...
>Oct 28 05:01:57 cmdb6 kernel: icinga2[29813]: segfault at 0 ip 00007f231c8d6e24 sp 00007f231ccd4640 error 4 in libboost_thread-mt.so.1.53.0[7f231c8cb000+15000]
>Oct 28 05:01:58 cmdb6 systemd: icinga2.service: main process exited, code=killed, status=11/SEGV
>Oct 28 05:01:58 cmdb6 systemd: Unit icinga2.service entered failed state.
>...

I'll use the debugger

icinga-migration commented 8 years ago

Updated by sudv on 2015-10-28 06:14:09 +00:00

I launched the debugger, ran the program, and am now watching an endless stream of output lines.

There is an error message at startup. Can I ignore it?

>[root@cmdb6 ~]# gdb --args /usr/sbin/icinga2 daemon -x debug -DUseVfork=0
>GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
>Copyright © 2013 Free Software Foundation, Inc.
>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>This is free software: you are free to change and redistribute it.
>There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>and "show warranty" for details.
>This GDB was configured as "x86_64-redhat-linux-gnu".
>For bug reporting instructions, please see:
><http://www.gnu.org/software/gdb/bugs/>...
>Traceback (most recent call last):
> File "", line 3, in
> File "/root/.gdb_printers/Boost-Pretty-Printer/boost_print/init.py", line 41, in
> from .common import register_printers, add_trivial_printer
> File "/root/.gdb_printers/Boost-Pretty-Printer/boost_print/common.py", line 113
> file=sys.stderr)
> ^
>SyntaxError: invalid syntax
>/root/.gdbinit:25: Error in sourced command file:
>Error while executing Python code.
>Reading symbols from /usr/sbin/icinga2...Reading symbols from /usr/lib/debug/usr/sbin/icinga2.debug...done.
>done.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2015-11-25 09:41:47 +00:00

Did you invoke "r" to actually run the debugged program?

icinga-migration commented 8 years ago

Updated by sudv on 2015-11-30 02:20:59 +00:00

I had installed the program as an agent on computers running Windows 7 and set up a connection to the server. Later, I reinstalled the program on the server, but I did not restore communication with the agents on the Windows 7 computers. The agents on these workstations were constantly trying to contact the server to establish a connection. Once I disabled the agents on the workstations, the program on the server became stable (a month of operation without an error). The crashes were probably caused by the code that handles connections to the server.
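Clients stuck in such a reconnect loop can be spotted in icinga2.log by counting "New client connection ... (unauthenticated)" lines per identity. The following is an editorial sketch based only on the log format quoted in this thread; the function name and the regular expression are assumptions. The thread does not state which hosts were the Windows 7 workstations, so the identity in the usage example is simply one that appears as unauthenticated in the excerpts.

```python
import re
from collections import Counter

# Matches the ApiListener connection lines quoted in this thread.
CONNECTION_RE = re.compile(
    r"information/ApiListener: New client connection for identity "
    r"'(?P<identity>[^']+)'(?P<unauth> \(unauthenticated\))?"
)


def count_unauthenticated(log_lines):
    """Count unauthenticated connection attempts per client identity."""
    counts = Counter()
    for line in log_lines:
        m = CONNECTION_RE.search(line)
        if m and m.group("unauth"):
            counts[m.group("identity")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2015-10-16 20:17:29 +0300] information/ApiListener: New client "
        "connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)",
        "[2015-10-16 20:17:35 +0300] information/ApiListener: New client "
        "connection for identity 'cmdb4.odusb.so'",
        "[2015-10-16 20:17:39 +0300] information/ApiListener: New client "
        "connection for identity 'Sungurov-HP.odusb.so' (unauthenticated)",
    ]
    for identity, n in count_unauthenticated(sample).most_common():
        print(identity, n)
```

Feeding the whole log file to `count_unauthenticated` (for example via `open(path)`) gives a ranking of the identities that reconnect most often without authenticating, which is a quick way to find agents whose certificates no longer match the server.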

Thank you to everyone who tried to help solve the problem!

icinga-migration commented 8 years ago

Updated by mfriedrich on 2015-11-30 11:11:12 +00:00

Does that mean that the original problem is now solved and/or not reproducible anymore?

icinga-migration commented 8 years ago

Updated by sudv on 2015-12-01 03:00:29 +00:00

My problem is solved. The program (the function that establishes a connection) probably has a bug that causes memory leaks.

icinga-migration commented 8 years ago

Updated by mfriedrich on 2015-12-17 09:36:19 +00:00