Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Icinga2 Segmentation Fault #6520

Closed firatalkis closed 6 years ago

firatalkis commented 6 years ago

We are using Icinga2 version r2.9.1-1, running on a VM (Red Hat 7.5, Maipo). Our architecture has 1 master and 9 slave servers. The Icinga2 service installed on one of our slave servers crashes frequently. When we check messages.log, we see this pattern: SIGSEGV. We followed the gdb steps as described in the docs and got the results below. If anyone has the same issue, please share your comments.

icinga2.zip

icinga2.log is in the attachment.

GDB Output

[root@hostname cores]# gdb /usr/lib64/icinga2/sbin/icinga2 core.icinga2.36862
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/lib64/icinga2/sbin/icinga2...Reading symbols from /usr/lib64/icinga2/sbin/icinga2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 93781] [New LWP 93799] [New LWP 91795] [New LWP 94443] [New LWP 93751] [New LWP 94342]
[New LWP 93754] [New LWP 93757] [New LWP 94346] [New LWP 93752] [New LWP 36999] [New LWP 93770]
[New LWP 93800] [New LWP 93763] [New LWP 39430] [New LWP 93798] [New LWP 105815] [New LWP 36862]
[New LWP 105818] [New LWP 93766] [New LWP 105814] [New LWP 93765] [New LWP 61141] [New LWP 124709]
[New LWP 93724] [New LWP 124608] [New LWP 93756] [New LWP 93750] [New LWP 94306] [New LWP 105817]
[New LWP 94254] [New LWP 33679] [New LWP 93755] [New LWP 93764] [New LWP 93753] [New LWP 93801]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/er'.
Program terminated with signal 11, Segmentation fault.

#0 0x00002ba91e938d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
Missing separate debuginfos, use: debuginfo-install icinga2-bin-2.9.1-1.el7.icinga.x86_64
(gdb) bt
#0 0x00002ba91e938d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1 0x0000000000960225 in icinga::Comment::RemoveComment(icinga::String const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#2 0x00000000008a0cf6 in icinga::Checkable::RemoveCommentsByType(int) ()
#3 0x0000000000a10364 in icinga::Checkable::ProcessCheckResult(boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#4 0x0000000000a1f3d1 in icinga::ClusterEvents::CheckResultAPIHandler(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) ()
#5 0x000000000078f69f in std::_Function_handler<icinga::Value (boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&), icinga::Value (*)(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&)>::_M_invoke(std::_Any_data const&, boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) ()
#6 0x00000000009b4923 in icinga::JsonRpcConnection::MessageHandler(icinga::String const&) ()
#7 0x00000000009b54ab in icinga::JsonRpcConnection::MessageHandlerWrapper(icinga::String const&) ()
#8 0x000000000071f469 in icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) ()
#9 0x000000000073f0f7 in icinga::WorkQueue::WorkerThreadProc() ()
#10 0x00002ba91d18d27a in thread_proxy () from /lib64/libboost_thread-mt.so.1.53.0
#11 0x00002ba91f0a0dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00002ba91f3b3b3d in clone () from /lib64/libc.so.6
(gdb)

Crunsher commented 6 years ago

There should be a crashlog with the other logs, could you provide that one as well please?

firatalkis commented 6 years ago

No crash log was written to the crash/ directory. What can I do to create one? I couldn't find anything about that in the documentation.

Crunsher commented 6 years ago

Interesting, a crash log should always be written. If it's not, that's at least a hint ^_^

dnsmichi commented 6 years ago

For some reason, the check result processed here puts the checkable into a state of Recovery. This triggers the removal of the Acknowledgement.

For some reason, there are no comments associated with this acknowledgement. This would point to a broken cluster where one node has a broken API package and not all the comments loaded.

Still, it shouldn't crash just by that.

firatalkis commented 6 years ago

The problem continues. The latest gdb output is below. Do you have any other suggestions?

GDB Output

[root@cluster1 cores]# gdb /usr/lib64/icinga2/sbin/icinga2 core.icinga2.26932
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/lib64/icinga2/sbin/icinga2...Reading symbols from /usr/lib64/icinga2/sbin/icinga2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 26950] [New LWP 51291] [New LWP 51265] [New LWP 51922] [New LWP 51292] [New LWP 26951]
[New LWP 51280] [New LWP 51290] [New LWP 51287] [New LWP 51288] [New LWP 51299] [New LWP 51297]
[New LWP 51279] [New LWP 107174] [New LWP 51274] [New LWP 51834] [New LWP 107145] [New LWP 107195]
[New LWP 26949] [New LWP 51833] [New LWP 51298] [New LWP 57062] [New LWP 51276] [New LWP 51277]
[New LWP 51281] [New LWP 27345] [New LWP 51275] [New LWP 51296] [New LWP 51278] [New LWP 74163]
[New LWP 51818] [New LWP 51289] [New LWP 51850] [New LWP 26932] [New LWP 51256] [New LWP 107190]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/er'.
Program terminated with signal 11, Segmentation fault.

#0 0x00002b514cce7d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
Missing separate debuginfos, use: debuginfo-install icinga2-bin-2.9.1-1.el7.icinga.x86_64
(gdb) bt
#0 0x00002b514cce7d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1 0x0000000000960225 in icinga::Comment::RemoveComment(icinga::String const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#2 0x00000000008a0cf6 in icinga::Checkable::RemoveCommentsByType(int) ()
#3 0x0000000000a10364 in icinga::Checkable::ProcessCheckResult(boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#4 0x0000000000a16d0b in icinga::PluginCheckTask::ProcessFinishedHandler(boost::intrusive_ptr const&, boost::intrusive_ptr const&, icinga::Value const&, icinga::ProcessResult const&) ()
#5 0x0000000000806cfa in icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) ()
#6 0x00002b514b53c27a in thread_proxy () from /lib64/libboost_thread-mt.so.1.53.0
#7 0x00002b514d44fdd5 in start_thread () from /lib64/libpthread.so.0
#8 0x00002b514d762b3d in clone () from /lib64/libc.so.6
(gdb)

/var/log/messages

Aug 14 15:24:33 cluster1 kernel: [1914456.090348] icinga2[26950]: segfault at 48 ip 00002b514cce7d58 sp 00002b5152af5430 error 4 in libstdc++.so.6.0.19[2b514cc29000+e9000]
Aug 14 15:24:34 cluster1 systemd[1]: icinga2.service: main process exited, code=killed, status=11/SEGV
Aug 14 15:24:34 cluster1 systemd[1]: Unit icinga2.service entered failed state.
Aug 14 15:24:34 cluster1 systemd[1]: icinga2.service failed.

Crunsher commented 6 years ago

Looking at the code I'm uncertain how RemoveComment could fail in such a spectacular way. We are going to need a way to reproduce this.

ghost commented 6 years ago

Hello, we are getting a lot of segfaults since we upgraded from 2.8 to 2.9.1. All servers are CentOS 7.5, and all are having the same issues. Client nodes just randomly die. No crash logs or anything useful that we could find.

[7468808.154021] icinga2[21086]: segfault at 7ff9a44b2dc0 ip 00007ff9a120362c sp 00007ffc65ac92f0 error 4 in libc-2.17.so[7ff9a1183000+1c3000]

kernel 3.10.0-862.3.2.el7.x86_64

============== GENERAL INFORMATION ==============

Application version: r2.9.1-1
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

Enabled features: api mainlog
Disabled features: checker command compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb perfdata statusdata syslog

######################## checker is disabled, no checks can be run from this instance ########################

######################## debuglog is disabled, please activate it and rerun icinga2 ########################

============== OBJECT INFORMATION ==============

Checking object file from /var/cache/icinga2/icinga2.debug
Found the 248 objects:
Type : Count
ApiListener : 1
ApiUser : 1
CheckCommand : 238
Endpoint : 2
FileLogger : 1
IcingaApplication : 1
Zone : 4

The objects origins are:

/etc/icinga2/conf.d/api-users.conf
/etc/icinga2/conf.d/commands.conf
/etc/icinga2/features-enabled/api.conf
/etc/icinga2/features-enabled/mainlog.conf
/etc/icinga2/zones.conf
/usr/share/icinga2/include/command-icinga.conf
/usr/share/icinga2/include/command-nscp-local.conf
/usr/share/icinga2/include/command-plugins-manubulon.conf
/usr/share/icinga2/include/command-plugins.conf
/usr/share/icinga2/include/plugins-contrib.d/databases.conf
/usr/share/icinga2/include/plugins-contrib.d/hardware.conf
/usr/share/icinga2/include/plugins-contrib.d/icingacli.conf
/usr/share/icinga2/include/plugins-contrib.d/ipmi.conf
/usr/share/icinga2/include/plugins-contrib.d/logmanagement.conf
/usr/share/icinga2/include/plugins-contrib.d/metrics.conf
/usr/share/icinga2/include/plugins-contrib.d/network-components.conf
/usr/share/icinga2/include/plugins-contrib.d/network-services.conf
/usr/share/icinga2/include/plugins-contrib.d/operating-system.conf
/usr/share/icinga2/include/plugins-contrib.d/raid-controller.conf
/usr/share/icinga2/include/plugins-contrib.d/smart-attributes.conf
/usr/share/icinga2/include/plugins-contrib.d/storage.conf
/usr/share/icinga2/include/plugins-contrib.d/virtualization.conf
/usr/share/icinga2/include/plugins-contrib.d/vmware.conf
/usr/share/icinga2/include/plugins-contrib.d/web.conf

============== LOGS AND CRASH REPORTS ==============

Getting the last 20 lines of 1 FileLogger objects.
Logger main-log at path: /var/log/icinga2/icinga2.log
[begin: '/var/log/icinga2/icinga2.log' line: 0]
[end: '/var/log/icinga2/icinga2.log' line: 0]

######################## /var/log/icinga2/icinga2.log either does not exist or is empty ########################

No crash logs found in /var/log/icinga2/crash/

Crunsher commented 6 years ago

@fedepires Is there really no log at all? Since the reporter could not find any. And were you able to discern some kind of pattern for the crashes?

ghost commented 6 years ago

Nothing in the logs, we checked several times. No crash logs, and icinga2.log looks as usual and then just stops. There's no apparent pattern in the crashes, all nodes are mostly the same (same OS, same kernel, similar hardware and resources).

N-o-X commented 6 years ago

@firatalkis would it be possible to have a look at your configuration? Icinga should not log "Cannot create object .. already exists." that often. There might be something wrong in your cluster setup.

Also, is there a reason for creating multiple comments every second, especially on the agent?

dnsmichi commented 6 years ago

A full core dump of the crash would help in both cases.

firatalkis commented 6 years ago

@N-o-X You can find conf files in the attachment.

When we add an acknowledgement or comment, Icinga2 servers randomly fail with SIGSEGV.

confs.zip

dnsmichi commented 6 years ago

That won't work, as it is known that more than 2 endpoints in a zone create a loop with routing. In your case this would explain all these log entries and crashes later on.

object Zone "checker" {
  endpoints = [ "hosntame3", "hosntame4", "hosntame5", "hosntame6","hosntame2", "hosntame7", "hosntame1", "hosntame8", "hosntame9" ]
  parent = "master"
}
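
A minimal sketch of one way to stay within that limit, assuming the nine endpoints are split across several satellite zones with at most two endpoints each. The zone names `checker-a` and `checker-b` and the `satellite-*` endpoint names are hypothetical placeholders, not names from the attached configuration, and the `master` zone is assumed to be defined as before:

```
// Hypothetical layout: names below are illustrative placeholders only.
// Connection attributes (host, port) are omitted from the Endpoint objects.
object Endpoint "satellite-a1" { }
object Endpoint "satellite-a2" { }
object Endpoint "satellite-b1" { }
object Endpoint "satellite-b2" { }

// Each zone lists at most two endpoints, so no routing loop
// can form between the members of a zone.
object Zone "checker-a" {
  endpoints = [ "satellite-a1", "satellite-a2" ]
  parent = "master"
}

object Zone "checker-b" {
  endpoints = [ "satellite-b1", "satellite-b2" ]
  parent = "master"
}
```

With a layout like this, each host and service object would have to be placed in exactly one of the zones, for example through its directory under zones.d, so the check load is spread by zone rather than across one large zone.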

firatalkis commented 6 years ago

Hi @dnsmichi,

Is it possible for 2 endpoints to handle the workload of 7600 servers and 14500 service checks? The system checks (CPU, memory, storage) have a check interval of 2m.

dnsmichi commented 6 years ago

When you throw enough resources onto it, sure. We don't know the specs unfortunately.

dnsmichi commented 6 years ago

Closing since it is a known problem with #3533

firatalkis commented 6 years ago

Thanks @dnsmichi, our problem has been solved. After we set two endpoints per zone in the zone file, the icinga2 service no longer crashed and Icinga2 is running more stably.

ghost commented 6 years ago

For the record, we are still seeing this on 2.9.2 and we don't have multiple endpoints in a zone anywhere. We will soon test whether this still happens on 2.10.0.