NagiosEnterprises / nagioscore

Nagios Core

nagios core crash after sending multiple commands via check_mk #511

Open josdemosselman opened 6 years ago

josdemosselman commented 6 years ago

Hi all,

We run Nagios 4.3.4 on a CentOS 7.4 system with 8 GB RAM and 4 vCPUs. We use livestatus to send all alarms/statuses to another server that runs check_mk (frontend/backend setup). We only use livestatus commands and use check_mk purely as a view; we do not use check_mk as a monitoring instance that runs checks against the server park. From this front-end server we launch custom actions via livestatus to Nagios, such as acknowledging multiple services at once, forcing a recheck of certain results, adding comments, etc.

We cannot reproduce this error at will, but it happens from time to time: sometimes when acknowledging 10 alarms, sometimes with 100. It varies, but it always happens when applying multiple actions at once.
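
For illustration, an acknowledge sent this way boils down to one external command line written to the livestatus socket, roughly like the sketch below (the socket path, host and service names are placeholders, not our real configuration):

#!/usr/bin/env python
# Minimal sketch: send one external command through livestatus, which forwards
# it to the Nagios command pipe. Socket path and object names are placeholders.
import socket
import time

LIVE_SOCKET = "/data/nagios/var/rw/live"  # assumed location of the livestatus socket

def send_command(cmd):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(LIVE_SOCKET)
    s.sendall(("COMMAND [%d] %s\n" % (int(time.time()), cmd)).encode())
    s.close()

# Acknowledge one service problem: sticky=2, notify=1, persistent=1.
send_command("ACKNOWLEDGE_SVC_PROBLEM;web01;HTTP;2;1;1;operator;acknowledged from frontend")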

If needed, a core dump can be delivered.

Nagios is built via: ./configure --prefix=/data/nagios --exec-prefix=/opt/nagios-$version --datarootdir=/opt/nagios-$version/share --with-command-group=nagios

Fault code: Program terminated with signal 11, Segmentation fault.

Backtrace:

Thread 10 (Thread 0x7f63e9c75740 (LWP 2225)):
#0  0x00007f63e9292923 in epoll_wait () from /lib64/libc.so.6
No symbol table info available.
#1  0x000000000048f868 in iobroker_poll (iobs=0x716070, timeout=1) at iobroker.c:337
        i = 0
        nfds = 0
        ret = 0
#2  0x00000000004385a4 in event_execution_loop () at events.c:1081
        now = {tv_sec = 1527079644, tv_usec = 583959}
        event_runtime = 0x3fea830
        inputs = 0
        temp_event = 0x2ef4de0
        last_event = 0x2ef4de0
        last_time = 1527079644
        current_time = 1527079644
        last_status_update = 1527079640
        poll_time_ms = 1
#3  0x0000000000414546 in main (argc=3, argv=0x7ffcdf4a4578) at nagios.c:815
        result = 0
        error = 0
        display_license = 0
        display_help = 0
        c = -1
        tm = 0x7ffcdf4a4420
        tm_s = {tm_sec = 53, tm_min = 14, tm_hour = 12, tm_mday = 23, tm_mon = 4, tm_year = 118, tm_wday = 3, tm_yday = 142, tm_isdst = 1, tm_gmtoff = 7200, tm_zone = 0x71ccc0 "CEST"}
        now = 1527070493
        datestring = "Wed May 23 12:14:53 CEST 2018", '\000' <repeats 123 times>, "hj\310\351c\177\000\000\340CJ\337\374\177\000\000\320CJ\337\374\177\000\000&\260be\000\000\000\000\347z1\351c\177\000\000E\020\260\000\000\000\000\000"...
        mac = 0x6ca460 <global_macros>
        worker_socket = 0x0
        i = 8
        sig_action = {__sigaction_handler = {sa_handler = 0x44e281 <handle_sigxfsz>, sa_sigaction = 0x44e281 <handle_sigxfsz>}, sa_mask = {__val = {18446744067267100671, 18446744073709551615 <repeats 15 times>}}, sa_flags = 1342177280, sa_restorer = 0x0}
        option_index = 0
        long_options = {{name = 0x497f80 "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x497f85 "version", has_arg = 0, flag = 0x0, val = 86}, {name = 0x497f8d "license", has_arg = 0, flag = 0x0, val = 86}, {name = 0x497f95 "verify-config", has_arg = 0, flag = 0x0, val = 118}, {name = 0x497fa3 "daemon", 
            has_arg = 0, flag = 0x0, val = 100}, {name = 0x497faa "test-scheduling", has_arg = 0, flag = 0x0, val = 115}, {name = 0x497fba "precache-objects", has_arg = 0, flag = 0x0, val = 112}, {name = 0x497fcb "use-precached-objects", has_arg = 0, flag = 0x0, val = 117}, {name = 0x497fe1 "enable-timing-point", 
            has_arg = 0, flag = 0x0, val = 84}, {name = 0x497ff5 "worker", has_arg = 1, flag = 0x0, val = 87}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
hedenface commented 6 years ago

So you don't do any manual acknowledgements? Always through livestatus?

Can you possibly switch behavior and manually acknowledge and see if it continues to happen?

What was the version of Core you used prior to 4.3.4? Did this issue occur then as well?

ghen2 commented 6 years ago

We send (bulk) commands via livestatus. It's always reproducible with a very large number of commands (1000+), and just occasionally with a reasonably low number (10+).

We tried sending bulk commands directly with a for loop piped to nagios.cmd, but that way we cannot reproduce the crash. Do you have another suggestion for reproducing what livestatus does, or other ways to test?

This crash has happened since we upgraded from an old CentOS 5 server running Nagios 3.2.1 to a new CentOS 7 server running Nagios 4.2, later upgraded to 4.3, so the issue is not new in 4.3.4.

hedenface commented 6 years ago

Do you have any debug log info that is helpful from during a crash?

For internal testing here, I usually have a server configured with a lot of checks that check the contents of a file. They return the contents of the file as their state - so if it's 0 they all return OK, and if it's 2 they all return CRITICAL. This lets me simulate a system-wide outage on a whim.
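
A check like that can be as simple as the sketch below (the file path and output text are placeholders for however you set it up):

#!/usr/bin/env python
# Sketch of a "state file" check: the plugin's exit code is whatever digit the
# file contains (0=OK, 1=WARNING, 2=CRITICAL), so editing one file flips every
# check that uses it. Path and output text are placeholders.
import sys

STATE_FILE = "/tmp/forced_state"
LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

try:
    with open(STATE_FILE) as f:
        state = int(f.read().strip())
except (IOError, ValueError):
    state = 3  # missing or malformed file -> UNKNOWN
if state not in LABELS:
    state = 3

print("FORCED %s - state file contains %d" % (LABELS[state], state))
sys.exit(state)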

Are you able to configure something like this and do some forced testing while gathering debug log info from Core? I assume livestatus has some trace/debug messages you can enable as well?

ghen2 commented 6 years ago

The actual check result (OK/WARN/CRIT) does not matter (we have thousands of checks and plenty of failures :)). The core only crashes occasionally when we send commands such as acknowledge or schedule downtime in bulk; it is easily reproducible when sending more than 1000 commands at once.
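
To give an idea, the bulk case is roughly the sketch below: fire off downtime commands for many services in one burst over the livestatus socket (socket path and object names are placeholders, not our real configuration):

#!/usr/bin/env python
# Rough reproduction sketch: schedule a short fixed downtime for many services
# in one burst via livestatus. Socket path and object names are placeholders.
import socket
import time

LIVE_SOCKET = "/data/nagios/var/rw/live"  # assumed livestatus socket path

def send_command(cmd):
    # One connection per command; livestatus forwards each line to Nagios.
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(LIVE_SOCKET)
    s.sendall(("COMMAND [%d] %s\n" % (int(time.time()), cmd)).encode())
    s.close()

now = int(time.time())
for i in range(1200):  # >1000 commands in one burst is enough to trigger the crash for us
    send_command("SCHEDULE_SVC_DOWNTIME;host%04d;Dummy Check;%d;%d;1;0;3600;tester;bulk test"
                 % (i, now, now + 3600))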

We will enable debug logging, but we prefer not to expose the log file here. Can we send/upload it somewhere?

hedenface commented 6 years ago

No, I understand that - but a service needs to be in some non-OK state first in order to acknowledge it :)

I didn't see before that scheduling downtime in bulk also causes an issue. I'm guessing there was some change to Core that livestatus perhaps didn't keep up with.

Absolutely. You can send it to me directly at bheden@nagios.com.

marnovdm commented 6 years ago

FYI: we ran into this as well on our setup, and changing from livestatus commands to the direct Nagios command pipe seems to have solved it (we use livestatus.py to integrate it into our codebase, so this was an easy modification to make there). We had the issue on Nagios 3.5.1 / Check_MK 1.4 and on Nagios 4.3.2/4.3.4 with Check_MK 1.2.8. It is easily reproducible by submitting a lot of scheduled downtime commands. We haven't been able to crash the Nagios core since switching to direct command pipe use. Hope this helps.
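
For reference, the change amounts to writing the external command lines straight to the Nagios command pipe instead of over the livestatus socket, roughly like this sketch (the pipe path is a placeholder for whatever command_file points to in your nagios.cfg):

#!/usr/bin/env python
# Sketch of the workaround: write external commands directly to the Nagios
# command pipe (the FIFO configured as command_file) instead of via livestatus.
import time

CMD_PIPE = "/data/nagios/var/rw/nagios.cmd"  # placeholder; use your command_file path

def submit(cmd):
    line = "[%d] %s\n" % (int(time.time()), cmd)
    # Opening the FIFO blocks until Nagios (the reader) is running.
    with open(CMD_PIPE, "w") as pipe:
        pipe.write(line)

now = int(time.time())
submit("SCHEDULE_SVC_DOWNTIME;web01;HTTP;%d;%d;1;0;3600;tester;maintenance" % (now, now + 3600))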

rverchere commented 5 years ago

Hi, is this related: https://github.com/NagiosEnterprises/nagioscore/issues/391#issuecomment-319336576 ?

ghen2 commented 5 years ago

It looks similar indeed. But we can still reproduce our crash with this Store.cc patch applied.

AllwynPradip commented 4 years ago

Any update on this?