josdemosselman opened 6 years ago
So you don't do any manual acknowledgements? Always through livestatus?
Can you possibly switch behavior and manually acknowledge and see if it continues to happen?
What was the version of Core you used prior to 4.3.4? Did this issue occur then as well?
We send (bulk) commands via livestatus. It's always reproducible with a very large number of commands (1000+), and just occasionally with a reasonably low number (10+).
We tried sending bulk commands directly with a for loop piped to nagios.cmd, but we cannot reproduce the crash that way. Do you have another suggestion on how to reproduce what livestatus does, or other ways to test?
This crash has happened since we upgraded from an old CentOS5 server with nagios 3.2.1 to a new CentOS7 server with nagios 4.2, later upgraded to 4.3. So the issue is not new in 4.3.4.
Do you have any debug log info that is helpful from during a crash?
For internal testing here, I usually have a server configured with a lot of checks that check the contents of a file. They return the contents of the file, so if it's 0 they all return OK, but if it's 2 they all return CRITICAL. This allows me to simulate a system-wide outage on a whim.
Are you able to configure something like this and do some forced testing? While gathering debug log info from Core? I assume livestatus has some trace/debugging messages as well that you can enable?
The actual check result (OK/WARN/CRIT) does not matter (we have thousands of checks and sufficient failures :)). The crash only occurs when we send commands such as acknowledge or schedule downtime in bulk; it is easily reproducible when sending >1000 commands at once.
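For anyone trying to reproduce this, a sketch of the bulk-command path through livestatus. The socket path is an assumption (it is whatever the broker_module line in nagios.cfg sets); the `COMMAND [timestamp] ...` line format is the documented livestatus protocol, and commands get no response:

```python
import socket
import time

# Livestatus unix socket; an assumption -- taken from the
# broker_module line in nagios.cfg on this setup.
SOCKET_PATH = "/data/nagios/var/rw/live"

def livestatus_command(cmd):
    """Wrap an external command in a livestatus COMMAND line."""
    return "COMMAND [%d] %s\n" % (int(time.time()), cmd)

def send_commands(cmds, socket_path=SOCKET_PATH):
    # One connection per command here; whether the bulk goes over one
    # connection or many is itself a variable worth testing when
    # chasing this crash.
    for cmd in cmds:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.connect(socket_path)
        s.sendall(livestatus_command(cmd).encode())
        s.close()
```

Unlike writing to nagios.cmd, this path goes through the livestatus broker module inside the Core process, which is presumably where the difference in crash behaviour comes from.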
We will enable debug logging, but we prefer not to expose the logfile here. Can we send/upload it somewhere?
No I understand that - but it needs to be in some non-OK state first in order to acknowledge :)
I didn't see before that scheduling downtime in bulk also causes an issue. I'm guessing there was some change to Core, and livestatus didn't keep up with it perhaps.
Absolutely. You can send it to me directly bheden@nagios.com.
FYI: We ran into this as well on our setup and changing from livestatus commands to direct nagios command pipe seems to have solved our issue (we use livestatus.py to integrate it into our codebase so this was an easy modification to make there). We had the issue on Nagios 3.5.1/Check_MK1.4 and on Nagios 4.3.2/4.3.4 with Check_MK 1.2.8. Easily reproducible by submitting a lot of scheduled downtime commands. Haven't been able to crash the Nagios core after switching to direct command pipe use. Hope this helps.
Hi, is this related: https://github.com/NagiosEnterprises/nagioscore/issues/391#issuecomment-319336576 ?
It looks similar indeed. But we can still reproduce our crash with this Store.cc patch applied.
any update on this?
Hi all, we run nagios 4.3.4 on a CentOS 7.4 system with 8 GB RAM and 4 vCPUs. We use livestatus to send all alarms/statuses to another server running check_mk (a frontend/backend setup). We only use livestatus commands and use check_mk purely as a view; we do not use check_mk as a monitoring instance that runs checks against the server park. From this front-end server we launch custom actions via livestatus to nagios: acknowledging multiple services at once, forcing a recheck of certain results, adding comments, etc. We cannot reproduce this error on demand, but it happens from time to time: sometimes when acknowledging 10 alarms, sometimes 100. It varies, but it always happens when applying multiple actions at once...
If needed, a core dump can be delivered.
Nagios is built via: ./configure --prefix=/data/nagios --exec-prefix=/opt/nagios-$version --datarootdir=/opt/nagios-$version/share --with-command-group=nagios
Fault code: Program terminated with signal 11, Segmentation fault.
Backtrace: