Closed nook24 closed 6 years ago
So my broker don't use threads, but I guess this issue is related to internal Nagios threads and is just a timing issue.
Nagios does not use threads.
ModGearman does, though.
What order are those two broker modules called in your nagios.cfg? Try changing the actual order of them and then seeing if that fixes your problem.
I'm going to close this issue. That said, if you're able to reproduce the problem with a Nagios Core running no additional broker modules other than yours, I'd love to see it and help.
Good luck!
Hi @hedenface,
it's only my broker model loaded to Nagios Core. This has nothing to do with ModGearman. ModGearman is even not installed^^
My broker just make use of the gearman-job-server, to store nagios events as json objects. (like mysql gearman is just a "database")
OOOhhhh.
My bad - I saw gearman and nagios and assumed you were using Mod Gearman. My apologies.
Okay, well, quite frankly, your neb module is rather large and I'm not going to debug it. You'll need to strip it to the minimum amount of code required to still cause the issue, and then I'll take a look at it.
Ok, I will remove everything, except NEBCALLBACK_LOG_DATA
, and come back to you, if the module is still running into this issue.
@hedenface I created a stripped down version of the broker, which only will save Nagios logentries to the gearman-job-server.
You can build the broker using this command:
LANG=C gcc -DNAGIOS -shared -o "statusengine.o" -fPIC -Wall -Werror statusengine.c -luuid -levent -lgearman -ljson-c
Same issue at Nagios restart.
root@statusengine16:/opt/nagios# systemctl restart nagios
Systemd also stuck while doing the restart, as you can see at the timestamps:
[1527974975] Caught SIGTERM, shutting down...
[1527975065] Nagios 4.3.4 starting... (PID=5997)
90 seconds
Another shot:
root@statusengine16:/opt/nagios# time systemctl restart nagios
real 1m30.231s
user 0m0.000s
sys 0m0.000s
root@statusengine16:/opt/nagios#
(Fresh installation of Nagios 4.3.4 with default configuration
nagios.log reports:
[1527974975] Caught SIGTERM, shutting down...
[1527974975] Caught SIGTERM, shutting down...
[1527974975] Successfully shutdown... (PID=5959)
[1527974975] Event broker module 'NERD' deinitialized successfully.
[1527974975] [Statusengine] Deregister callbacks
[1527974975] [Statusengine] We are done here
[1527974975] [Statusengine] Bye
[1527974975] Event broker module '/opt/statusengine_nagios.o' deinitialized successfully.
[1527974975] Caught SIGTERM, shutting down...
[1527974975] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830
[1527975065] Nagios 4.3.4 starting... (PID=5997)
Second try:
[1527975488] Caught SIGTERM, shutting down...
[1527975488] Caught SIGTERM, shutting down...
[1527975488] Successfully shutdown... (PID=6061)
[1527975488] Event broker module 'NERD' deinitialized successfully.
[1527975488] [Statusengine] Deregister callbacks
[1527975488] [Statusengine] We are done here
[1527975488] [Statusengine] Bye
[1527975488] Event broker module '/opt/statusengine_nagios.o' deinitialized successfully.
[1527975488] Caught SIGTERM, shutting down...
[1527975488] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830
[1527975578] Nagios 4.3.4 starting... (PID=6226)
Here is the code: https://gist.github.com/nook24/1474fed0add839e95651094de4090bcc I used this commands to setup my Nagios: https://statusengine.org/tutorials/install-nagios4/ This is my nagios.cfg: https://gist.github.com/nook24/d1a42fa03d5861345945ac7fdbb6110a Tested on Ubuntu 16.04.
Many thank for your help :)
Ooooh. Actually, this looks familiar. A while back, I was writing a neb module that used gearman as well. Something along the lines of ModGearman, but more Nagios-y. It never saw the light of day, because we decided to work on that functionality natively to core, but I got a lot of the broken pipes for a lot of different reasons.
I'll keep you updated.
Also, your instructions are a bit off. If you place this in module/statusengine.c
you need to either adjust the #include "lib/libnagios.h"
or just compile with an -I..
(which is what I did).
I'm not having the issues you described with this. How can I reproduce this issue?
Also, I would use the logit
function instead of the write_to_all_logs
function - because if you pass the second argument to logit as TRUE
, then debugging becomes much easier. You can just run the command directly and don't need to worry about tailing log files.
What version of libgearman are you using? I had this issue a lot until I had upgraded. I wish I remembered the specific details.
On my system, this NEB module is running fine.
bryan@Lappy686 ~ $ dpkg -s libgearman-dev
Package: libgearman-dev
Status: install ok installed
Priority: extra
Section: libdevel
Installed-Size: 447
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libgearman7 (= 1.0.6-5.1build2), libevent-dev
Description: Development files for the Gearman Library
Gearman is a system to farm out work to other machines, dispatching function
calls to machines that are better suited to do work, to do work in parallel,
to load balance lots of function calls, or to call functions between languages.
.
This package contains the development headers for the libgearman library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
bryan@Lappy686 ~ $ dpkg -s libgearman7
Package: libgearman7
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 170
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: same
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libc6 (>= 2.14), libgcc1 (>= 1:3.0), libstdc++6 (>= 4.1.1), libuuid1 (>= 2.20.1)
Description: Library providing Gearman client and worker functions
Gearman is a system to farm out work to other machines, dispatching function
calls to machines that are better suited to do work, to do work in parallel,
to load balance lots of function calls, or to call functions between languages.
.
This package contains the C reimplementation of the client library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
bryan@Lappy686 ~ $
Thanks for your investigations and time, i really appreciate this a lot!
First of all, I have the same version of libgearman as you have on your test system:
[20:45]root@statusengine16:/etc/init.d# dpkg -s libgearman-dev
Package: libgearman-dev
Status: install ok installed
Priority: extra
Section: libdevel
Installed-Size: 447
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libgearman7 (= 1.0.6-5.1build2), libevent-dev
Description: Development files for the Gearman Library
Gearman is a system to farm out work to other machines, dispatching function
calls to machines that are better suited to do work, to do work in parallel,
to load balance lots of function calls, or to call functions between languages.
.
This package contains the development headers for the libgearman library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
[20:46]root@statusengine16:/etc/init.d# dpkg -s libgearman7
Package: libgearman7
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 170
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: same
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libc6 (>= 2.14), libgcc1 (>= 1:3.0), libstdc++6 (>= 4.1.1), libuuid1 (>= 2.20.1)
Description: Library providing Gearman client and worker functions
Gearman is a system to farm out work to other machines, dispatching function
calls to machines that are better suited to do work, to do work in parallel,
to load balance lots of function calls, or to call functions between languages.
.
This package contains the C reimplementation of the client library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
[20:46]root@statusengine16:/etc/init.d#
Have you tested restarting Nagios Core through systemd or just send a SIGTERM via kill?
I replaced write_to_all_logs
with logit
and noticed the following:
If I try to restart nagios through systemd, I run into the descriptive issue:
[20:52]root@statusengine16:/etc/init.d# time systemctl restart nagios
real 1m30.189s
user 0m0.000s
sys 0m0.000s
from nagios.log:
[1528051952] Caught SIGTERM, shutting down...
[1528051952] Caught SIGTERM, shutting down...
[1528051952] wproc: Socket to worker Core Worker 12154 broken, removing
[1528051952] wproc: Socket to worker Core Worker 12156 broken, removing
[1528051952] wproc: Socket to worker Core Worker 12157 broken, removing
[1528051952] wproc: Socket to worker Core Worker 12155 broken, removing
[1528051952] Successfully shutdown... (PID=12152)
[1528051952] Event broker module 'NERD' deinitialized successfully.
[1528051952] [Statusengine] Deregister callbacks
[1528051952] [Statusengine] We are done here
[1528051952] [Statusengine] Bye
[1528051952] Event broker module '/opt/statusengine_nagios.o' deinitialized successfully.
[1528051952] Caught SIGTERM, shutting down...
[1528051952] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830
[1528052042] Nagios 4.3.4 starting... (PID=12346)
As you can see, Nagios catches 3 times a SIGTERM
If i just kill nagios through kill
, I can not reproduce the issue:
kill 12653
Nagios stdout:
[20:55]root@statusengine16:/tmp/module/src# sudo -u nagios /bin/bash
bash: /root/.bashrc: Keine Berechtigung
nagios@statusengine16:/tmp/module/src$ /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg
Nagios Core 4.3.4
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2017-08-24
License: GPL
Website: https://www.nagios.org
Nagios 4.3.4 starting... (PID=12653)
Local time is Sun Jun 03 20:56:31 CEST 2018
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Channel opathchecks registered successfully
nerd: Fully initialized and ready to rock!
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 12657;pid=12657
wproc: Registry request: name=Core Worker 12656;pid=12656
wproc: Registry request: name=Core Worker 12655;pid=12655
wproc: Registry request: name=Core Worker 12654;pid=12654
Statusengine - the missing event broker
Statusengine - the missing event broker
[Statusengine] Copyright (c) 2014 - present Daniel Ziegler <daniel@statusengine.org>
[Statusengine] Please visit https://www.statusengine.org for more information
[Statusengine] Contribute to Statusenigne at: https://github.com/nook24/statusengine
[Statusengine] Thanks for using Statusengine :-)
[Statusengine] Register callbacks
Event broker module '/opt/statusengine_nagios.o' initialized successfully.
Successfully launched command file worker with pid 12658
Caught SIGTERM, shutting down...
Successfully shutdown... (PID=12653)
[Statusengine] Deregister callbacks
[Statusengine] We are done here
[Statusengine] Bye
nagios@statusengine16:/tmp/module/src$
So maybe this is not a Nagios related issue itself, and my systemd config is buggy?
Unfortunately I didn't found a Nagios systemd config, so i modified the config file of the Naemon Project. (With Naemon i don't run into this issue by the way)
This is my /lib/systemd/system/nagios.service
[Unit]
Description=Nagios network monitor
After=network.target gearman-job-server.service
[Service]
Type=forking
PIDFile=/opt/nagios/var/nagios.lock
ExecStartPre=/opt/nagios/bin/nagios --verify-config /opt/nagios/etc/nagios.cfg
ExecStart=/opt/nagios/bin/nagios --daemon /opt/nagios/etc/nagios.cfg
ExecReload=/bin/kill -HUP $MAINPID
User=nagios
Group=nagios
StandardOutput=journal
StandardError=inherit
[Install]
WantedBy=multi-user.target
Can you please past me your complete gcc
command? I tried to use -I
but my C skills are not available and I was not able to compile the broker, if i set -I and remov all nagios includes...
//Edit The issue only occurs, if the broker exports log data. If I disable log data and only export status data, notifications, acks, host checks etc.. the issue is gone
Updated gist with logit
function:
https://gist.github.com/nook24/1474fed0add839e95651094de4090bcc
Take systemd out of the equation.
Run the following:
/usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
And run it in the terminal. Then, in another terminal, run the following:
tail -f /usr/local/nagios/var/nagios.log
So, put your statusengine.c
file in the module/
directory, and then run the following command:
gcc -DNAGIOS -shared -o "statusengine.o" -fPIC -I.. -Wall -Werror statusengine.c -luuid -levent -lgearman -ljson-c
I also saw the 3 messages. I was sending a kill -s TERM $pid
in another terminal to get it to shut down gracefully. Even with a ctrl+c, or the TERM, I was unable to see the error. Also though, my config is 1 host and 4 services - so maybe I just need them to be doing things. I did submit some passive check results, but that didn't help.
Take a look here for inspiration: https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L36
Edit: I'm dumb. I forgot you set up the gist because I told you to strip it down :)
Jep, the project is available at GitHub: https://github.com/statusengine and https://statusengine.org/ if you are interested in more information.
Can you please check, if you run into this issue if you use systemd to restart your nagios core? Because i was also not able to reproduce the issue, if I just send kill to nagios core in a second terminal... This is a bit weird, because systemd should also only send a SIGTERM on restart.
How every, i will create a short bash script, this evening, which hopefully will build a system where the issue occurs. Than you can try this on a plain Ubuntu VM, LXC Container or DO Droplet.
I will also try to add the signal(SIGPIPE, SIG_IGN);
. At my first try some constants or what every where missing, need to check what i need to include... (I am really not a C guy )
I crated a bash script, which install Nagios 4.3.4 and the stripped Statusengine broker. I tested this script on a blank Ubuntu 16.04 digitalocean Droplet, but it should work on every blank Ubuntu 16.04 box: https://gist.github.com/nook24/a5b6ae30f56d4f6187837d85382e50ef
I also noticed, that I can not reproduce the issue, if I run Nagios in foreground and kill it in a second terminal. Also if I run bin/nagios -d etc/nagios.cfg
kill it with a second terminal, everything works.
But: If if try to restart or stop nagios via systemd, things get smashed up. I also recorded my terminal for you, so you can see what I've done so far: https://asciinema.org/a/poIzLG9oLL1mDD7ovrVNXyKJD
The new broker version with signal()
: https://gist.github.com/nook24/1474fed0add839e95651094de4090bcc
As already mentioned, this only happens, if the broker export nagios logentires. If I disable logentries in the broker module, everything works fine - I have no idea...
Weird. I watched your video. Thanks for that.
What does the service file look like? You compiled core from source from the latest master branch? Or are you using a package?
I build Nagios from the source tar.gz:
wget https://github.com/NagiosEnterprises/nagioscore/releases/download/nagios-4.3.4/nagios-4.3.4.tar.gz
tar xfv nagios-4.3.4.tar.gz
cd nagios-4.3.4/
./configure --prefix=/opt/nagios --with-nagios-user=nagios --with-nagios-group=nagios
mkdir /opt/nagios
make all
make install
make install-commandmode
make install-config
https://gist.github.com/nook24/a5b6ae30f56d4f6187837d85382e50ef#file-nagios_deploy-sh-L22-L32
This is my systemd service file:
[Unit]
Description=Nagios network monitor
After=network.target
[Service]
Type=forking
PIDFile=/opt/nagios/var/nagios.lock
ExecStartPre=/opt/nagios/bin/nagios --verify-config /opt/nagios/etc/nagios.cfg
ExecStart=/opt/nagios/bin/nagios --daemon /opt/nagios/etc/nagios.cfg
ExecReload=/bin/kill -HUP \$MAINPID
User=nagios
Group=nagios
StandardOutput=journal
StandardError=inherit
[Install]
WantedBy=multi-user.target
https://gist.github.com/nook24/a5b6ae30f56d4f6187837d85382e50ef#file-nagios_deploy-sh-L145-L161
I noticed this issue first with Nagios 4.0.8. But i guess 4.0.8 was the first version I started via systemd....
Please use the current master branch (upcoming 4.4.0 release) and see what happens then. Make sure to overwrite existing systemd file.
@hedenface nice, looks like the issue is gone with Nagios Core 4.4.0
and the default systemd configuration out of the repo:
[Unit]
Description=Nagios Core 4.4.0
Documentation=https://www.nagios.org/documentation
After=network.target local-fs.target
[Service]
Type=forking
ExecStartPre=/opt/nagios/bin/nagios -v /opt/nagios/etc/nagios.cfg
ExecStart=/opt/nagios/bin/nagios -d /opt/nagios/etc/nagios.cfg
ExecStop=/bin/kill -s TERM ${MAINPID}
ExecStopPost=/bin/rm -f /opt/nagios/var/rw/nagios.cmd
ExecReload=/bin/kill -s HUP ${MAINPID}
[Install]
WantedBy=multi-user.target
I will update my documentation https://github.com/statusengine/docs/issues/3 and close this issue.
Many thanks for your help!
Excellent. Glad to help.
Coming up for Core 5, I plan on doing some work for the NEB API and then also documenting it. Would you be willing to help with the documentation whenever that starts?
A documentation is definitely a good idea! Most of the time I had no clue of what I was doing :)
Drop me a mail when you like to start the documentation and I will see if I can help you. https://github.com/statusengine/module/blob/master/src/statusengine.c#L15
Is there a roadmap for Nagios 5 available?
Hi, I noticed an issue with my Nagios Event Broker Module, but I'm not able to trace this back to find the root cause of my issue. So may be it is not a Nagios issue and I just misunderstood something.
My Broker Module fetch every Nagios event, and export it to a Gearman Queue as json objects. I noticed that if I try to restart Nagios Core, i get the following message to the nagios.log:
If the nebmodule_deinit method of the broker gets called, it will write
[Statusengine] Deregister callbacks
to the nagios.log.It looks like, that after deinit was called, Nagios still call the modules callback to export other log entries. Which should be the reason for the broken pipe message.
If I
strace
the Nagios process, I can see, that the Callback of the broker try to export nagios log messages:And run into a broken pipe, because nebmodule_deinit was already executed.
So my broker don't use threads, but I guess this issue is related to internal Nagios threads and is just a timing issue. Thread 1 call deinit of the broker Thread 2 call the broker callbacks to process log entries
This happens on Nagios 4.0.8 and 4.3.4.
May be you have an idea, of how I could prevent this issue. I also tried something like this, but this was not helping at all:
The Broker Code is available here: https://github.com/statusengine/module/blob/master/src/statusengine.c