NagiosEnterprises / nagioscore

Nagios Core
GNU General Public License v2.0
1.56k stars 449 forks source link

Nagios Core call NEB method after nebmodule_deinit() was executed #442

Closed nook24 closed 6 years ago

nook24 commented 7 years ago

Hi, I noticed an issue with my Nagios Event Broker Module, but I'm not able to trace this back to find the root cause of my issue. So may be it is not a Nagios issue and I just misunderstood something.

My Broker Module fetch every Nagios event, and export it to a Gearman Queue as json objects. I noticed that if I try to restart Nagios Core, i get the following message to the nagios.log:

[1508766761] Caught SIGTERM, shutting down...
[1508766761] Caught SIGTERM, shutting down...
[1508766761] wproc: Socket to worker Core Worker 31996 broken, removing
[1508766761] wproc: Socket to worker Core Worker 31993 broken, removing
[1508766761] wproc: Socket to worker Core Worker 31995 broken, removing
[1508766761] wproc: Socket to worker Core Worker 31994 broken, removing
[1508766761] Successfully shutdown... (PID=31990)
[1508766761] Event broker module 'NERD' deinitialized successfully.
[1508766761] [Statusengine] Deregister callbacks
[1508766761] [Statusengine] We are done here
[1508766761] [Statusengine] Bye
[1508766761] Event broker module '/opt/nagios_420/statusengine-4-1-0.o' deinitialized successfully.
[1508766761] Caught SIGTERM, shutting down...
[1508766761] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830

If the nebmodule_deinit method of the broker gets called, it will write [Statusengine] Deregister callbacks to the nagios.log.

It looks like, that after deinit was called, Nagios still call the modules callback to export other log entries. Which should be the reason for the broken pipe message.

If I strace the Nagios process, I can see, that the Callback of the broker try to export nagios log messages:

setsockopt(10, SOL_SOCKET, SO_RCVBUF, [32768], 4) = 0
connect(10, {sa_family=AF_INET, sin_port=htons(4730), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=10, events=POLLOUT}], 1, -1)  = 1 ([{fd=10, revents=POLLOUT}])
sendto(10, "\0REQ\0\0\0\22\0\0\1\36statusngin_logentrie"..., 298, MSG_NOSIGNAL, NULL, 0) = 298
poll([{fd=10, events=POLLIN}], 1, -1)   = 1 ([{fd=10, revents=POLLIN}])
recvfrom(10, "\0RES\0\0\0\10\0\0\0\35H:dziegler-master-xe"..., 8192, MSG_NOSIGNAL, NULL, NULL) = 41
recvfrom(10, 0x16ae8a1, 8192, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)

And run into a broken pipe, because nebmodule_deinit was already executed.

So my broker don't use threads, but I guess this issue is related to internal Nagios threads and is just a timing issue. Thread 1 call deinit of the broker Thread 2 call the broker callbacks to process log entries

This happens on Nagios 4.0.8 and 4.3.4.

May be you have an idea, of how I could prevent this issue. I also tried something like this, but this was not helping at all:

int init = 0;

int nebmodule_init(int flags, char *args, nebmodule *handle){
    init = 1;
    //Rest of the code
}

int nebmodule_deinit(int flags, int reason){
    init = 0;
    //Rest of the code
}

int statusengine_handle_data(int event_type, void *data){
    if(init == 0){
        return 0;
    }
    //Rest of the code
}

The Broker Code is available here: https://github.com/statusengine/module/blob/master/src/statusengine.c

hedenface commented 6 years ago

So my broker don't use threads, but I guess this issue is related to internal Nagios threads and is just a timing issue.

Nagios does not use threads.

hedenface commented 6 years ago

ModGearman does, though.

hedenface commented 6 years ago

What order are those two broker modules called in your nagios.cfg? Try changing the actual order of them and then seeing if that fixes your problem.

hedenface commented 6 years ago

I'm going to close this issue. That said, if you're able to reproduce the problem with a Nagios Core running no additional broker modules other than yours, I'd love to see it and help.

Good luck!

nook24 commented 6 years ago

Hi @hedenface,

it's only my broker model loaded to Nagios Core. This has nothing to do with ModGearman. ModGearman is even not installed^^

My broker just make use of the gearman-job-server, to store nagios events as json objects. (like mysql gearman is just a "database")

hedenface commented 6 years ago

OOOhhhh.

My bad - I saw gearman and nagios and assumed you were using Mod Gearman. My apologies.

hedenface commented 6 years ago

Okay, well, quite frankly, your neb module is rather large and I'm not going to debug it. You'll need to strip it to the minimum amount of code required to still cause the issue, and then I'll take a look at it.

nook24 commented 6 years ago

Ok, I will remove everything, except NEBCALLBACK_LOG_DATA, and come back to you, if the module is still running into this issue.

nook24 commented 6 years ago

@hedenface I created a stripped down version of the broker, which only will save Nagios logentries to the gearman-job-server.

You can build the broker using this command:

LANG=C gcc -DNAGIOS -shared -o "statusengine.o" -fPIC  -Wall -Werror statusengine.c -luuid -levent -lgearman -ljson-c

Same issue at Nagios restart.

root@statusengine16:/opt/nagios# systemctl restart nagios

Systemd also stuck while doing the restart, as you can see at the timestamps:

[1527974975] Caught SIGTERM, shutting down...
[1527975065] Nagios 4.3.4 starting... (PID=5997)
90 seconds

Another shot:

root@statusengine16:/opt/nagios# time systemctl restart nagios

real    1m30.231s
user    0m0.000s
sys     0m0.000s
root@statusengine16:/opt/nagios#

(Fresh installation of Nagios 4.3.4 with default configuration

nagios.log reports:

[1527974975] Caught SIGTERM, shutting down...
[1527974975] Caught SIGTERM, shutting down...
[1527974975] Successfully shutdown... (PID=5959)
[1527974975] Event broker module 'NERD' deinitialized successfully.
[1527974975] [Statusengine] Deregister callbacks
[1527974975] [Statusengine] We are done here
[1527974975] [Statusengine] Bye
[1527974975] Event broker module '/opt/statusengine_nagios.o' deinitialized successfully.
[1527974975] Caught SIGTERM, shutting down...
[1527974975] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830
[1527975065] Nagios 4.3.4 starting... (PID=5997)

Second try:

[1527975488] Caught SIGTERM, shutting down...
[1527975488] Caught SIGTERM, shutting down...
[1527975488] Successfully shutdown... (PID=6061)
[1527975488] Event broker module 'NERD' deinitialized successfully.
[1527975488] [Statusengine] Deregister callbacks
[1527975488] [Statusengine] We are done here
[1527975488] [Statusengine] Bye
[1527975488] Event broker module '/opt/statusengine_nagios.o' deinitialized successfully.
[1527975488] Caught SIGTERM, shutting down...
[1527975488] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830
[1527975578] Nagios 4.3.4 starting... (PID=6226)

Here is the code: https://gist.github.com/nook24/1474fed0add839e95651094de4090bcc I used this commands to setup my Nagios: https://statusengine.org/tutorials/install-nagios4/ This is my nagios.cfg: https://gist.github.com/nook24/d1a42fa03d5861345945ac7fdbb6110a Tested on Ubuntu 16.04.

Many thank for your help :)

hedenface commented 6 years ago

Ooooh. Actually, this looks familiar. A while back, I was writing a neb module that used gearman as well. Something along the lines of ModGearman, but more Nagios-y. It never saw the light of day, because we decided to work on that functionality natively to core, but I got a lot of the broken pipes for a lot of different reasons.

I'll keep you updated.

hedenface commented 6 years ago

Also, your instructions are a bit off. If you place this in module/statusengine.c you need to either adjust the #include "lib/libnagios.h" or just compile with an -I.. (which is what I did).

hedenface commented 6 years ago

I'm not having the issues you described with this. How can I reproduce this issue?

hedenface commented 6 years ago

Also, I would use the logit function instead of the write_to_all_logs function - because if you pass the second argument to logit as TRUE, then debugging becomes much easier. You can just run the command directly and don't need to worry about tailing log files.

hedenface commented 6 years ago

What version of libgearman are you using? I had this issue a lot until I had upgraded. I wish I remembered the specific details.

On my system, this NEB module is running fine.

bryan@Lappy686 ~ $ dpkg -s libgearman-dev
Package: libgearman-dev
Status: install ok installed
Priority: extra
Section: libdevel
Installed-Size: 447
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libgearman7 (= 1.0.6-5.1build2), libevent-dev
Description: Development files for the Gearman Library
 Gearman is a system to farm out work to other machines, dispatching function
 calls to machines that are better suited to do work, to do work in parallel,
 to load balance lots of function calls, or to call functions between languages.
 .
 This package contains the development headers for the libgearman library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
bryan@Lappy686 ~ $ dpkg -s libgearman7
Package: libgearman7
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 170
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: same
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libc6 (>= 2.14), libgcc1 (>= 1:3.0), libstdc++6 (>= 4.1.1), libuuid1 (>= 2.20.1)
Description: Library providing Gearman client and worker functions
 Gearman is a system to farm out work to other machines, dispatching function
 calls to machines that are better suited to do work, to do work in parallel,
 to load balance lots of function calls, or to call functions between languages.
 .
 This package contains the C reimplementation of the client library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
bryan@Lappy686 ~ $ 
nook24 commented 6 years ago

Thanks for your investigations and time, i really appreciate this a lot!

First of all, I have the same version of libgearman as you have on your test system:

[20:45]root@statusengine16:/etc/init.d# dpkg -s libgearman-dev
Package: libgearman-dev
Status: install ok installed
Priority: extra
Section: libdevel
Installed-Size: 447
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libgearman7 (= 1.0.6-5.1build2), libevent-dev
Description: Development files for the Gearman Library
 Gearman is a system to farm out work to other machines, dispatching function
 calls to machines that are better suited to do work, to do work in parallel,
 to load balance lots of function calls, or to call functions between languages.
 .
 This package contains the development headers for the libgearman library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
[20:46]root@statusengine16:/etc/init.d# dpkg -s libgearman7
Package: libgearman7
Status: install ok installed
Priority: optional
Section: misc
Installed-Size: 170
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: same
Source: gearmand
Version: 1.0.6-5.1build2
Depends: libc6 (>= 2.14), libgcc1 (>= 1:3.0), libstdc++6 (>= 4.1.1), libuuid1 (>= 2.20.1)
Description: Library providing Gearman client and worker functions
 Gearman is a system to farm out work to other machines, dispatching function
 calls to machines that are better suited to do work, to do work in parallel,
 to load balance lots of function calls, or to call functions between languages.
 .
 This package contains the C reimplementation of the client library.
Original-Maintainer: Stig Sandbeck Mathisen <ssm@debian.org>
Homepage: http://launchpad.net/gearmand
[20:46]root@statusengine16:/etc/init.d#

Have you tested restarting Nagios Core through systemd or just send a SIGTERM via kill? I replaced write_to_all_logs with logit and noticed the following:

If I try to restart nagios through systemd, I run into the descriptive issue:

[20:52]root@statusengine16:/etc/init.d# time systemctl restart nagios

real    1m30.189s
user    0m0.000s
sys 0m0.000s

from nagios.log:

[1528051952] Caught SIGTERM, shutting down...
[1528051952] Caught SIGTERM, shutting down...
[1528051952] wproc: Socket to worker Core Worker 12154 broken, removing
[1528051952] wproc: Socket to worker Core Worker 12156 broken, removing
[1528051952] wproc: Socket to worker Core Worker 12157 broken, removing
[1528051952] wproc: Socket to worker Core Worker 12155 broken, removing
[1528051952] Successfully shutdown... (PID=12152)
[1528051952] Event broker module 'NERD' deinitialized successfully.
[1528051952] [Statusengine] Deregister callbacks
[1528051952] [Statusengine] We are done here
[1528051952] [Statusengine] Bye
[1528051952] Event broker module '/opt/statusengine_nagios.o' deinitialized successfully.
[1528051952] Caught SIGTERM, shutting down...
[1528051952] flush(Broken pipe) lost connection to server during send -> libgearman/connection.cc:830
[1528052042] Nagios 4.3.4 starting... (PID=12346)

As you can see, Nagios catches 3 times a SIGTERM

If i just kill nagios through kill, I can not reproduce the issue:

kill 12653

Nagios stdout:

[20:55]root@statusengine16:/tmp/module/src# sudo -u nagios /bin/bash
bash: /root/.bashrc: Keine Berechtigung
nagios@statusengine16:/tmp/module/src$ /opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg

Nagios Core 4.3.4
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 2017-08-24
License: GPL

Website: https://www.nagios.org
Nagios 4.3.4 starting... (PID=12653)
Local time is Sun Jun 03 20:56:31 CEST 2018
nerd: Channel hostchecks registered successfully
nerd: Channel servicechecks registered successfully
nerd: Channel opathchecks registered successfully
nerd: Fully initialized and ready to rock!
wproc: Successfully registered manager as @wproc with query handler
wproc: Registry request: name=Core Worker 12657;pid=12657
wproc: Registry request: name=Core Worker 12656;pid=12656
wproc: Registry request: name=Core Worker 12655;pid=12655
wproc: Registry request: name=Core Worker 12654;pid=12654
Statusengine - the missing event broker
Statusengine - the missing event broker
[Statusengine] Copyright (c) 2014 - present Daniel Ziegler <daniel@statusengine.org>
[Statusengine] Please visit https://www.statusengine.org for more information
[Statusengine] Contribute to Statusenigne at: https://github.com/nook24/statusengine
[Statusengine] Thanks for using Statusengine :-)
[Statusengine] Register callbacks
Event broker module '/opt/statusengine_nagios.o' initialized successfully.
Successfully launched command file worker with pid 12658
Caught SIGTERM, shutting down...
Successfully shutdown... (PID=12653)
[Statusengine] Deregister callbacks
[Statusengine] We are done here
[Statusengine] Bye
nagios@statusengine16:/tmp/module/src$

So maybe this is not a Nagios related issue itself, and my systemd config is buggy? Unfortunately I didn't found a Nagios systemd config, so i modified the config file of the Naemon Project. (With Naemon i don't run into this issue by the way) This is my /lib/systemd/system/nagios.service

[Unit]
Description=Nagios network monitor
After=network.target gearman-job-server.service

[Service]
Type=forking
PIDFile=/opt/nagios/var/nagios.lock
ExecStartPre=/opt/nagios/bin/nagios --verify-config /opt/nagios/etc/nagios.cfg
ExecStart=/opt/nagios/bin/nagios --daemon /opt/nagios/etc/nagios.cfg
ExecReload=/bin/kill -HUP $MAINPID
User=nagios
Group=nagios
StandardOutput=journal
StandardError=inherit

[Install]
WantedBy=multi-user.target

Can you please past me your complete gcc command? I tried to use -I but my C skills are not available and I was not able to compile the broker, if i set -I and remov all nagios includes...

//Edit The issue only occurs, if the broker exports log data. If I disable log data and only export status data, notifications, acks, host checks etc.. the issue is gone


Updated gist with logit function: https://gist.github.com/nook24/1474fed0add839e95651094de4090bcc

hedenface commented 6 years ago

Take systemd out of the equation.

Run the following:

/usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg

And run it in the terminal. Then, in another terminal, run the following:

tail -f /usr/local/nagios/var/nagios.log

So, put your statusengine.c file in the module/ directory, and then run the following command:

gcc -DNAGIOS -shared -o "statusengine.o" -fPIC -I..  -Wall -Werror statusengine.c -luuid -levent -lgearman -ljson-c
hedenface commented 6 years ago

I also saw the 3 messages. I was sending a kill -s TERM $pid in another terminal to get it to shut down gracefully. Even with a ctrl+c, or the TERM, I was unable to see the error. Also though, my config is 1 host and 4 services - so maybe I just need them to be doing things. I did submit some passive check results, but that didn't help.

hedenface commented 6 years ago

Take a look here for inspiration: https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L36

hedenface commented 6 years ago

Edit: I'm dumb. I forgot you set up the gist because I told you to strip it down :)

nook24 commented 6 years ago

Jep, the project is available at GitHub: https://github.com/statusengine and https://statusengine.org/ if you are interested in more information.

Can you please check, if you run into this issue if you use systemd to restart your nagios core? Because i was also not able to reproduce the issue, if I just send kill to nagios core in a second terminal... This is a bit weird, because systemd should also only send a SIGTERM on restart.

How every, i will create a short bash script, this evening, which hopefully will build a system where the issue occurs. Than you can try this on a plain Ubuntu VM, LXC Container or DO Droplet.

I will also try to add the signal(SIGPIPE, SIG_IGN);. At my first try some constants or what every where missing, need to check what i need to include... (I am really not a C guy )

nook24 commented 6 years ago

I crated a bash script, which install Nagios 4.3.4 and the stripped Statusengine broker. I tested this script on a blank Ubuntu 16.04 digitalocean Droplet, but it should work on every blank Ubuntu 16.04 box: https://gist.github.com/nook24/a5b6ae30f56d4f6187837d85382e50ef

I also noticed, that I can not reproduce the issue, if I run Nagios in foreground and kill it in a second terminal. Also if I run bin/nagios -d etc/nagios.cfg kill it with a second terminal, everything works.

But: If if try to restart or stop nagios via systemd, things get smashed up. I also recorded my terminal for you, so you can see what I've done so far: https://asciinema.org/a/poIzLG9oLL1mDD7ovrVNXyKJD

The new broker version with signal(): https://gist.github.com/nook24/1474fed0add839e95651094de4090bcc

As already mentioned, this only happens, if the broker export nagios logentires. If I disable logentries in the broker module, everything works fine - I have no idea...

hedenface commented 6 years ago

Weird. I watched your video. Thanks for that.

What does the service file look like? You compiled core from source from the latest master branch? Or are you using a package?

nook24 commented 6 years ago

I build Nagios from the source tar.gz:

wget https://github.com/NagiosEnterprises/nagioscore/releases/download/nagios-4.3.4/nagios-4.3.4.tar.gz
tar xfv nagios-4.3.4.tar.gz
cd nagios-4.3.4/

./configure --prefix=/opt/nagios --with-nagios-user=nagios --with-nagios-group=nagios
mkdir /opt/nagios
make all
make install
make install-commandmode
make install-config

https://gist.github.com/nook24/a5b6ae30f56d4f6187837d85382e50ef#file-nagios_deploy-sh-L22-L32

This is my systemd service file:

[Unit]
Description=Nagios network monitor
After=network.target
[Service]
Type=forking
PIDFile=/opt/nagios/var/nagios.lock
ExecStartPre=/opt/nagios/bin/nagios --verify-config /opt/nagios/etc/nagios.cfg
ExecStart=/opt/nagios/bin/nagios --daemon /opt/nagios/etc/nagios.cfg
ExecReload=/bin/kill -HUP \$MAINPID
User=nagios
Group=nagios
StandardOutput=journal
StandardError=inherit
[Install]
WantedBy=multi-user.target

https://gist.github.com/nook24/a5b6ae30f56d4f6187837d85382e50ef#file-nagios_deploy-sh-L145-L161

I noticed this issue first with Nagios 4.0.8. But i guess 4.0.8 was the first version I started via systemd....

hedenface commented 6 years ago

Please use the current master branch (upcoming 4.4.0 release) and see what happens then. Make sure to overwrite existing systemd file.

nook24 commented 6 years ago

@hedenface nice, looks like the issue is gone with Nagios Core 4.4.0 and the default systemd configuration out of the repo:

[Unit]
Description=Nagios Core 4.4.0
Documentation=https://www.nagios.org/documentation
After=network.target local-fs.target

[Service]
Type=forking
ExecStartPre=/opt/nagios/bin/nagios -v /opt/nagios/etc/nagios.cfg
ExecStart=/opt/nagios/bin/nagios -d /opt/nagios/etc/nagios.cfg
ExecStop=/bin/kill -s TERM ${MAINPID}
ExecStopPost=/bin/rm -f /opt/nagios/var/rw/nagios.cmd
ExecReload=/bin/kill -s HUP ${MAINPID}

[Install]
WantedBy=multi-user.target

I will update my documentation https://github.com/statusengine/docs/issues/3 and close this issue.

Many thanks for your help!

hedenface commented 6 years ago

Excellent. Glad to help.

Coming up for Core 5, I plan on doing some work for the NEB API and then also documenting it. Would you be willing to help with the documentation whenever that starts?

nook24 commented 6 years ago

A documentation is definitely a good idea! Most of the time I had no clue of what I was doing :)

Drop me a mail when you like to start the documentation and I will see if I can help you. https://github.com/statusengine/module/blob/master/src/statusengine.c#L15

Is there a roadmap for Nagios 5 available?