NagiosEnterprises / nsca

NSCA Passive Check Daemon
GNU General Public License v2.0

Nagios Core 4.2.3, NSCA 2.9.2RC1 - Hung after Nagios restart #11

Open. soxfor opened this issue 7 years ago

soxfor commented 7 years ago

Tried with both NSCA 2.9.1 and NSCA 2.9.2 RC1: NSCA processes hang if Nagios is restarted. NSCA is running under xinetd on CentOS 6 (fully up to date). I haven't tried it with the previous Nagios version (this was an upgrade from Nagios Core 4.1.1 with NSCA 2.7.2 to Nagios Core 4.2.3 with NSCA 2.9.1, and now 2.9.2RC1).

With NSCA 2.9.1, without restarting Nagios, it reached more than 40,000 hung processes. This master server receives over 20,000 passive checks per hour.

strace from one of the hung processes after forcing a Nagios restart (service nagios restart):

strace -p 24867
Process 24867 attached
open("/usr/local/nagios/var/rw/nagios.cmd", O_WRONLY) = 4
fcntl(4, F_GETFL)                       = 0x8001 (flags O_WRONLY|O_LARGEFILE)
fstat(4, {st_mode=S_IFIFO|0660, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7d6ad93000
lseek(4, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
write(4, "[1479838042] PROCESS_SERVICE_CHE"..., 106) = 106
close(4)                                = 0
munmap(0x7f7d6ad93000, 4096)            = 0
recvfrom(0, "", 4304, 0, NULL, NULL)    = 0
sendto(3, "<27>Nov 22 18:18:43 nsca[24867]:"..., 53, MSG_NOSIGNAL, NULL, 0) = 53
close(0)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
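
When the pile-up happens, a quick way to see how many nsca children are stuck and where they are blocked (a hedged sketch; it assumes the children still show up under the name nsca and that procps and strace are available, as on this CentOS box):

ps -C nsca -o pid,etime,stat,wchan:20,cmd    # how many, how old, and which kernel wait channel each is blocked in
strace -p <pid>                              # attach to one of them, as in the trace above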

nsca.cfg:

pid_file=/var/run/nsca.pid
server_port=5667
server_address=192.168.X.X
nsca_user=nagios
nsca_group=nagios
debug=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
alternate_dump_file=/usr/local/nagios/var/rw/nsca.dump
aggregate_writes=0
append_to_file=0
max_packet_age=30
password=<password>
decryption_method=0

xinetd config:

service nsca
{
        disable         = no
        flags           = REUSE
        socket_type     = stream
        wait            = no
        per_source      = UNLIMITED
        instances       = UNLIMITED
        user            = nagios
        group           = nagios
        server          = /usr/local/nagios/bin/nsca
        server_args     = -c /usr/local/nagios/etc/nsca.cfg --inetd
        log_on_failure  += USERID
        only_from       = <several ips>
}
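
(For reference, a hedged end-to-end test of this setup from a client host: the host/service names are made up, the client-side paths are assumed to mirror the server layout above, and send_nsca.cfg must carry the matching password/encryption settings. send_nsca reads tab-separated lines of host name, service description, return code and plugin output on stdin.)

printf 'somehost\tSome Service\t0\tOK - test result via send_nsca\n' | \
    /usr/local/nagios/bin/send_nsca -H 192.168.X.X -p 5667 -c /usr/local/nagios/etc/send_nsca.cfg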

soxfor commented 7 years ago

Pull request #7 seems to fix it. Currently testing.

jfrickson commented 7 years ago

Looking into this

soxfor commented 7 years ago

Hey @jfrickson, I've now upgraded to Nagios 4.3.1 (because of this commit: https://github.com/NagiosEnterprises/nagioscore/commit/cde8780d2f042472e75b1f7f56f187c634a2b06a ) and to NSCA 2.9.2 (without any extra commits applied). The hung nsca processes / CLOSE_WAIT connections still happen unless this commit https://github.com/NagiosEnterprises/nagioscore/commit/c814b7e7959583699de47998a697d4b9df13d12c is applied to the Nagios init.d script.
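
(A hedged way to verify this: assuming the port 5667 from the config above and the classic net-tools/procps utilities on CentOS, both of the counts below should stay near zero across a "service nagios restart" once the init-script change is applied.)

ps -C nsca --no-headers | wc -l                      # lingering nsca processes
netstat -tan | grep ':5667 ' | grep -c CLOSE_WAIT    # half-closed client connections on the NSCA port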

hedenface commented 7 years ago

I'm going to test what you've stated @soxfor and will get back to you soon with relevant information.

LHozzan commented 5 years ago

Hello. We are facing this problem too. Our environment: CentOS 6, Nagios 4.3.4, NSCA 2.7.2. NSCA is running as a service (not via inetd). Same behavior: when the nagios service is restarted, NSCA sometimes hangs and sometimes crashes, without any errors in any log files.

LHozzan commented 5 years ago

Hello. After some investigation (the environment is now CentOS 7, Nagios 4.3.4, NSCA 2.9.2), the same problem occurred. I enabled debug mode in NSCA and restarted Nagios. Nagios stops receiving passive checks from NSCA, but the NSCA server still accepts new connections from clients and forwards them to the Nagios socket. The problem seems to be on the NSCA side: it keeps sending the checks to the old socket while Nagios is listening on a new one. In the meantime, does anybody have a solution for this problem? Thank you.
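
(The suspicion can be illustrated with a plain named pipe, entirely outside NSCA/Nagios. This is only a sketch of the suspected mechanism, a long-lived writer descriptor surviving a reader-side re-creation of the FIFO, not NSCA's actual code path.)

mkfifo /tmp/demo.fifo
cat /tmp/demo.fifo &                        # "Nagios": reader #1
exec 3>/tmp/demo.fifo                       # "NSCA": open the pipe once and keep fd 3
echo "result 1" >&3                         # delivered to reader #1
kill %1                                     # "restart": the old reader goes away ...
rm /tmp/demo.fifo; mkfifo /tmp/demo.fifo    # ... and the command file is re-created
cat /tmp/demo.fifo &                        # reader #2 opens the NEW pipe
( echo "result 2" >&3 )                     # fd 3 still points at the old, unlinked pipe:
                                            # "Broken pipe", and reader #2 never sees it
exec 3>&-; exec 3>/tmp/demo.fifo            # re-opening the path attaches to the new pipe
echo "result 3" >&3                         # reader #2 prints this one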

LHozzan commented 5 years ago

Hello. The problem is certainly in NSCA: when Nagios is stopped/restarted, NSCA does not detect it.

[root@status.global-devel cmd]$ lsof /var/spool/nagios/cmd/nagios.cmd
COMMAND PID USER   FD  TYPE DEVICE SIZE/OFF    NODE NAME
nagios  981 nagios 11u FIFO    8,1      0t0 1707318 /var/spool/nagios/cmd/nagios.cmd

- restarting NSCA solves this problem (see the scripted sketch after the lsof output below)

[root@status.global-devel cmd]$ service nsca restart
Redirecting to /bin/systemctl restart nsca.service

[root@status.global-devel cmd]$ lsof /var/spool/nagios/cmd/nagios.cmd
COMMAND  PID USER   FD  TYPE DEVICE SIZE/OFF    NODE NAME
nagios   981 nagios 11u FIFO    8,1      0t0 1707318 /var/spool/nagios/cmd/nagios.cmd
nsca    1530 nagios  6w FIFO    8,1      0t0 1707318 /var/spool/nagios/cmd/nagios.cmd
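
(A rough rendering of that workaround as a cron-able check. The pipe path and unit name are the ones from this environment; the lsof/grep test is only a sketch and assumes lsof prints the command name in the first column.)

#!/bin/sh
# If no nsca process has the Nagios command pipe open any more, restart nsca
# so that it re-attaches to the (possibly re-created) pipe.
CMD_PIPE=/var/spool/nagios/cmd/nagios.cmd
if ! lsof "$CMD_PIPE" 2>/dev/null | grep -q '^nsca '; then
    systemctl restart nsca.service
fi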


@hedenface any ETA for a fix?

sawolf commented 4 years ago

Hi @LHozzan,

I took a look into this, and unfortunately I haven't been able to reproduce the behavior so far. However, I tested against the maint branch of NSCA and the most recent release of Nagios Core, so it may be that this bug has already been fixed. Here's the behavior I get when using the default settings for NSCA (plus a password):

When nagios core and nsca are both running, lsof doesn't show nsca attached to nagios.cmd:

[root@localhost nsca]# lsof /usr/local/nagios/var/rw/nagios.cmd 
COMMAND   PID   USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
nagios  64555 nagios   15u  FIFO  253,0      0t0 17648749 /usr/local/nagios/var/rw/nagios.cmd

I still get check results from core (this is from /var/log/messages):

Mar 27 11:05:45 localhost nagios: SERVICE ALERT: localhost;PING;WARNING;SOFT;1;WARNING: nsca is sending this result

When nagios core is stopped, commands are written to nsca.dump. These aren't submitted to core when it restarts, but you can manually pipe them into nagios.cmd if desired.
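
(A hedged sketch of that manual replay, using the default paths from this thread and assuming nsca.dump contains ordinary external-command lines; results may of course be stale by the time they are replayed.)

cat /usr/local/nagios/var/rw/nsca.dump > /usr/local/nagios/var/rw/nagios.cmd
: > /usr/local/nagios/var/rw/nsca.dump    # empty the dump so nothing is submitted twice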

When nagios core is restarted (without touching the NSCA process or changing anything in send_nsca on the client machine), I can still send check results and have them processed:

Mar 27 11:10:50 localhost nagios: SERVICE ALERT: localhost;PING;WARNING;SOFT;2;WARNING: nsca is sending this result

Let me know if you can get different results from core 4.4.5 and the maint branch of this repo with default settings.

LHozzan commented 3 years ago

Hello @sawolf. My apologies for the delay, I was busy. Many thanks for your effort, I really appreciate it. Unfortunately, in the meantime we have changed our monitoring solution, so please consider my request solved. Best regards.