NagiosEnterprises / ndoutils

NDOUtils - Database Output for Nagios Core
GNU General Public License v2.0
48 stars 21 forks source link

Random queue recv error #61

Open tatref opened 4 years ago

tatref commented 4 years ago

Hi,

This is an old issue for me (several years), but I was reluctant to open an issue, because I thought the issue was on my side. Since I'm not sure of this anymore, here is the issue:

I am encountering random queue receive errors

ndo2db: Error: queue recv error.

The database is up and working, as well as nagios. Restarting the database or nagios does not solve the issue. The only workaround is to restart the ndo2db daemon.

When this happens /var/log/messages is filled with error messages, filling up the disk.

I administer a lot of Nagios instances in various versions, but at least the following versions did encounter the issue:

The kernel parameter are tuned. Even if the values are way to big, this shouldn't be an issue:

I don't know how to reproduce the issue.

Does any one else is experiencing this? Could it be possible to reduce the logging (1 message per minute for example)?

Thanks everyone for your feedback!

jorgeroman commented 3 years ago

Hi,

We have the same problem.

nagios 4.4.3 ndo 2.1.3 OS: Debian 9

The issue here it's has been working right quite good until a week ago. From that moment our syslog are flooded of this mesaages:

ndo2db: Error: queue recv error.

We have launched a strace to the ndo2db process and the problem seems to be when ndo try to get the message from the kernel queue:

27010 09:59:15 msgrcv(163840, {1, "\n206:\n1=701\n2=0\n3=0\n4=1611136755.778340\n53=edge-lax10g1\n114=TPS-IOPS /dev/sda\n12=0\n25=1\n76=4\n121=1\n118=0\n123=60\n127=\n13=\n14=\n117"...}, 1024, 1, MSG_NOERROR) = 1024 27010 09:59:15 poll([{fd=6<UNIX:[1275946->1275428]>, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) 27010 09:59:15 sendto(6<UNIX:[1275946->1275428]>, "\322\3\0\0\3INSERT INTO nagios_servicechecks SET instance_id='1', service_object_id='775', check_type='0', current_check_attempt='1', m"..., 982, 0, NULL, 0) = 982 27010 09:59:15 recvfrom(6<UNIX:[1275946->1275428]>, "\17\0\0\1\0\1\376\v\210\255K\0\0\0\0\2\0\0\0", 16384, 0, NULL, NULL) = 19 27010 09:59:15 (163840, {1, "\n213:\n1=1202\n2=0\n3=0\n4=1611136755.778355\n53=edge-lax10g1\n114=TPS-IOPS /dev/sda\n95='IOPS ': 573.00\n125=\n99='IOPS'=573.00\n27=0\n51="...}, 1024, 1, MSG_NOERROR) = 1024 27010 09:59:15 poll([{fd=6<UNIX:[1275946->1275428]>, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) 27010 09:59:15 sendto(6<UNIX:[1275946->1275428]>, "]\v\0\0\3INSERT INTO nagios_servicestatus SET instance_id='1', service_object_id='775', status_update_time=FROM_UNIXTIME(1611136755)"..., 2913, 0, NULL, 0) = 2913 27010 09:59:15 recvfrom(6<UNIX:[1275946->1275428]>, "\7\0\0\1\0\0\0\2\0\0\0", 16384, 0, NULL, NULL) = 11 27010 09:59:15 msgrcv(163840, 0x7ffea1034830, 1024, 1, MSG_NOERROR) = -1 EIDRM (Identifier removed) 27010 09:59:15 sendto(7<UNIX:[1275429->432]>, "<11>Jan 20 09:59:15 ndo2db: Error: queue recv error.\n", 53, MSG_NOSIGNAL, NULL, 0) = 53 27010 09:59:15 msgrcv(163840, 0x7ffea1034830, 1024, 1, MSG_NOERROR) = -1 EINVAL (Invalid argument)

Let me explain the snipe. at the begining, after a while working properly, ndo get a message from kernel queue normaly (msgrcv) and insert that record in mysql but next time when ndo try to get another message we have this error:

27010 09:59:15 msgrcv(163840, 0x7ffea1034830, 1024, 1, MSG_NOERROR) = -1 EIDRM (Identifier removed) 27010 09:59:15 sendto(7<UNIX:[1275429->432]>, "<11>Jan 20 09:59:15 ndo2db: Error: queue recv error.\n", 53, MSG_NOSIGNAL, NULL, 0) = 53 27010 09:59:15 msgrcv(163840, 0x7ffea1034830, 1024, 1, MSG_NOERROR) = -1 EINVAL (Invalid argument)

We are still investigating but we are lost in this moment. Finally ¿do you fix the problem?

Many thanks!

tatref commented 3 years ago

Hi,

No the issue still happens from time to time. I never checked strace I will next time.

I also checked the queue size, and the size was at the max (kernel.msgmax).

Here you can see multiple queues, and the first one is full, but not drained anymore by ndo. The last queue (msqid 950276) is the current one. The others are probably old queues not deleted properly by ndo/nagios.

[root@xxx~]# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xff000002 425984     nagios     600        131072000    128000
0x77000002 393217     nagios     600        0            0
0x9d000002 524290     nagios     600        0            0
0x7c000002 917507     nagios     600        0            0
0xffffffff 950276     nagios     600        0            0

[root@xxx~]# cat /etc/sysctl.d/ndo.conf
kernel.msgmnb = 131072000
kernel.msgmax = 131072000
kernel.msgmni = 65536000

If you still have the issue, can you compare the queue id in ipcs -q with the msgrcv syscall?

jorgeroman commented 3 years ago

In our case we happens almost instantly when you start the service, but our nagios has a lot check per minutes. We have checked the queues but never reach the queue limit size (kernel.msgmnb = 524288000). In fact the problem begins with very few messages in the queue:

------ Message Queues -------- key msqid owner perms used-bytes messages 0xd7010002 4816896 nagios 600 18432 18

We are compiled ndo2db with valgrind in order to debug the problem and we realize that in any point the the ndo2db lost the queue ID and is when everything starts to fail. it seems like the queue was destroy a create again with different queue ID.

For example we start the process:

------ Message Queues -------- key msqid owner perms used-bytes messages 0x69018b3b 4849664 nagios 600 1887232 1843

------ Message Queues -------- key msqid owner perms used-bytes messages 0x69018b3b 4849664 nagios 600 2112512 2063

------ Message Queues -------- key msqid owner perms used-bytes messages 0x69018b3b 4849664 nagios 600 2460672 2403

.... everything is fine but suddenly..

------ Message Queues -------- key msqid owner perms used-bytes messages 0x69018b3b 4882432 nagios 600 10240 10

the queue id has change and start to fail.

In the strace we see the same:

9034 1611248784.347062 msgrcv(4849664, {1, "\n213:\n1=1202\n2=0\n3=0\n4=1611248746.589179\n53=edgeserver-graovh1\n114=Varnish Bad Things\n95=backend_fail:20 client_drop:0 n_wrk_drop:0 n_wrk_failed:0 ban_list:1987 bans_completed:786 bans_diff:1201 accept_fail:5\n125=\n99=backend_fail=20;;;0; client_drop=0;;;0; n_wrk_drop=0;;;0; n_wrk_failed=0;;;0; accept_fail=5;;;0; bans_completed=786;;;0; bans_diff=1201;;;0; n_ban=1987;;;0;\n27=0\n51=1\n115=1\n25=1\n76=4\n61=1611248743\n83=1611248803\n12=0\n63=1610642085\n57=1610642085\n56=0\n66=1611248743\n70=0\n67=0\n64=1610639890\n121=1\n6\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 1024, 1, MSG_NOERROR) = 1024 <0.000014>

and next time to try get a message from the queue try to get form queue id 4849664 but not longer exists:

9034 1611248784.350504 msgrcv(4849664, 0x7ffefc65e870, 1024, 1, MSG_NOERROR) = -1 EINVAL (Invalid argument) <0.000015> 9034 1611248784.350577 sendto(8, "<11>Jan 21 17:06:24 ndo2db: Error: queue recv error.\n", 53, MSG_NOSIGNAL, NULL, 0) = 53 <0.000044>

We are still investigating and if we have any other new let me update to you.

Many thanks!

tatref commented 3 years ago

Same here, but we only see the issue after a while, so the queue ends up filled in the meantime.

jorgeroman commented 3 years ago

Hi again,

After a lot of investigation we realized that it wasn't ndo2db fault, so it was to be another thing tnat it was removing the queue and creating a new one.

In the syslog log file we always see the same line before the error starts:

Jan 22 07:06:23 admin-eemgcp1 systemd[1]: Removed slice User Slice of nagios. Jan 22 07:06:23 admin-eemgcp1 ndo2db-4x: Error: queue recv error.

So we thought it could systemd fault and indeed it was, specifically RemoveIPC.

   RemoveIPC=
       Controls whether System V and POSIX IPC objects belonging to
       the user shall be removed when the user fully logs out. Takes
       a boolean argument. If enabled, the user may not consume IPC
       resources after the last of the user's sessions terminated.
       This covers System V semaphores, shared memory and message
       queues, as well as POSIX shared memory and message queues.
       Note that IPC objects of the root user and other system users
       are excluded from the effect of this setting. Defaults to
       "yes".

This setting by defaults is a yes.

So to fix the problem we have edited /etc/systemd/logind.conf and force RemoveIPC=no. Next restart to systemd-logind and restart nagios and everything is working fine now.

All the best

tatref commented 3 years ago

Nice found!

However in my case, 2 servers that had issues are running Centos 6, so no systemd here.

Also I check on Centos 7.9.2009, and the default value is

# Entries in this file show the compile time defaults.
#RemoveIPC=no

It seems that for Redhat-like OSes, the parameter is "no"

nook24 commented 3 years ago

I have created a fully working NDO drop-in replacement https://statusengine.org/oldstable/ It runs in production on many many instances and is very well tested.

Looks like there are no more major updates coming to the public ndoutils repository: https://github.com/NagiosEnterprises/ndoutils/issues/57#issuecomment-596603014