Linuxfabrik / monitoring-plugins

220+ check plugins for Icinga and other Nagios-compatible monitoring applications. Each plugin is a standalone command line tool (written in Python) that provides a specific type of check.
https://linuxfabrik.ch
The Unlicense

ping: Check plugin too quickly reports state "error" #691

Closed Beleggrodion closed 10 months ago

Beleggrodion commented 1 year ago

Bug description

I don't know whether this is actually a bug or rather a misunderstanding on my part.

I am currently setting up a new Nagios instance (with the new icingadb instead of the monitoring web plugin). On the old instance we had the problem that ping often reported (in my opinion) false "down" messages. So on the new instance I switched to your ping check plugin (another reason was the additional performance data). But the problem exists there too. For comparison, tools like "smokeping" with 10 pings every 30 seconds don't see any packet loss. fping as a probe doesn't see any issue either (smokeping uses fping).

fping: (image)

ping (I called it ping-ng because "ping" is the default ping of Icinga): (image)

Or here, for comparison, the smokeping output: (image)

So for me, and that's why it looks like a bug: if only 1 packet is missing (I can set the count high, etc.) or one has the type "error", the state directly becomes "Destination Host Unreachable", although the host is reachable and working. Other services didn't have any problems either. So the check switches between "Critical" and "OK" without a "Warning" state in between, and I doubt that this is really critical, because none of the other services show any issue.

Steps to reproduce - Plugin call

/usr/lib64/nagios/plugins/ping -H 192.168.1.1 --interval=0.2 --count=5 --timeout=5

Steps to reproduce - Data

Destination Host Unreachable. PING 192.168.1.1: 5 packets transmitted, 4 received, +1 errors, 20% packet loss, time 804ms. rtt min/avg/max/mdev = 8.457/8.532/8.600/0.052 ms|'transmitted'=5;;;0; 'received'=4;;;0; 'duplicates'=0;;;0; 'checksum_corrupted'=0;;;0; 'errors'=1;;;0; 'packet_loss'=20%;;;0;100 'time'=804ms;;;0; 'rtt_min'=8.457ms;;;0; 'rtt_avg'=8.532ms;;;0; 'rtt_max'=8.600ms;;;0; 'rtt_mdev'=0.052ms;;;0;
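
For readers who want to inspect such output programmatically, here is a minimal sketch that splits the performance data (everything after the first "|") into labels, values and units. It assumes the common Nagios-style `'label'=value[unit];warn;crit;min;max` layout seen above; the names `PERF_ITEM` and `parse_perfdata` are made up for this illustration and are not part of the plugin, and the sample is an abbreviated copy of the output above.

```python
import re

# One perfdata item: 'label'=value[unit];warn;crit;min;max
# (hypothetical helper pattern, not taken from the plugin)
PERF_ITEM = re.compile(r"'([^']+)'=([-0-9.]+)([^;\s]*)((?:;[^;\s]*)*)")


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} from the part after the first '|'."""
    _, _, perfdata = plugin_output.partition("|")
    result = {}
    for label, value, unit, _thresholds in PERF_ITEM.findall(perfdata):
        result[label] = (float(value), unit)
    return result


# Abbreviated copy of the plugin output quoted above.
sample = (
    "Destination Host Unreachable. PING 192.168.1.1: 5 packets transmitted, "
    "4 received, +1 errors, 20% packet loss, time 804ms."
    "|'transmitted'=5;;;0; 'received'=4;;;0; 'errors'=1;;;0; "
    "'packet_loss'=20%;;;0;100 'rtt_avg'=8.532ms;;;0;"
)

perf = parse_perfdata(sample)
print(perf["received"], perf["errors"], perf["packet_loss"])
```

The point of the sketch is that 4 of 5 packets were received while the status text still says "Destination Host Unreachable".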

Environment

Linux mon01.rma01.4s-rma.intra 5.15.0-75-generic #82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Plugin Version

ping: v2023051201 by Linuxfabrik GmbH, Zurich/Switzerland

Python version

No response

List of Python modules

No response

Additional Information

No response

markuslf commented 1 year ago

The ping check is very tolerant - if you send 5 packets, it will even tolerate 4 packet losses (80%), so the check is usually satisfied if there is at least one response to a packet.

But what it will not tolerate is any return code from the built-in ping command other than "0". You have a transmission error (+1 errors), which is reported by ping and therefore also by the plugin.

BTW, the native ping command returns "1" if "(it does not get any packets back) or (the number of packets received is lower than the number of packets requested and the deadline has been reached)".

What does the native ping -c 5 -i 0.2 -w 5 -q 192.168.1.1 tell you?

Beleggrodion commented 1 year ago

In most cases the native ping gives me normal output, as expected. I ran your command several times until I received an error (I needed 10-20 runs until an error occurred).

# ping -c 5 -i 0.2 -w 5 -q 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 803ms
rtt min/avg/max/mdev = 0.721/0.808/0.937/0.074 ms

And when an error occurs:

# ping -c 5 -i 0.2 -w 5 -q 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.

--- 192.168.1.1 ping statistics ---
3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 403ms
rtt min/avg/max/mdev = 0.667/0.918/1.169/0.251 ms
# echo $?
1
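
To compare the two runs above without relying on the exit code alone, the statistics line can be parsed directly. The following is a sketch under assumptions: the regular expression only covers the two line shapes shown in this thread (with and without "+N errors"), and `SUMMARY` and `parse_summary` are hypothetical names.

```python
import re

# Matches the two summary shapes quoted in this thread; other variants
# (for example lines that also report duplicates) are not handled.
SUMMARY = re.compile(
    r"(?P<transmitted>\d+) packets transmitted, "
    r"(?P<received>\d+) received, "
    r"(?:\+(?P<errors>\d+) errors, )?"
    r"(?P<loss>[\d.]+)% packet loss"
)


def parse_summary(line):
    """Return the counters from a ping statistics line, or None."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    return {
        "transmitted": int(match["transmitted"]),
        "received": int(match["received"]),
        "errors": int(match["errors"] or 0),  # group is absent without errors
        "loss": float(match["loss"]),
    }


good = parse_summary("5 packets transmitted, 5 received, 0% packet loss, time 803ms")
bad = parse_summary(
    "3 packets transmitted, 2 received, +1 errors, 33.3333% packet loss, time 403ms"
)
print(good)
print(bad)
```

In the failing run, ping stopped after 3 transmitted packets and still received 2 replies, so the host clearly answered.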

And here is a tcpdump from the above generated error on the same machine:

14:48:38.214078 IP 192.168.1.201 > 192.168.1.1: ICMP echo request, id 100, seq 1, length 64
14:48:38.214561 IP 192.168.1.1 > 192.168.1.201: ICMP echo reply, id 100, seq 1, length 64
14:48:38.415455 IP 192.168.1.201 > 192.168.1.1: ICMP echo request, id 100, seq 2, length 64
14:48:38.416349 IP 192.168.1.1 > 192.168.1.201: ICMP echo reply, id 100, seq 2, length 64
14:48:38.619384 IP 192.168.1.201 > 192.168.1.1: ICMP echo request, id 100, seq 3, length 64
14:48:38.619763 IP 192.168.1.1 > 192.168.1.201: ICMP echo reply, id 100, seq 3, length 64
14:48:38.823458 IP 192.168.1.201 > 192.168.1.1: ICMP echo request, id 100, seq 4, length 64
14:48:38.823959 IP 192.168.1.1 > 192.168.1.201: ICMP echo reply, id 100, seq 4, length 64
14:48:39.027421 IP 192.168.1.201 > 192.168.1.1: ICMP echo request, id 100, seq 5, length 64
14:48:39.027824 IP 192.168.1.1 > 192.168.1.201: ICMP echo reply, id 100, seq 5, length 64

markuslf commented 1 year ago

There is some kind of error on your network: Specifying '-w 5' means that ping will stop after 5 seconds regardless of how many '-c' packets have been sent or received. In your case, ping does not stop after sending '-c 5' packets, it waits either for the timer to expire, or for count probes to be answered, or for some error notification from the network.

In other words: In your network, 5 burst pings take more than 5 seconds to respond. For the check, this is the reason to throw a critical, because it could mean that the target host is offline.

You could try increasing the '--timeout' parameter of the check (check-plugins/ping/README.rst), but you should look for the underlying cause.

Beleggrodion commented 1 year ago

I tested it with an extremely increased timeout, and the error occurred there again as well. To me it looks like ping changes something in its internal handling as soon as '-w XX' is set, because when I use ping without any parameters:

~$ ping 212.x.x.x
PING 212.x.x.x (212.x.x.x) 56(84) bytes of data.
64 bytes from 212.x.x.x: icmp_seq=1 ttl=254 time=0.747 ms
64 bytes from 212.x.x.x: icmp_seq=2 ttl=254 time=0.520 ms
64 bytes from 212.x.x.x: icmp_seq=3 ttl=254 time=0.545 ms
64 bytes from 212.x.x.x: icmp_seq=4 ttl=254 time=0.514 ms
64 bytes from 212.x.x.x: icmp_seq=5 ttl=254 time=0.550 ms
64 bytes from 212.x.x.x: icmp_seq=6 ttl=254 time=0.529 ms
From 192.168.2.2 icmp_seq=7 Redirect Network(New nexthop: 192.168.2.1)
64 bytes from 212.x.x.x: icmp_seq=7 ttl=254 time=0.519 ms
64 bytes from 212.x.x.x: icmp_seq=8 ttl=254 time=0.496 ms
64 bytes from 212.x.x.x: icmp_seq=9 ttl=254 time=0.514 ms
64 bytes from 212.x.x.x: icmp_seq=10 ttl=254 time=0.532 ms
64 bytes from 212.x.x.x: icmp_seq=11 ttl=254 time=0.570 ms
64 bytes from 212.x.x.x: icmp_seq=12 ttl=254 time=0.575 ms
64 bytes from 212.x.x.x: icmp_seq=13 ttl=254 time=0.547 ms
64 bytes from 212.x.x.x: icmp_seq=14 ttl=254 time=0.552 ms

And as soon as I set the parameter:

~$ ping 212.x.x.x -w 50
PING 212.x.x.x (212.x.x.x) 56(84) bytes of data.
64 bytes from 212.x.x.x: icmp_seq=1 ttl=254 time=0.704 ms
64 bytes from 212.x.x.x: icmp_seq=2 ttl=254 time=0.502 ms
64 bytes from 212.x.x.x: icmp_seq=3 ttl=254 time=0.432 ms
64 bytes from 212.x.x.x: icmp_seq=4 ttl=254 time=0.459 ms
64 bytes from 212.x.x.x: icmp_seq=5 ttl=254 time=0.565 ms
64 bytes from 212.x.x.x: icmp_seq=6 ttl=254 time=0.543 ms
64 bytes from 212.x.x.x: icmp_seq=7 ttl=254 time=0.512 ms
64 bytes from 212.x.x.x: icmp_seq=8 ttl=254 time=0.650 ms
64 bytes from 212.x.x.x: icmp_seq=9 ttl=254 time=0.555 ms
64 bytes from 212.x.x.x: icmp_seq=10 ttl=254 time=0.594 ms
64 bytes from 212.x.x.x: icmp_seq=11 ttl=254 time=0.574 ms
64 bytes from 212.x.x.x: icmp_seq=12 ttl=254 time=0.566 ms
64 bytes from 212.x.x.x: icmp_seq=13 ttl=254 time=0.560 ms
64 bytes from 212.x.x.x: icmp_seq=14 ttl=254 time=0.578 ms
64 bytes from 212.x.x.x: icmp_seq=15 ttl=254 time=0.628 ms
64 bytes from 212.x.x.x: icmp_seq=16 ttl=254 time=0.515 ms
64 bytes from 212.x.x.x: icmp_seq=17 ttl=254 time=0.555 ms
64 bytes from 212.x.x.x: icmp_seq=18 ttl=254 time=0.438 ms
64 bytes from 212.x.x.x: icmp_seq=19 ttl=254 time=0.616 ms
64 bytes from 212.x.x.x: icmp_seq=20 ttl=254 time=0.511 ms
64 bytes from 212.x.x.x: icmp_seq=21 ttl=254 time=0.492 ms
64 bytes from 212.x.x.x: icmp_seq=22 ttl=254 time=0.470 ms
64 bytes from 212.x.x.x: icmp_seq=23 ttl=254 time=0.530 ms
From 192.168.2.2 icmp_seq=24 Redirect Network(New nexthop: 192.168.2.1)

--- 212.x.x.x ping statistics ---
24 packets transmitted, 23 received, +1 errors, 4.16667% packet loss, time 23531ms
rtt min/avg/max/mdev = 0.432/0.545/0.704/0.065 ms

So if no -w parameter is set, ping runs and runs; it outputs the "Redirect Network" message but shows that it still received the reply to that ping. But when -w is set, ping outputs the "Redirect Network" message and then aborts with +1 errors and doesn't continue.
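
One way to tell such redirect notices apart from real reachability errors is to look at the "From ..." lines in the non-quiet ping output. The sketch below assumes the line format shown in the two transcripts above; `ICMP_NOTICE`, `icmp_notices` and `only_redirects` are hypothetical helper names, and the "Destination Host Unreachable" line in the usage part is an illustrative input, not taken from the transcripts.

```python
import re

# "From <source> icmp_seq=<n> <message>", as printed in the transcripts above.
ICMP_NOTICE = re.compile(
    r"^From (?P<source>\S+) icmp_seq=(?P<seq>\d+) (?P<message>.+)$"
)


def icmp_notices(ping_output):
    """Collect (source, seq, message) for every ICMP notice line."""
    notices = []
    for line in ping_output.splitlines():
        match = ICMP_NOTICE.match(line.strip())
        if match:
            notices.append((match["source"], int(match["seq"]), match["message"]))
    return notices


def only_redirects(notices):
    """True if there is at least one notice and all of them are redirects."""
    return bool(notices) and all(msg.startswith("Redirect") for _, _, msg in notices)


sample = (
    "64 bytes from 212.x.x.x: icmp_seq=23 ttl=254 time=0.530 ms\n"
    "From 192.168.2.2 icmp_seq=24 Redirect Network(New nexthop: 192.168.2.1)\n"
)
notices = icmp_notices(sample)
print(notices, only_redirects(notices))

# Illustrative counter-example: an unreachable notice is not a redirect.
unreachable = icmp_notices("From 192.168.1.201 icmp_seq=3 Destination Host Unreachable\n")
print(only_redirects(unreachable))
```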

Over the last few days I checked what causes this message and why ping has a problem with it, because all other services have no problems and nothing has been changed in the network setup since last year. Using 192.168.1.1 as the obfuscated IP was not such a good idea for representing our network setup; I thought I would keep it simple. So I changed it a bit in the output above to show that a "public" IP is meant: the public IP of the modem that connects to the VDSL line.

192.168.2.2 is a Cisco switch that acts as the gateway and has different routes set up for different subnets, because we have multiple firewalls here for different purposes.

The "Redirect Network" message, as I read in some docs, indicates that some asymmetric routing is happening, but I don't see anything in that direction in the setup (it's a bit complex, but I checked it over the last few days).

At the moment the only possible solution I see is to switch to fping for testing, because it looks like that command doesn't have the same issue with ICMP redirects as the normal ping under Linux (tested with CentOS, AlmaLinux and Ubuntu).

markuslf commented 1 year ago

I will try to reproduce this.

markuslf commented 10 months ago

We don't rely on ping's return code any longer, making this check-plugin more tolerant.
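
As an illustration only, and not the plugin's actual implementation, a check that ignores ping's return code could derive its state from the parsed counters instead. The sketch assumes the usual Nagios state numbers (0 = OK, 2 = CRITICAL) and the "at least one reply is enough" tolerance described earlier in this thread; `state_from_counts` is a hypothetical name.

```python
OK, CRITICAL = 0, 2  # conventional Nagios plugin states


def state_from_counts(transmitted, received):
    """Critical only if nothing was sent or nothing came back."""
    if transmitted == 0 or received == 0:
        return CRITICAL
    return OK


# Counters from the runs quoted in this thread.
print(state_from_counts(5, 4))  # plugin call with "+1 errors"
print(state_from_counts(3, 2))  # native ping run that exited with 1
print(state_from_counts(5, 0))  # hypothetical run without any reply
```

With this rule, both failing runs from the thread would count as reachable, because replies were received despite the reported error.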