False positives when the pinging machine receives ARP requests at the same time

martinvonwittich commented 5 years ago

We use arping on customer systems to ensure that the server IP isn't used by any other devices in the LAN. The command we use looks like this:

arping -r -c1 -C2 -w20000 -i INTERFACE IP

On one customer system, we've encountered a false positive - arping claims that the IP is used by the server itself:

server ~ # arping -r -c1 -C2 -w20000 -i eno1 192.168.67.2   
ac:1f:6b:79:04:0c
server ~ # ifconfig eno1
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.67.2  netmask 255.255.255.0  broadcast 192.168.67.255
        inet6 fe80::ae1f:6bff:fe79:40c  prefixlen 64  scopeid 0x20<link>
        ether ac:1f:6b:79:04:0c  txqueuelen 1000  (Ethernet)
        RX packets 24071813  bytes 32854407085 (30.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13176299  bytes 1780223344 (1.6 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0xc7200000-c727ffff

(IMO this shouldn't happen, because the server doesn't actually respond to its own request, as you can see in the tshark output below.)

With tshark I was able to figure out that whenever the server sends an ARP lookup for its own IP, the DSL router automatically responds with its own ARP lookup for the server's IP, which the server then responds to:

server ~ # tshark -i eno1 -f 'arp' -n                                                    
Running as user "root" and group "root". This could be dangerous.                                        
tshark: Lua: Error during loading:                                                                    
 /usr/share/wireshark/init.lua:32: dofile has been disabled due to running Wireshark as superuser. See https://wiki.wireshark.org/CaptureSetup/CapturePrivileges for help in running Wireshark as an unprivileged user.
Capturing on 'eno1'                                                                                      
    1 0.000000000 ac:1f:6b:79:04:0c → ff:ff:ff:ff:ff:ff ARP 42 Gratuitous ARP for 192.168.67.2 (Request)
    2 0.000767760 20:f3:a3:80:2d:ad → ff:ff:ff:ff:ff:ff ARP 60 Who has 192.168.67.2? Tell 192.168.67.1   
    3 0.000778253 ac:1f:6b:79:04:0c → 20:f3:a3:80:2d:ad ARP 42 192.168.67.2 is at ac:1f:6b:79:04:0c

arping is apparently confused by this and believes that the response (frame 3) to the DSL router's request (frame 2) is actually a response to its own request (frame 1).
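The mix-up can be modelled in a few lines. This is a minimal sketch, not arping's actual code: `ArpFrame` and `accepts_without_dst_check` are hypothetical names, and the acceptance rule (any ARP reply whose sender IP equals the pinged IP counts as a hit) is an assumption inferred from the behaviour described above. The three frames are the ones from the tshark capture.

```python
from dataclasses import dataclass

BROADCAST = "ff:ff:ff:ff:ff:ff"
SERVER_MAC = "ac:1f:6b:79:04:0c"
ROUTER_MAC = "20:f3:a3:80:2d:ad"


@dataclass(frozen=True)
class ArpFrame:
    src_mac: str     # Ethernet source address
    dst_mac: str     # Ethernet destination address
    is_reply: bool   # True for an ARP reply, False for a request
    sender_ip: str   # ARP sender protocol address
    target_ip: str   # ARP target protocol address


# Frames 1 to 3 of the capture, in order.
CAPTURE = [
    ArpFrame(SERVER_MAC, BROADCAST, False, "192.168.67.2", "192.168.67.2"),
    ArpFrame(ROUTER_MAC, BROADCAST, False, "192.168.67.1", "192.168.67.2"),
    ArpFrame(SERVER_MAC, ROUTER_MAC, True, "192.168.67.2", "192.168.67.1"),
]


def accepts_without_dst_check(frame, pinged_ip):
    """Assumed rule: any reply claiming pinged_ip counts, whoever it was sent to."""
    return frame.is_reply and frame.sender_ip == pinged_ip


# Frame numbers (1-based, as in tshark) that would be counted as a response.
hits = [number for number, frame in enumerate(CAPTURE, start=1)
        if accepts_without_dst_check(frame, "192.168.67.2")]
print(hits)  # prints [3]
```

Frame 3 is counted even though its Ethernet destination is the router, not the server that sent the original request.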

This problem is easily reproducible by having one arping instance ping its own server, and then another arping instance on another server pinging the first server. For example, when I run this command on my test server to ping itself, it doesn't get any responses (as expected):

martin ~/arping/src (arping-2.x) # ifconfig enp1s0
enp1s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.56.10  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 2003:a:422:3b00:56::10  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::acff:fe11:fb18  prefixlen 64  scopeid 0x20<link>
        ether 02:00:ac:11:fb:18  txqueuelen 1000  (Ethernet)
        RX packets 1796471  bytes 966662387 (921.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1290814  bytes 1470807952 (1.3 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

martin ~/arping/src (arping-2.x) # arping -i enp1s0 172.17.56.10
ARPING 172.17.56.10
Timeout
Timeout
Timeout

But when I then run the following command on another server to ping my server:

another-server ~ # arping -i eth1 172.17.56.10
ARPING 172.17.56.10
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=0 time=14.506 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=1 time=5.474 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=2 time=636.879 usec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=3 time=7.839 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=4 time=15.339 msec
^C
--- 172.17.56.10 statistics ---
5 packets transmitted, 5 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.637/8.759/15.339/5.549 ms

Then suddenly the arping on my server shows responses:

martin ~/arping/src (arping-2.x) # arping -i enp1s0 172.17.56.10
ARPING 172.17.56.10
Timeout
Timeout
Timeout
Timeout
Timeout
Timeout
Timeout
Timeout
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=0 time=739.215 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=1 time=750.790 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=2 time=746.490 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=3 time=738.272 msec
42 bytes from 02:00:ac:11:fb:18 (172.17.56.10): index=4 time=745.738 msec
Timeout
Timeout
^C
--- 172.17.56.10 statistics ---
16 packets transmitted, 5 packets received,  69% unanswered (0 extra)
rtt min/avg/max/std-dev = 738.272/744.101/750.790/4.711 ms

ThomasHabets commented 5 years ago

Thank you for a very thorough bug report!

First of all I should ask why you say this is a false positive. Isn't it a false negative? The IP address is being used, just by the pinging server as a special case. It becomes a false negative unless another machine on the network sends ARP requests (and thus gets replies), right?

The behaviour you see occurs because, as you may already know, arping doesn't check the source MAC of the reply. That check would have been around here.

Now for the question: Should it? Maybe yes. If it does, then it'll be unaffected by other machines' ARP requests, and I agree this is probably the expected result. Maybe accept either the requestor's MAC or the broadcast address.

Normally that should already be the behavior because promiscuous mode is off by default (-p option), but there appears to be a special case when it's the local host sending it out.

But yes, it does seem like checking that the dst mac is the same as the outgoing request mac is the right thing to do.

I'll be a bit busy over the next few weeks, but I'll get to fixing this. (Pull requests are also welcome. Defaulting to the new behaviour is fine, but I'll want a flag that lets it continue to accept any reply. Or maybe the existing -p is enough.)
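To make the proposed check concrete, here is a rough sketch of a destination-MAC filter on a raw 42-byte Ethernet + ARP frame. It is an illustration under stated assumptions, not arping's implementation: `is_reply_to_us`, `build_arp`, `mac` and the `accept_any` parameter are hypothetical names, with `accept_any` standing in for the "accept any reply" flag mentioned above. The field offsets follow the standard ARP layout for IPv4 over Ethernet.

```python
import socket
import struct

ETH_P_ARP = 0x0806          # EtherType for ARP
ARP_REPLY = 2               # ARP opcode for a reply
BROADCAST = b"\xff" * 6


def mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(text.replace(":", ""))


def build_arp(dst, src, oper, sha, spa, tha, tpa):
    """Build a 42-byte Ethernet + ARP frame for IPv4 over Ethernet."""
    eth = struct.pack("!6s6sH", dst, src, ETH_P_ARP)
    arp = struct.pack("!HHBBH6s4s6s4s", 1, 0x0800, 6, 4, oper,
                      sha, socket.inet_aton(spa), tha, socket.inet_aton(tpa))
    return eth + arp


def is_reply_to_us(frame, our_mac, pinged_ip, accept_any=False):
    """Hypothetical filter: count a reply only if it was addressed to us."""
    if len(frame) < 42:
        return False
    dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != ETH_P_ARP:
        return False
    # Opcode, sender MAC and sender IP sit at bytes 20..31 of the frame.
    oper, _sha, spa = struct.unpack("!H6s4s", frame[20:32])
    if oper != ARP_REPLY or spa != socket.inet_aton(pinged_ip):
        return False
    if accept_any:              # old behaviour, kept behind a flag
        return True
    return dst in (our_mac, BROADCAST)


server = mac("ac:1f:6b:79:04:0c")
router = mac("20:f3:a3:80:2d:ad")

# Frame 3 of the capture: the server answering the router's lookup.
frame3 = build_arp(router, server, ARP_REPLY,
                   server, "192.168.67.2", router, "192.168.67.1")

strict = is_reply_to_us(frame3, server, "192.168.67.2")
legacy = is_reply_to_us(frame3, server, "192.168.67.2", accept_any=True)
print(strict, legacy)  # prints False True
```

With the strict check, the reply addressed to the router is ignored, while a reply sent to the requesting MAC or to the broadcast address would still be counted.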

martinvonwittich commented 5 years ago

> First of all I should ask why you say this is a false positive. Isn't it a false negative? The IP address is being used, just by the pinging server as a special case. It becomes a false negative unless another machine on the network sends ARP requests (and thus gets replies), right?

Hmm, I'm not sure, but I would argue that it is a false positive. The normal behavior IMO would be:

arping sends out an ARP request:

test ~ # arping -r -c1 -C2 -w20000 -i enp1s0 172.17.56.10
test ~ #

The ARP request is sent out to the network:

test ~ # tshark -i enp1s0 -f arp
Running as user "root" and group "root". This could be dangerous.
tshark: Lua: Error during loading:
 /usr/share/wireshark/init.lua:32: dofile has been disabled due to running Wireshark as superuser. See https://wiki.wireshark.org/CaptureSetup/CapturePrivileges for help in running Wireshark as an unprivileged user.
Capturing on 'enp1s0'
    1 0.000000000 02:00:ac:11:fb:18 → Broadcast    ARP 42 Gratuitous ARP for 172.17.56.10 (Request)
^C1 packet captured

As long as there is no network problem (e.g. a loop), this packet isn't received by the sending host itself, and so Linux won't respond to this request because it originates from the local host.

This is a true negative: there is no response, and therefore arping doesn't show any replies and exits with a non-zero exit code.

In the problem I've outlined above, there is no real response either. arping simply mistakes a response that the local host sent to another machine for a response to its own request, incorrectly reports it as a reply, and exits with exit code 0. I would therefore call this a false positive :)

> Now for the question: Should it? Maybe yes. If it does, then it'll be unaffected by other machines' ARP requests, and I agree this is probably the expected result.

I would also argue that this should be the default behavior, yes. I've considered suggesting a command-line switch to enable the new behavior, but the old behavior seems so wrong that I cannot imagine anyone would actually expect it :D

I've found mention of another customer server in our internal bug tracker, and there the problem is even more extreme. Apparently they have a lot of Cisco devices that immediately respond to ARP lookups from our server:

other-customer-server ~ # arping -c1 10.0.0.1
ARPING 10.0.0.1
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=0 time=11.446 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=1 time=11.463 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=2 time=11.469 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=3 time=11.474 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=4 time=11.478 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=5 time=11.483 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=6 time=11.487 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=7 time=11.495 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=8 time=11.500 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=9 time=11.506 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=10 time=11.510 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=11 time=11.515 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=12 time=11.522 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=13 time=11.527 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=14 time=11.532 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=15 time=11.537 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=16 time=11.543 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=17 time=11.548 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=18 time=11.553 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=19 time=11.558 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=20 time=11.563 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=21 time=11.568 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=22 time=11.572 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=23 time=11.577 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=24 time=11.584 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=25 time=11.589 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=26 time=11.594 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=27 time=11.599 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=28 time=11.604 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=29 time=11.609 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=30 time=11.614 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=31 time=11.619 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=32 time=11.624 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=33 time=11.628 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=34 time=11.633 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=35 time=11.638 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=36 time=11.643 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=37 time=11.648 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=38 time=11.653 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=39 time=11.658 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=40 time=11.663 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=41 time=11.668 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=42 time=11.673 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=43 time=11.678 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=44 time=11.682 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=45 time=11.687 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=46 time=11.693 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=47 time=11.698 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=48 time=11.702 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=49 time=11.708 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=50 time=11.713 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=51 time=11.718 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=52 time=27.428 msec
42 bytes from 4c:ed:fb:91:2a:c6 (10.0.0.1): index=53 time=523.472 msec

--- 10.0.0.1 statistics ---
1 packets transmitted, 54 packets received,   0% unanswered (53 extra)
rtt min/avg/max/std-dev = 11.446/21.362/523.472/69.003 ms
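The summary line makes the symptom easy to spot: one request went out, yet 54 replies were counted. The sketch below reproduces the figures of the three summaries shown in this thread. The formula is inferred from those outputs rather than taken from arping's source, and `summarize` is a hypothetical helper.

```python
def summarize(sent, received):
    """Return (unanswered_percent, extra) for a run, as whole numbers."""
    answered = min(sent, received)        # replies that match a request
    extra = max(0, received - sent)       # replies beyond one per request
    unanswered = round(100 * (sent - answered) / sent)
    return unanswered, extra


print(summarize(5, 5))    # prints (0, 0)
print(summarize(16, 5))   # prints (69, 0)
print(summarize(1, 54))   # prints (0, 53)
```

A non-zero "extra" count on a `-c1` run is a strong hint that replies meant for other hosts are being counted.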

> I'll be a bit busy over the next few weeks, but I'll get to fixing this.

Take your time, it's not really that significant of a problem. I've told our developers that they should probably just filter the server's own IPs from the response; that should solve the problem for us.
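One possible reading of this workaround, sketched under assumptions: the comment speaks of filtering the server's own IPs, whereas this sketch filters by MAC, because `arping -r` prints one responder MAC per line in the first example. The helper `foreign_responders` is a hypothetical name, and the list of local MACs is assumed to be supplied by the caller.

```python
def foreign_responders(raw_output, own_macs):
    """Return responder MACs from raw output that are not our own."""
    own = {m.lower() for m in own_macs}
    result = []
    for line in raw_output.splitlines():
        candidate = line.strip().lower()
        # Skip blank lines, our own addresses, and duplicates.
        if candidate and candidate not in own and candidate not in result:
            result.append(candidate)
    return result


# Output of the first example: only the server's own MAC was reported.
print(foreign_responders("ac:1f:6b:79:04:0c\n", ["ac:1f:6b:79:04:0c"]))  # prints []
```

An empty result would then mean that no other device answered for the IP, even if arping itself exited with code 0.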

ThomasHabets / arping

False positives when the pinging machine receives ARP requests at the same time #32