RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1
4.94k stars 1.99k forks source link

shell/ping6: Incorrect handling of unexpected pongs #11519

Closed maribu closed 5 years ago

maribu commented 5 years ago

Description

ping6 does check received pongs properly

Steps to reproduce the issue

Expected results

Actual results

E.g. see the stats here:

> 2019-05-13 17:32:42,837 - INFO #  ping6 fe80::1114:7221:1336:102a
2019-05-13 17:32:42,845 - INFO # 12 bytes from fe80::1114:7221:1336:102a: icmp_seq=1 ttl=64 rssi=-61 dBm time=32.115 ms
2019-05-13 17:32:42,852 - INFO # 12 bytes from fe80::1114:7221:1336:102a: icmp_seq=0 ttl=64 rssi=-67 dBm time=7.957 ms
2019-05-13 17:32:42,861 - INFO # 12 bytes from fe80::1114:7221:1336:102a: icmp_seq=0 ttl=64 rssi=-61 dBm time=15.542 ms (DUP!)
2019-05-13 17:32:42,869 - INFO # 12 bytes from fe80::1114:7221:1336:102a: icmp_seq=0 ttl=64 rssi=-62 dBm time=23.820 ms (DUP!)
2019-05-13 17:32:42,877 - INFO # 12 bytes from fe80::1114:7221:1336:102a: icmp_seq=0 ttl=64 rssi=-62 dBm time=32.099 ms (DUP!)
2019-05-13 17:32:43,854 - INFO # 12 bytes from fe80::1114:7221:1336:102a: icmp_seq=1 ttl=64 rssi=-77 dBm time=9.959 ms
2019-05-13 17:32:43,855 - INFO # 
2019-05-13 17:32:43,859 - INFO # --- fe80::1114:7221:1336:102a PING statistics ---
2019-05-13 17:32:43,866 - INFO # 2 packets transmitted, 3 packets received, 3 duplicates, 2147483598% packet loss
2019-05-13 17:32:43,870 - INFO # round-trip min/avg/max = 7.957/20.248/32.115 ms

Versions

(At least) current master

maribu commented 5 years ago

@miri64: Care to take a look?

miri64 commented 5 years ago

Will do, but only after this week at the earliest. If it is pressing, feel free to fix it earlier.

benemorius commented 5 years ago

There is code to prevent this happening but it relies on id being unique per each execution of ping6.

It's generating id with luid_custom() which does not return unique values: https://github.com/RIOT-OS/RIOT/blob/6e9612f1d1e41ef56b79c1fcddc3d7e3aae2b9d7/sys/shell/commands/sc_gnrc_icmpv6_echo.c#L244

I replaced luid_custom() with luid_get() and it appears to have resolved the problem. Of course this still doesn't handle the case where MODULE_LUID is not available.

miri64 commented 5 years ago

Mh... I don't really get the problem. If there are duplicates ping (even the iputils implementation) shows them. If you spoof echo replies they are shown as well, but there is no way to distinguish spoofs from real if you use the correct addresses.

  • On A: ping6 node B and immediately afterwards ping6 some unreachable/unassigned IPv6 address

What does the latter accomplish? I should not get any responses, as there is no node to send them.

maribu commented 5 years ago
  • On A: ping6 node B and immediately afterwards ping6 some unreachable/unassigned IPv6 address

What does the latter accomplish? I should not get any responses, as there is no node to send them.

If B sends duplicates for some extended time, and A stops pinging B but starts pinging unavailable node C, A will receive the delayed responses of B. But as the sender address is not matching C, A should ignore the replies (or give a debug message). ping6 however takes the delayed responses into account and will stop pinging when 3 responses (regardless of sender address) are receiced. If it only send 2 requests in that time, it will also print 133% success ratio...

miri64 commented 5 years ago

If B sends duplicates for some extended time, and A stops pinging B but starts pinging unavailable node C, A will receive the delayed responses of B. But as the sender address is not matching C, A should ignore the replies (or give a debug message). ping6 however takes the delayed responses into account and will stop pinging when 3 responses (regardless of sender address) are receiced. If it only send 2 requests in that time, it will also print 133% success ratio...

Ahhh, so the sender address not matching is the problem?

miri64 commented 5 years ago

@maribu #11933 should fix the issue.

benemorius commented 5 years ago

@maribu #11933 should fix the issue.

It should fix 2 but it wouldn't fix 1:

1) On A: ping6 node B more than once 1) On A: ping6 node B and immediately afterwards ping6 some unreachable/unassigned IPv6 address

miri64 commented 5 years ago

2. On A: ping6 node B and immediately afterwards ping6 some unreachable/unassigned IPv6 address

Ok, but then I don't understand what is happening. If I ping an unreachable address, there should be no reply. As we now with #11933 check the source of the packet against the pinged address the second case should also be fixed. Setting the ID to a more node-specific one is in my eyes only patch-work to fix a sympton, but not the underlying issue (if two nodes have the same ID by chance the problem would still exist)

miri64 commented 5 years ago

I mean the implementation of our ping is largely based on the inetutil one and the original ping implementation (they just use the PID of the process) and they don't have these kind of problems.

miri64 commented 5 years ago

@maribu can you supply with the code of your modified node B?

benemorius commented 5 years ago

Yes the second case (pinging two different nodes) is fixed by source address checking, but not the first case (pinging the same node twice).

I don't actually know what is the intended use of id or whether it's supposed to be unique per node or unique per invocation. But if we make it unique per invocation then it helps solve case 1:

benemorius commented 5 years ago

they don't have these kind of problems

I agree and I think maybe it's worth noting that delayed pongs are required to trigger this issue, and Riot is probably the first OS I've encountered so many delayed pongs with, and that's maybe why it's an issue for us and not others. Usually it's just due to xtimer issues or other highly experimental stuff. In most normal cases and for most normal OSs I think it's likely that delayed pongs are so uncommon that this would be a nonissue.

miri64 commented 5 years ago

Ok, sorry then I misunderstood you. id should change "enough" to be distinguishable between invocations. As far as I understand luid_custom() this should be the case, but luid might be wrong there. I'll investigate.

miri64 commented 5 years ago

Mhhh indeed luid_custom() does not do what I thought it does. However, I think I go for a different approach then you propose, since a different ID is also not guaranteed if luid is not provided, so I think we need a global counter.

miri64 commented 5 years ago

(will provide a separate fix tomorrow)

maribu commented 5 years ago

@maribu can you supply with the code of your modified node B?

I used unmodified RIOT master with one node using an at86rf2xx (802.15.4 breakout header for RPi by OpenLabs) and a nrf52840-dk. I did not investigate further yet what the problem is yet, so I cannot provide any details.

miri64 commented 5 years ago

But your steps to reproduce say

  • Modify device B to send multiple responses over some extended period of time (e.g. seconds)
miri64 commented 5 years ago

(will provide a separate fix tomorrow)

See https://github.com/RIOT-OS/RIOT/pull/11938

maribu commented 5 years ago

But your steps to reproduce say

  • Modify device B to send multiple responses over some extended period of time (e.g. seconds)

Yes, and this is how I would reproduce the issue in the most generalized and controlled way ;-) I could provide a small program to trigger the issue using e.g. Linux RAW sockets. It would require node B to be a Linux host and node A to use e.g. Ethernet. But I won't have time for that until next week.

benemorius commented 5 years ago

I think it should be possible to use a debugger to halt the CPU either during or after a ping reception and then unhalt it to release the pong at a controlled time.

miri64 commented 5 years ago

I think it should be possible to use a debugger to halt the CPU either during or after a ping reception and then unhalt it to release the pong at a controlled time.

I did not test it that way, but whoever reviews #11938 could do so :-)

miri64 commented 5 years ago

Apparently, just rewording the OP of the PR was not enough.

miri64 commented 5 years ago

@benemorius @maribu please check if this is still an issue with #11933 and #11938 being merged.

maribu commented 5 years ago

Will do in a few hours hopefully. (Sorry for being so unresponsive. This week is super busy :-/)

Thanks for your fixes!

maribu commented 5 years ago

With current RIOT master sending pings using an nrf52840dk to a Nucleo-F767ZI (with disabled stm32-eth) supplied with an AT86RF233 (OpenLabs RPi 802.15.4 breakout board):

``` > ping6 fe80::1114:7221:1327:102a 2019-08-09 22:22:58,708 - INFO # ping6 fe80::1114:7221:1327:102a 2019-08-09 22:22:58,730 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-45 dBm time=14.173 ms 2019-08-09 22:22:58,738 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-41 dBm time=21.842 ms (DUP!) 2019-08-09 22:22:58,746 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-41 dBm time=30.122 ms (DUP!) 2019-08-09 22:22:58,754 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-41 dBm time=38.404 ms (DUP!) 2019-08-09 22:22:59,731 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=1 ttl=64 rssi=-45 dBm time=15.761 ms 2019-08-09 22:22:59,739 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=1 ttl=64 rssi=-41 dBm time=23.434 ms (DUP!) 2019-08-09 22:22:59,747 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=1 ttl=64 rssi=-41 dBm time=31.714 ms (DUP!) 2019-08-09 22:22:59,756 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=1 ttl=64 rssi=-41 dBm time=39.995 ms (DUP!) 2019-08-09 22:23:00,729 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=2 ttl=64 rssi=-45 dBm time=14.172 ms 2019-08-09 22:23:00,730 - INFO # 2019-08-09 22:23:00,734 - INFO # --- fe80::1114:7221:1327:102a PING statistics --- 2019-08-09 22:23:00,740 - INFO # 3 packets transmitted, 3 packets received, 6 duplicates, 0% packet loss 2019-08-09 22:23:00,745 - INFO # round-trip min/avg/max = 14.172/25.513/39.995 ms > ping6 fe80::1114:7221:1327:102a 2019-08-09 22:23:03,651 - INFO # ping6 fe80::1114:7221:1327:102a 2019-08-09 22:23:03,659 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=2 ttl=64 rssi=-40 dBm time=2943.920 ms 2019-08-09 22:23:03,673 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-46 dBm time=14.483 ms 2019-08-09 22:23:03,681 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-42 dBm time=22.153 ms (DUP!) 2019-08-09 22:23:03,690 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-42 dBm time=30.433 ms (DUP!) 2019-08-09 22:23:03,698 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=0 ttl=64 rssi=-42 dBm time=38.713 ms (DUP!) 2019-08-09 22:23:04,673 - INFO # 12 bytes from fe80::1114:7221:1327:102a: icmp_seq=1 ttl=64 rssi=-46 dBm time=14.481 ms 2019-08-09 22:23:04,674 - INFO # 2019-08-09 22:23:04,678 - INFO # --- fe80::1114:7221:1327:102a PING statistics --- 2019-08-09 22:23:04,685 - INFO # 2 packets transmitted, 3 packets received, 3 duplicates, 2147483598% packet loss 2019-08-09 22:23:04,690 - INFO # round-trip min/avg/max = 14.481/510.697/2943.920 ms ```

(Notice how ping6 gets confused on the second run.)

Output of the Nucleo (debug in the AT86RF233 driver enabled):

``` 2019-08-09 22:22:21,710 - INFO # �pan0: 35, pan1: 0 2019-08-09 22:22:21,712 - INFO # set option 128 to 1 2019-08-09 22:22:21,715 - INFO # [at86rf2xx] opt: enable auto ACKs 2019-08-09 22:22:21,718 - INFO # set option 16 to 1 2019-08-09 22:22:21,723 - INFO # [at86rf2xx] opt: enabling CSMA mode(4 retries, min BE: 3 max BE: 5) 2019-08-09 22:22:21,727 - INFO # [at86rf2xx] opt: Set CSMA seed to 0x13 0x14 2019-08-09 22:22:21,730 - INFO # [at86rf2xx] opt: Set CSMA retries to 4 2019-08-09 22:22:21,734 - INFO # [at86rf2xx] opt: Set min BE=3, max BE=5 2019-08-09 22:22:21,737 - INFO # at86rf2xx_reset(): reset complete. 2019-08-09 22:22:21,738 - INFO # set option 8 to 1 2019-08-09 22:22:21,740 - INFO # set option 2 to 1 2019-08-09 22:22:21,746 - INFO # [at86rf2xx] EVT - TX_END 2019-08-09 22:22:21,748 - INFO # [at86rf2xx] TX SUCCESS 2019-08-09 22:22:21,751 - INFO # [at86rf2xx] return to idle state 0x16 2019-08-09 22:22:21,754 - INFO # [at86rf2xx] EVT - TX_END 2019-08-09 22:22:21,756 - INFO # [at86rf2xx] TX SUCCESS 2019-08-09 22:22:21,763 - INFO # main(): This is RIOT! (Version: 2019[at86rf2xx] return to idle state 0x16 2019-08-09 22:22:21,765 - INFO # [at86rf2xx] EVT - TX_END 2019-08-09 22:22:21,767 - INFO # [at86rf2xx] TX SUCCESS 2019-08-09 22:22:21,768 - INFO # . 2019-08-09 22:22:21,769 - INFO # 0-devel-369-gb48afd) 2019-08-09 22:22:21,773 - INFO # RIOT network stack example application 2019-08-09 22:22:21,775 - INFO # All up, running the shell now > 2019-08-09 22:22:29,388 - INFO # [at86rf2xx] EVT - RX_END 2019-08-09 22:22:29,396 - INFO # [at86rf2xx] LQI:255 high is good, RSSI:-52 high is either good ortoo much interference. 2019-08-09 22:22:31,764 - INFO # [at86rf2xx] return to idle state 0x16 2019-08-09 22:22:31,766 - INFO # [at86rf2xx] EVT - TX_END 2019-08-09 22:22:31,768 - INFO # [at86rf2xx] TX SUCCESS 2019-08-09 22:22:39,388 - INFO # [at86rf2xx] EVT - RX_END 2019-08-09 22:22:39,396 - INFO # [at86rf2xx] LQI:255 high is good, RSSI:-49 high is either good ortoo much interference. 2019-08-09 22:22:41,764 - INFO # [at86rf2xx] return to idle state 0x16 2019-08-09 22:22:41,766 - INFO # [at86rf2xx] EVT - TX_END 2019-08-09 22:22:41,769 - INFO # [at86rf2xx] TX SUCCESS 2019-08-09 22:22:53,388 - INFO # [at86rf2xx] EVT - RX_END 2019-08-09 22:22:53,396 - INFO # [at86rf2xx] LQI:255 high is good, RSSI:-53 high is either good ortoo much interference. 2019-08-09 22:22:55,763 - INFO # [at86rf2xx] return to idle state 0x16 2019-08-09 22:22:55,766 - INFO # [at86rf2xx] EVT - TX_END 2019-08-09 22:22:55,768 - INFO # [at86rf2xx] TX SUCCESS 2019-08-09 22:22:58,712 - INFO # [at86rf2xx] EVT - RX_END 2019-08-09 22:22:58,719 - INFO # [at86rf2xx] LQI:255 high is good, RSSI:-51 high is either good ortoo much interference. 2019-08-09 22:22:58,735 - INFO # [at86rf2xx] return to idle state 0x16 2019-08-09 22:22:58,737 - INFO # [at86rf2xx] EVT - TX_END ```

The potential "driver issue" was as simple to solve as running the following to commands:

> ifconfig 7 -ack_req
> ifconfig 7 -autoack

on the Nucleo with the AT86RF233.

benemorius commented 5 years ago

Sorry I didn't manage to test any of that until now. My result agrees with @maribu: id has become unique per invocation now but the /* not our ping */ check fails to reject it.

It looks like #11938 worked correctly to reject same-node pongs with different ids but then #11933 broke the logic with a typo that only rejects a pong if it has both a wrong id and a wrong address: https://github.com/RIOT-OS/RIOT/blob/4a723fe4fd7a8f3d8a3376ad9a4bbabf396d11be/sys/shell/commands/sc_gnrc_icmpv6_echo.c#L341

On master if I change && to || I get the expected result that same-node pongs and wrong source address pongs are ignored.

To test this in a controlled setting I captured and injected ping replies with a Linux node connected via a border router to a Riot node using this procedure: 1) Prepare Linux node to capture its own outgoing ping reply: tcpdump -i eth0 "icmp6 && ip6[40] == 129" -n -c1 -w icmp-reply.pcap 1) Use the border router to block the genuine ping reply from reaching the Riot node: ip6tables -I FORWARD -o lowpan0 -p icmpv6 --icmpv6-type echo-reply -j DROP 1) Ping Linux node from Riot node using 1 ping and a 60 second timeout: ping6 2001:470:4bb0:ffff::1 -c 1 -W 60000 1) On Linux node modify the source address of the captured packet: tcprewrite --infile=icmp-reply.pcap --outfile=icmp-reply2.pcap --pnat=[2001:470:4bb0:ffff::1]:[2001:470:4bb0:ffff::2] 1) On border router unblock ping replies: ip6tables -D FORWARD -o lowpan0 -p icmpv6 --icmpv6-type echo-reply -j DROP 1) On Linux node send modified ping reply: tcpreplay --intf1=eth0 icmp-reply2.pcap 1) Observe whether the Riot node prints the reply. It should not. 12 bytes from 2001:470:4bb0:ffff::2 id:0xbe45/0xbe45 icmp_seq=0 ttl=63 rssi=-85 dBm time=13823.374 ms 1) As before, block ping replies at the border router: ip6tables -I FORWARD -o lowpan0 -p icmpv6 --icmpv6-type echo-reply -j DROP 1) As before, send a ping from the Riot node: ping6 2001:470:4bb0:ffff::1 -c 1 -W 60000 1) As before, unblock ping replies: ip6tables -D FORWARD -o lowpan0 -p icmpv6 --icmpv6-type echo-reply -j DROP 1) From the Linux node send the unmodified ping reply captured earlier: tcpreplay --intf1=eth0 icmp-reply.pcap 1) Observe whether the Riot node prints the reply. It should not. 12 bytes from 2001:470:4bb0:ffff::1 id:0xbe45/0x2e08 icmp_seq=0 ttl=63 rssi=-86 dBm time=52156.860 ms

At 7 and 12 Riot should not print the ping reply. But on master it prints both 7 and 12. After changing && to || on master, 7 and 12 both stop printing.

fjmolinas commented 5 years ago

@maribu @benemorius Thanks for testing, it seems this was some wrongful due diligence on my side when testing #11933, I was also able to reproduce the findings and test the proposed fix. Do you wan't to open the PR yourself @miri64?

maribu commented 5 years ago

Has someone already opened a PR to change the && to ||?

fjmolinas commented 5 years ago

@maribu Im AFK for the next two weeks, do you think you or @benemorius can open the PR and review it?

maribu commented 5 years ago

@benemorius: Mind to open the PR? I'll test and review then.